UB-Mannheim / ocr-gt-tools

Ergonomic line-by-line transcription of scanned text.
GNU Affero General Public License v3.0

Integration of a help page/popup (for beginners) #6

Closed LetschM closed 8 years ago

LetschM commented 8 years ago

It would be nice to have a help page or popup, because most of the texts include special characters such as diacritics, crosses, etc., which have to be typed manually via their Unicode code or copied and pasted from a character map (e.g. in OpenOffice). A small lookup section would save having to look up how to type those characters in secondary resources every time they occur. It could look something like the example below:

[Image: help-popup]

LetschM commented 8 years ago

I could write a short manual, which could then be integrated somehow.

zuphilip commented 8 years ago

I could also imagine a web keyboard with these special characters, overlaid on the page, e.g. in its lower part:

[Image: sonderzeichen]

If you click on the image of a special character, that character is inserted into the text box at the current cursor position.

Maybe with the possibility to switch the web-keyboard on and off.
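
As a rough illustration of the "insert where the cursor is" behaviour described above, here is a minimal sketch in Python. It models the text box as a plain string plus a cursor index; the function name `insert_at_cursor` is hypothetical and not part of ocr-gt-tools, and a real implementation would live in the browser front end.

```python
from typing import Tuple


def insert_at_cursor(text: str, cursor: int, char: str) -> Tuple[str, int]:
    """Insert `char` into `text` at index `cursor`.

    Returns the new text and the new cursor position, which sits
    directly after the inserted character(s).
    """
    # Clamp the cursor so out-of-range positions insert at the start or end.
    cursor = max(0, min(cursor, len(text)))
    new_text = text[:cursor] + char + text[cursor:]
    return new_text, cursor + len(char)


# Clicking "ö" on the web keyboard while the cursor is after the "K":
line, pos = insert_at_cursor("Knig", 1, "ö")
print(line, pos)  # König 2
```

Returning the updated cursor position lets several special characters be clicked in a row without repositioning the cursor by hand.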

kba commented 8 years ago

We're starting to collect the letters for the help in the wiki in a formalized way, so we can scrape that page and generate help pages/contextual help

@stweil @zuphilip Do the process and the variables I set up in the wiki there make sense to you?

zuphilip commented 8 years ago

Some remarks:

zuphilip commented 8 years ago

Actually, Base Letter is also not clear to me: "A List of related letters, available in ASCII or Unicode (e.g. the unaccented version)". Well, then I could just as well write Æ (Unicode) or A, E, AE (ASCII) directly. Moreover, it might be that a Unicode representation exists, e.g. ꝰ, but that it stands for something different, e.g. is, s.

IMO there are a few distinctions where it is important to say exactly what we are doing: a) using the ASCII alphabet, or b) using the Unicode alphabet. Then we can either 1) use exactly the same letter from this alphabet (we might train these special letters for an OCR engine), or 2) transcribe it with the usual letter(s) of this alphabet (which is what the final user would like to see in the end). I guess it makes sense to single out exactly two reasonable possibilities here: a2 and b1.

Thus instead of Transliteration and Base Letter, I propose something like

Moreover, Transcription should be required, but the representation might not exist. The hex code can be calculated from the representation and might not be needed separately.
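
To illustrate why a separate hex code field may be redundant, here is a minimal sketch in Python that derives the code points from the representation string. The function name `hex_codes` is hypothetical. It returns one entry per code point, so a multi-character representation such as ſs, or a letter stored as base letter plus combining mark, yields several codes.

```python
from typing import List


def hex_codes(representation: str) -> List[str]:
    """Return the Unicode code points of `representation` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in representation]


print(hex_codes("Æ"))   # ['U+00C6']
print(hex_codes("ſs"))  # ['U+017F', 'U+0073']
```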

LetschM commented 8 years ago

Am I right in saying that the transcription is only relevant for ligatures, since diacritics should always be kept as they are in the original? I added the Unicode hex codes as a variable because in some cases you may need to type the letters manually (e.g. anything with a trema other than ö, ä, ü; I don't know how to type an i with two dots on the keyboard). If we had something like a keyboard (as suggested above by Philipp) where we can just drag the letter into the text, then there would be no problem. But as long as there is no such function, some characters are best and fastest typed with Ctrl+Shift+U followed by the hex code (xxxx).

kba commented 8 years ago

About Base Letter: My idea was that, for diacritics (a with dots or accents) and scribal abbreviations (dashed q etc.), it's easy to input the letter these are based on (a, q) and then get a list of pertinent weird characters. On clicking one of those characters, the Transliteration is copied to the input in question.
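
The lookup idea could be sketched roughly as follows in Python. The entries, the field names (`glyph`, `base_letters`, `transliteration`) and the functions are all made up for illustration and are not the actual wiki schema. An explicit table is used because characters such as æ and ꝗ have no Unicode decomposition to a base letter, so the relation cannot simply be computed.

```python
from collections import defaultdict

# Hypothetical sample entries; the real data would be scraped from the wiki.
ENTRIES = [
    {"glyph": "ä", "base_letters": ["a"], "transliteration": "ä"},
    {"glyph": "æ", "base_letters": ["a", "e"], "transliteration": "ae"},
    {"glyph": "ꝗ", "base_letters": ["q"], "transliteration": "ꝗ"},
]


def build_index(entries):
    """Map each base letter (lower-cased) to the entries that list it."""
    index = defaultdict(list)
    for entry in entries:
        for base in entry["base_letters"]:
            index[base.lower()].append(entry)
    return index


def candidates(index, typed):
    """Return the glyphs to propose for a typed base letter."""
    return [entry["glyph"] for entry in index.get(typed.lower(), [])]


index = build_index(ENTRIES)
print(candidates(index, "a"))  # ['ä', 'æ']
print(candidates(index, "q"))  # ['ꝗ']
```

Once the transcriber picks one of the proposed glyphs, the `transliteration` value of that entry would be what gets copied into the input.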

zuphilip commented 8 years ago

... but Base Letter can currently also be Unicode (according to the explanation in the wiki) and therefore we can also use letters with diacritics directly.

kba commented 8 years ago

"The links to the pictures will afterwards change; it seems unstable to link to a branch which we may delete in the future."

This is not a problem; the reference to ocr-characters can be replaced with search and replace once the references get merged to master. The export script might pull the images as well.

"Base Letter can currently also be Unicode (according to the explanation in the wiki)"

Just to be clear here, I see two goals for this set: 1) Make it easy for the person transcribing a page to input weird glyphs. 2) Give the OCR engine the right transliteration to train on.

1) should focus on the needs of the transcribers, who for now use a German IBM 105-key keyboard layout. They have access to characters like Ä but not Æ. If they select A in an input box, both Ä and Æ should be proposed. Try searching for A on this page in Chrome to see what I mean; it will highlight both Ä and Æ.

2) should be the transliteration that we want the OCR engine to train on. In most cases this will be a 1:1 mapping to Unicode (Ä), in other cases it might be a 1:n mapping (splitting Æ to AE) or an n:1 mapping (ſs to ß).
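
The three kinds of mapping can be sketched with a small lookup table in Python. This is only an illustration built from the examples in this comment: the table `TRANSLITERATION` and the function `transliterate` are hypothetical names. Longer keys are tried first so that an n:1 sequence such as ſs is matched before any single character.

```python
# Example mappings taken from the comment above; not a complete table.
TRANSLITERATION = {
    "Ä": "Ä",    # 1:1, kept as the same Unicode letter
    "Æ": "AE",   # 1:n, one glyph split into two letters
    "ſs": "ß",   # n:1, two glyphs merged into one letter
}


def transliterate(text, table=TRANSLITERATION):
    """Replace every table key found in `text`, preferring longer keys."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches at this position: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("Æther"))  # AEther
print(transliterate("groſs"))  # groß
```

A long s that is not followed by s, as in "ſo", is left untouched because only the sequence ſs is in the table.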

"The hex code can be calculated ..."

The transliteration should be exactly what we want the OCR engine to train on. I see no reason to use hex codes here, unless we have problems with the wiki engine mangling Unicode or the like. Otherwise this should indeed be the exact string the OCR engine should recognize when it encounters the glyph.