We don't have enough training data for some classes, specifically 'greek', 'hebrew' and 'manuscript'. I will explain everything in the README. As for the classifier outputting fonts, we will train it again with more data to solve the problem (then it should output 'greek' instead of the incorrect 'italic'). However, the transcription results will probably not be ideal. `COCR` can in theory handle these classes, but gives poor results due to the lack of data (as you have seen). `SelOCR` will basically fall back to an OCR model trained on all the data we have available (all classes) when confronted with those lacking classes (which may still be better than the specialized italic model in the case you presented).
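To make that fallback concrete, here is a minimal Python sketch of the `SelOCR` dispatch idea (illustrative names only, not the processor's actual code):

```python
# Illustration of the SelOCR fallback described above;
# all names here are made up for clarity, not the real API.

# Classes we lack enough training data for to ship a reliable specialized model
LACKING_CLASSES = {"greek", "hebrew", "manuscript"}

def transcribe_selocr(line_image, font_class, specialized_models, generic_model):
    """Dispatch a text line to the OCR model specialized for the detected
    font class, but fall back to the model trained on all classes when the
    class is under-resourced (or unknown)."""
    if font_class in LACKING_CLASSES or font_class not in specialized_models:
        # The all-classes model is often better here than a mismatched
        # specialized model (e.g. 'italic' applied to Greek lines).
        return generic_model.transcribe(line_image)
    return specialized_models[font_class].transcribe(line_image)
```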
> We don't have enough training data for some classes, specifically 'greek', 'hebrew' and 'manuscript'.
Understood. So could #7 help here? (Even with better training data, there might always be cases where the user observes systematic suboptimal detection and has a priori knowledge to throw in...)
> However, the transcription results will probably not be ideal.
Yes, in general we might need to use `ocrd-typegroups-classifier` and combine it dynamically (in the workflow) with dedicated models from other OCR processors.
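For instance, something along these lines (an untested sketch: the processor names, `model` values, and the assumption that the classifier records its result as `TextStyle/@fontFamily` in the PAGE-XML would all need to be verified):

```python
# Untested sketch of dynamic per-page model selection in an OCR-D workflow.
# Assumes the font classification has been annotated as TextStyle/@fontFamily
# in the PAGE-XML (to be verified), and that the workspace and file groups
# below exist; processors and model names are placeholders.
import subprocess
from lxml import etree

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# detected font class -> (OCR-D processor, model parameter); placeholders
DISPATCH = {
    "greek":   ("ocrd-tesserocr-recognize", "grc"),
    "antiqua": ("ocrd-tesserocr-recognize", "Latin"),
}
FALLBACK = ("ocrd-tesserocr-recognize", "Fraktur+Latin")

def detected_font(page_xml):
    """Return the first fontFamily annotated anywhere in the PAGE file."""
    styles = etree.parse(page_xml).findall(".//pc:TextStyle[@fontFamily]", NS)
    return styles[0].get("fontFamily") if styles else None

def recognize_page(page_xml, page_id):
    """Run the OCR processor matching the detected font on a single page."""
    processor, model = DISPATCH.get(detected_font(page_xml), FALLBACK)
    subprocess.run([processor,
                    "-I", "OCR-D-SEG-LINE",  # input file group
                    "-O", "OCR-D-OCR",       # output file group
                    "-g", page_id,           # restrict to this page
                    "-P", "model", model],
                   check=True)
```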
It's possible that we will retrain the OCR models if we obtain more data for the lacking classes. If that is the case, I will update the processor.
I have some material with alternating lines of Latin in Antiqua and Old Greek (interlinear gloss) – the perfect test case IOW.
Unfortunately, the provided model systematically detects italic (with 100% confidence) where Greek should be.
So `adaptive` will always resort to the `SelOCR` results, which are wrong half of the time. And of course, when forcing `COCR` globally, because the OCR model does not have Greek trained into it, the results are not usable either.
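For reference, this is how I understand the `adaptive` decision (the threshold value below is made up) — which is why a misclassification at 100% confidence can never fall through to `COCR`:

```python
# Sketch of the 'adaptive' decision as I understand it; the threshold
# value is hypothetical. A wrong class reported at 100% confidence
# always takes the SelOCR branch and is never overruled.
CONFIDENCE_THRESHOLD = 0.5  # hypothetical value

def transcribe_adaptive(line_image, font_class, confidence, selocr, cocr):
    if confidence >= CONFIDENCE_THRESHOLD:
        # trusts the classifier -- including a confident misclassification
        return selocr(line_image, font_class)
    return cocr(line_image)
```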