We don't have enough training data for some classes, specifically 'greek', 'hebrew' and 'manuscript'. I will explain everything in the README. As for the classifier outputting fonts, we will train it again with more data to solve the problem (then it should output 'greek' instead of the incorrect 'italic'). However, the transcription results will probably not be ideal. `COCR` can in theory handle these classes, but gives poor results due to the lack of data (as you have seen). `SelOCR` will basically fall back to an OCR model trained on all the data we have available (all classes) when confronted with those lacking classes (which may still be better than the specialized italic model in the case you presented).
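To make that fallback concrete, here is a minimal Python sketch of the `SelOCR` dispatch idea (illustrative names only, not the processor's actual code):

```python
# Illustration of the SelOCR fallback described above;
# all names here are made up for clarity, not the real API.

# Classes we lack enough training data for to ship a reliable specialized model
LACKING_CLASSES = {"greek", "hebrew", "manuscript"}

def transcribe_selocr(line_image, font_class, specialized_models, generic_model):
    """Dispatch a text line to the OCR model specialized for the detected
    font class, but fall back to the model trained on all classes when the
    class is under-resourced (or unknown)."""
    if font_class in LACKING_CLASSES or font_class not in specialized_models:
        # The all-classes model is often better here than a mismatched
        # specialized model (e.g. 'italic' applied to Greek lines).
        return generic_model.transcribe(line_image)
    return specialized_models[font_class].transcribe(line_image)
```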
> We don't have enough training data for some classes, specifically 'greek', 'hebrew' and 'manuscript'.
Understood. So could #7 help here? (Even with better training data, there might always be cases where the user observes systematic suboptimal detection and has a priori knowledge to throw in...)
> However, the transcription results will probably not be ideal.
Yes, in general we might need to use `ocrd-typegroups-classifier` and combine it dynamically (in the workflow) with dedicated models from other OCR processors.
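For instance, something along these lines (an untested sketch: the processor names, `model` values, and the assumption that the classifier records its result as `TextStyle/@fontFamily` in the PAGE-XML would all need to be verified):

```python
# Untested sketch of dynamic per-page model selection in an OCR-D workflow.
# Assumes the font classification has been annotated as TextStyle/@fontFamily
# in the PAGE-XML (to be verified), and that the workspace and file groups
# below exist; processors and model names are placeholders.
import subprocess
from lxml import etree

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# detected font class -> (OCR-D processor, model parameter); placeholders
DISPATCH = {
    "greek":   ("ocrd-tesserocr-recognize", "grc"),
    "antiqua": ("ocrd-tesserocr-recognize", "Latin"),
}
FALLBACK = ("ocrd-tesserocr-recognize", "Fraktur+Latin")

def detected_font(page_xml):
    """Return the first fontFamily annotated anywhere in the PAGE file."""
    styles = etree.parse(page_xml).findall(".//pc:TextStyle[@fontFamily]", NS)
    return styles[0].get("fontFamily") if styles else None

def recognize_page(page_xml, page_id):
    """Run the OCR processor matching the detected font on a single page."""
    processor, model = DISPATCH.get(detected_font(page_xml), FALLBACK)
    subprocess.run([processor,
                    "-I", "OCR-D-SEG-LINE",  # input file group
                    "-O", "OCR-D-OCR",       # output file group
                    "-g", page_id,           # restrict to this page
                    "-P", "model", model],
                   check=True)
```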
It's possible that we will retrain the OCR models if we obtain more data for the lacking classes. If that is the case, I will update the processor.
I have some material with alternating lines of Latin in Antiqua and Old Greek (interlinear gloss) – the perfect test case IOW.
Unfortunately, the provided model systematically detects italic (with 100% confidence) where Greek should be.
So `adaptive` will always resort to the `SelOCR` results, which are wrong half of the time. And of course, when forcing `COCR` globally, because the OCR model does not have Greek trained into it, the results are not usable either.
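For reference, this is how I understand the `adaptive` decision (the threshold value below is made up) — which is why a misclassification at 100% confidence can never fall through to `COCR`:

```python
# Sketch of the 'adaptive' decision as I understand it; the threshold
# value is hypothetical. A wrong class reported at 100% confidence
# always takes the SelOCR branch and is never overruled.
CONFIDENCE_THRESHOLD = 0.5  # hypothetical value

def transcribe_adaptive(line_image, font_class, confidence, selocr, cocr):
    if confidence >= CONFIDENCE_THRESHOLD:
        # trusts the classifier -- including a confident misclassification
        return selocr(line_image, font_class)
    return cocr(line_image)
```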