ocrd resmgr same comment for 2 tesseract models

OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces

ocrd resmgr same comment for 2 tesseract models #210

jbarth-ubhd closed this issue 3 months ago

jbarth-ubhd commented 6 months ago

frak2021.traineddata (https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata) Tesseract LSTM model based on Austrian National Library newspaper data
ONB.traineddata (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata) Tesseract LSTM model based on Austrian National Library newspaper data

stweil commented 6 months ago

The 2nd comment is correct; the 1st comment is incomplete. While ONB used only the ground truth from Austrian Newspapers for training, frak2021 also used additional ground truth (GT4HistOCR and more).

In addition, frak2021 used a newer version of Austrian Newspapers, so the quality of its training data was better. Side note: german_print from 2023/2024 also uses a mix of ground truth data, with even more and newer data than frak2021.

I suggest updating the comment to "Tesseract LSTM model based on a mix of mostly German and Latin ground truth data".

stweil commented 6 months ago

The fix is required for https://github.com/OCR-D/ocrd_tesserocr/blob/master/ocrd_tesserocr/ocrd-tool.json. Therefore the issue should be moved to ocrd_tesserocr. @kba, I don't have the necessary rights to do that.
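For anyone who wants to check for other models that share a description, here is a minimal sketch. It assumes that ocrd-tool.json lists resources under `tools` → tool name → `resources`, each with a `name` and a `description`. The inline sample data and the function name `find_duplicate_descriptions` are hypothetical and only mirror the two entries reported above. The third entry is a made-up placeholder.

```python
from collections import defaultdict

# Hypothetical excerpt; the layout (tools -> <tool> -> resources) is an
# assumption about ocrd-tool.json, not a verified copy of the real file.
SAMPLE_TOOL_JSON = {
    "tools": {
        "ocrd-tesserocr-recognize": {
            "resources": [
                {
                    "name": "frak2021.traineddata",
                    "description": "Tesseract LSTM model based on Austrian National Library newspaper data",
                },
                {
                    "name": "ONB.traineddata",
                    "description": "Tesseract LSTM model based on Austrian National Library newspaper data",
                },
                {
                    # Made-up placeholder entry, for illustration only.
                    "name": "example.traineddata",
                    "description": "Placeholder description",
                },
            ]
        }
    }
}


def find_duplicate_descriptions(tool_json):
    """Map each description used by more than one resource to the resource names."""
    names_by_description = defaultdict(list)
    for tool in tool_json.get("tools", {}).values():
        for resource in tool.get("resources", []):
            names_by_description[resource.get("description", "")].append(
                resource.get("name")
            )
    return {
        description: names
        for description, names in names_by_description.items()
        if len(names) > 1
    }


duplicates = find_duplicate_descriptions(SAMPLE_TOOL_JSON)
for description, names in duplicates.items():
    print(f"{', '.join(names)}: {description}")
```

To run this against the real file, the sample dictionary could be replaced with the result of `json.load` on a local copy of ocrd-tool.json, provided the layout matches the assumption above.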

stweil commented 3 months ago

@bertsky, can you please transfer this issue (or give me the necessary rights for ocrd_tesserocr)?

bertsky commented 3 months ago

Fixed via 75a782d