Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
Apache License 2.0

calamari-predict truncates filename #308

Open stefanCCS opened 2 years ago

stefanCCS commented 2 years ago

If the image to be "OCRed" has more than one '.' in its filename, part of the resulting filename is truncated. E.g.: something.else.png --> something.pred.txt instead of something.else.pred.txt
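The reported output is consistent with the basename being cut at its first dot rather than its last. The sketch below only illustrates that difference; the helper names are hypothetical and this is not Calamari's actual code.

```python
import os


def stem_first_dot(path):
    """Hypothetical helper: drop everything after the FIRST dot of the basename."""
    directory, name = os.path.split(path)
    return os.path.join(directory, name.split(".", 1)[0])


def stem_last_dot(path):
    """Hypothetical helper: drop only the final extension."""
    return os.path.splitext(path)[0]


# Behaviour described in the report
print(stem_first_dot("something.else.png") + ".pred.txt")  # something.pred.txt
# Behaviour the reporter expected
print(stem_last_dot("something.else.png") + ".pred.txt")   # something.else.pred.txt
```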

andbue commented 2 years ago

Right, that's a little bit annoying; I've struggled with that myself before. In ocropus, the image file names contain information on preprocessing (e.g. 001.bin.png) that has to be ignored. If we change the current behaviour, we might break support for legacy datasets. I don't know if ocr4all needs this - @chreul ? Maybe we could either implement a command line switch to toggle file extension handling or just ignore a specific set of strings (bin, raw, nrm, maybe col?).
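A minimal sketch of the second suggestion, assuming the marker set is exactly the strings named above. The function name, the `strip_legacy` parameter (standing in for the proposed command line switch), and the constant are all hypothetical and not part of Calamari.

```python
import os

# Assumed set of ocropus-style preprocessing markers, taken from the comment above.
LEGACY_MARKERS = {"bin", "raw", "nrm", "col"}


def prediction_stem(path, strip_legacy=True):
    """Return the path without its final extension.

    If strip_legacy is True, also drop one trailing legacy marker
    such as ".bin" when it directly precedes the extension.
    """
    directory, name = os.path.split(path)
    stem, _ext = os.path.splitext(name)  # remove only the final extension
    if strip_legacy:
        head, _dot, tail = stem.rpartition(".")
        # head is empty when there is no dot or nothing before it
        if head and tail in LEGACY_MARKERS:
            stem = head
    return os.path.join(directory, stem)


print(prediction_stem("001.bin.png") + ".pred.txt")
print(prediction_stem("something.else.png") + ".pred.txt")
```

With this approach legacy names like 001.bin.png would still map to 001.pred.txt, while any other dotted name keeps all of its parts; passing `strip_legacy=False` would disable the legacy handling entirely.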

maxnth commented 2 years ago

> I don't know if ocr4all needs this

OCR4all currently does indeed need this, but we could just use a small wrapper / postprocessing script for it (and the newly written back end manages files differently anyway), so changing this wouldn't really be a problem for OCR4all.

stefanCCS commented 2 years ago

Well, in my opinion the current behaviour is unexpected for newcomers like myself. I (and I assume any other newcomer) like the idea of changing this; an additional command line switch would be OK, of course.