UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Support DjVu format? #37

Open zuphilip opened 7 years ago

zuphilip commented 7 years ago

I don't know if this makes sense, but I heard that DjVu can also store hidden text. There is https://github.com/jwilk/ocrodjvu , which includes scripts such as djvu2hocr and hocr2djvused, and there are also transformations to and from PDF: https://wiki.ubuntuusers.de/pdf2djvu/ , https://wiki.ubuntuusers.de/djvu2pdf/ . Is DjVu an OCR file format that is interesting in this context?

kba commented 7 years ago

@jwilk has done a lot of work on OCR with DjVu for digitization in Poland, IIUC. I'm not sure how widely DjVu is used, but it's certainly an interesting and fitting format for OCR. Integration would not be a huge effort, but I hesitate to do it without a use case and test data, since creating a presentation-level container format is more involved and subjective than converting between XML representations.

If we were to develop conversions like this, I would prioritize PDF, not because it is the better format but because it is the more widely used one.