Open zuphilip opened 7 years ago
@jwilk has done a lot of work on OCR with DjVu for digitization projects in Poland, IIUC. I'm not sure how widely DjVu is used, but it is certainly an interesting and fitting format for OCR. Integration would not be a huge effort, but I hesitate to do it without a use case and test data, since creating a presentation-level container format is more involved and subjective than converting between XML representations.
If we were to develop conversions like this, I would prioritize PDF, not because it is the better format but because it is the more widely used one.
I don't know if this makes sense, but I heard that DjVu can also store hidden text, and there is https://github.com/jwilk/ocrodjvu with tools such as djvu2hocr and hocr2djvused, as well as conversion from and to PDF: https://wiki.ubuntuusers.de/pdf2djvu/ , https://wiki.ubuntuusers.de/djvu2pdf/ . Is DjVu an OCR file format that is interesting in this context?