UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Simplify validations #115

Open zuphilip opened 4 years ago

zuphilip commented 4 years ago

Usually, I don't remember the exact name of the schema to validate against, e.g.

ocr-validate page-2019-07-15 input.xml

is hard to remember. On the other hand, the exact version is usually easy to detect by inspecting the first lines of the file, which contain the stylesheet definition.

Thus, I suggest simplifying the validation, e.g. such that we can also use

ocr-validate page input.xml

which will then check whether the input file is valid against the stylesheet given at the beginning. Even

ocr-validate input.xml

could work for XML files, with maybe some simple guessing for the others (HTML -> hOCR, JSON -> GCV).

I am not yet sure whether it is afterwards still useful to have the option to specify the exact stylesheet instead of simply any PAGE version, i.e. whether these simplifications should be added alongside the old commands rather than replace them.
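A rough sketch of the guessing step, assuming only the coarse rules mentioned above (JSON -> GCV, HTML -> hOCR, anything else that looks like XML needs a closer look at its header). The function names `guess_format` and `guess_format_of_file` are hypothetical and not part of ocr-fileformat:

```python
from typing import Optional


def guess_format(head: str) -> Optional[str]:
    """Guess a coarse format family from the first characters of a file.

    Returns "gcv", "hocr", "xml" or None. These labels are illustrative only.
    """
    # Ignore a byte-order mark and leading whitespace.
    text = head.lstrip("\ufeff \t\r\n")
    lower = text.lower()
    if text.startswith(("{", "[")):
        return "gcv"   # JSON: assumed to be Google Cloud Vision output
    if "<html" in lower:
        return "hocr"  # checked before plain XML, since hOCR may start with <?xml
    if text.startswith("<"):
        return "xml"   # ALTO, PAGE, ...: needs a look at the schema declaration
    return None


def guess_format_of_file(path: str, max_bytes: int = 4096) -> Optional[str]:
    """Apply guess_format to the first max_bytes of a file."""
    with open(path, "rb") as handle:
        head = handle.read(max_bytes).decode("utf-8", errors="replace")
    return guess_format(head)
```

A result of "xml" only says that the exact schema still has to be read from the header; None means the explicit schema name is still required.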

kba commented 4 years ago

whether it is afterwards still useful to have the option to specify the exact stylesheet instead of simply any PAGE version

I would leave that option and optionally automate. Note that such automation requires reading and parsing the XML twice, once for the schema detection and once for the actual validation. For bulk processing this should be avoidable.

zuphilip commented 4 years ago

I am fine with an additional option and leaving the more specific ones as well. :+1:

However, note that we might be able to detect the exact version of a PAGE file more easily, e.g. by looking only at the first few lines and matching them against a regex, similar to https://github.com/zotero/translators/blob/master/MARCXML.js#L41-L50 .
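A minimal sketch of that idea, assuming that PAGE files declare a namespace of the form `http://schema.primaresearch.org/PAGE/gts/pagecontent/<date>` near the top and that the schema name is `page-<date>`, as in the `page-2019-07-15` example above. The function names are hypothetical, and the mapping has not been checked against every schema shipped with ocr-fileformat:

```python
import re
from typing import Optional

# Assumption: the PAGE namespace URI ends in a date-stamped version.
PAGE_NS_RE = re.compile(
    r"https?://schema\.primaresearch\.org/PAGE/gts/pagecontent/(\d{4}-\d{2}-\d{2})"
)


def detect_page_schema(head: str) -> Optional[str]:
    """Return a schema name such as 'page-2019-07-15', or None if no match."""
    match = PAGE_NS_RE.search(head)
    return f"page-{match.group(1)}" if match else None


def detect_page_schema_of_file(path: str, max_bytes: int = 4096) -> Optional[str]:
    """Read only the head of the file instead of parsing the whole XML."""
    with open(path, "rb") as handle:
        head = handle.read(max_bytes).decode("utf-8", errors="replace")
    return detect_page_schema(head)
```

Because only the first few kilobytes are read and no XML parser is involved, detection stays cheap compared with the validation itself, which may matter for bulk processing.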