Open zuphilip opened 4 years ago
> whether it is afterwards still useful to have the option to specify the exact stylesheet instead of simply any PAGE version
I would leave that option and optionally automate. Note that such automation requires reading and parsing the XML twice, once for the schema detection and once for the actual validation. For bulk processing, this double parsing should be avoidable.
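A rough sketch of how the second parse could be avoided, assuming the validator can work on an already parsed tree. The names `root_namespace`, `validate_once` and the `validators` mapping (namespace URI to a callable) are hypothetical and only for illustration; the standard library has no XSD validation, so the actual check is left to whatever callable is plugged in.

```python
import xml.etree.ElementTree as ET


def root_namespace(root):
    """Return the namespace URI of the root element, or None if it has none."""
    # ElementTree stores qualified tags as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


def validate_once(xml_text, validators):
    """Parse xml_text a single time, pick a validator by namespace, and run it.

    `validators` is a hypothetical mapping from namespace URI to a callable
    that takes the parsed root element and returns the validation result.
    """
    root = ET.fromstring(xml_text)
    namespace = root_namespace(root)
    validator = validators.get(namespace)
    if validator is None:
        raise ValueError(f"no validator registered for namespace {namespace!r}")
    # The same parsed tree is reused, so the file is not read a second time.
    return namespace, validator(root)
```

Whether this helps in practice depends on the validator: if it only accepts a file path, the second parse happens anyway.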
I am fine with an additional option and leaving the more specific ones as well. :+1:
However, note that we might be able to detect the exact version of a PAGE file more easily, e.g. by looking at the first few lines for a regex match, similar to https://github.com/zotero/translators/blob/master/MARCXML.js#L41-L50 .
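Something along these lines might work as a minimal sketch. It assumes that the PAGE namespace is declared within the first few lines and ends in a date-like version string; `detect_page_version` and the `max_lines` cutoff are made-up names for illustration.

```python
import re

# Assumption: the PAGE namespace URI ends in a version of the form YYYY-MM-DD.
PAGE_NS_RE = re.compile(
    r"xmlns(?::\w+)?\s*=\s*[\"']"
    r"http://schema\.primaresearch\.org/PAGE/gts/pagecontent/"
    r"(\d{4}-\d{2}-\d{2})[\"']"
)


def detect_page_version(text, max_lines=10):
    """Return the PAGE version found in the first lines of text, or None."""
    # Only the head of the file is inspected, so no XML parsing is needed.
    head = "\n".join(text.splitlines()[:max_lines])
    match = PAGE_NS_RE.search(head)
    return match.group(1) if match else None
```

This is cheaper than parsing, but a regex cannot tell a commented-out declaration from a real one, so it is a heuristic only.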
Usually, I don't remember the exact name of the schema to validate against, e.g.
is hard to remember. On the other hand, it is usually easy to detect the exact version by inspecting the first lines with the stylesheet definition.
Thus, I suggest simplifying the validation, e.g. such that we can also use
which will then check whether the input file is valid against the stylesheet given at the beginning. Even
could work for XML files, and maybe some simple guessing for the others (html -> hocr, JSON -> GCV).
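The guessing for the non-XML cases could be as simple as a lookup by file extension. A minimal sketch, where `guess_format` and the returned labels are hypothetical names and only the two mappings mentioned above are assumed:

```python
from pathlib import Path

# Assumed mapping, taken from the examples above: html -> hocr, JSON -> GCV.
GUESS_BY_SUFFIX = {".html": "hocr", ".json": "gcv"}


def guess_format(filename):
    """Guess a format label from the file extension, or return None."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".xml":
        # XML files need a look inside to find the exact schema and version.
        return "xml"
    return GUESS_BY_SUFFIX.get(suffix)
```

For `.xml` the function only reports that the content has to be inspected; anything it does not recognise yields `None`, so the caller can still ask for an explicit schema.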
I am not yet sure whether it is afterwards still useful to have the option to specify the exact stylesheet instead of simply any PAGE version, i.e. whether to make these simplifications additional rather than replacing the old ones with them.