Command line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as well as ALTO XML, FineReader XML, and hOCR.
Apache License 2.0
does not convert to latest PAGE schema by default #21
When converting some older version, e.g. http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19, I have to explicitly use -convert-to LATEST, or otherwise nothing will be changed.
In trying to understand why, I came up with this hypothesis:
- targetFormat is initialized null
- if -convert-to LATEST is used, then 2019 is selected
- readPage, XmlPageReader, XmlModelAndValidatorProvider, PageXmlModelAndValidatorProvider, getLatestSchemaVersion, defaultSchemas, i.e. the 2019 version
- schemaVersion is never assigned in XmlPageReader
- Page.formatVersion comes from the file, not the parser
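
To make the hypothesis easier to discuss, here is a minimal, self-contained Java sketch of the suspected behaviour next to one possible alternative. It is not the converter's actual code: the class TargetFormatSketch, both resolve methods, and the LATEST_VERSION constant are hypothetical names invented for illustration, and the string "2019" only stands in for "the 2019 version" mentioned above.

```java
/** Hypothetical sketch only; none of these names exist in the converter. */
public class TargetFormatSketch {

    // Placeholder for "the 2019 version"; the exact schema date is not given in this issue.
    static final String LATEST_VERSION = "2019";

    /**
     * Suspected current behaviour: if no target format was requested
     * (targetFormat stays null), the version read from the file is kept,
     * so the output is written unchanged.
     */
    static String resolveAsHypothesized(String targetFormat, String fileVersion) {
        if (targetFormat == null) {
            return fileVersion;
        }
        return "LATEST".equals(targetFormat) ? LATEST_VERSION : targetFormat;
    }

    /**
     * One possible alternative (an assumption, not a proposed patch):
     * treat a missing target format the same way as LATEST.
     */
    static String resolveWithDefault(String targetFormat) {
        if (targetFormat == null || "LATEST".equals(targetFormat)) {
            return LATEST_VERSION;
        }
        return targetFormat;
    }

    public static void main(String[] args) {
        String fileVersion = "2010-03-19"; // version of the example input file
        System.out.println("no option, hypothesized: " + resolveAsHypothesized(null, fileVersion));
        System.out.println("LATEST, hypothesized:    " + resolveAsHypothesized("LATEST", fileVersion));
        System.out.println("no option, with default: " + resolveWithDefault(null));
    }
}
```

If the hypothesis holds, only the first call keeps the 2010-03-19 version, which would match the observation that nothing changes unless -convert-to LATEST is passed explicitly.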