Closed l0rn0r closed 1 year ago
I don't remember anything about the OCR4all integration (request API), but I often see this error with valid PAGE-XML files when
- @regionRef does not exist
- @points are negative or float (which is also invalid by schema)

(This is due to the parser from PRImA not being very robust, and not exposing the internal cause of the error correctly.)
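For anyone who wants to pre-check files for the @points case before loading them, here is a minimal sketch using only the Python standard library. It assumes that a valid @points value is two or more space-separated `x,y` pairs of non-negative integers; that pattern is my reading of the schema and should be compared against the PAGE XSD version you actually use. The function name `find_bad_points` is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: valid points are two or more "x,y" pairs of non-negative
# integers separated by single spaces. Check this against your PAGE XSD.
POINTS_RE = re.compile(r"[0-9]+,[0-9]+( [0-9]+,[0-9]+)+")


def find_bad_points(xml_text):
    """Return (element local name, points value) for each suspicious @points."""
    bad = []
    for elem in ET.fromstring(xml_text).iter():
        points = elem.get("points")
        if points is not None and not POINTS_RE.fullmatch(points):
            # Strip the "{namespace}" prefix that ElementTree puts on tags.
            local_name = elem.tag.rsplit("}", 1)[-1]
            bad.append((local_name, points))
    return bad
```

Calling `find_bad_points(open("page.xml", encoding="utf-8").read())` on a file (the file name is a placeholder) returns an empty list when every @points value matches the assumed pattern.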
Excuse the late reply, I somehow totally overlooked this issue. As already mentioned above, this is most likely caused by a PAGE XML file which isn't valid according to the schema. If you could upload the XML file which causes the error, I'll have a look at it.
Except for the last point (@points format), these are all cases which do not violate the schema. It's only the PRImA parser that fails. This is reproducible with all PRImA tools (editor, converter, viewer, layout evaluation), too.
I don't have examples readily available, but it should be straightforward to construct some from your existing test cases.
Except for the last point (@points format), these are all cases which do not violate the schema.
I'm not an XML schema expert, so the following train of thought might be flawed, but I'd be interested to know why the above-mentioned cases wouldn't make the XML invalid:
- @regionRef has IDREF as type and AFAIK this should always require the referenced ID to be present in the document according to the XML Schema Definition to make the document valid, doesn't it?
- e.g. OrderedGroup requires minOccurs="1" for either RegionRefIndexed / OrderedGroupIndexed / UnorderedGroupIndexed, so it being completely empty shouldn't be valid according to the schema
- As there isn't any minOccurs value explicitly set for Unicode elements in a TextEquiv, it defaults to minOccurs="1" and therefore should be mandatory to make the document valid
- @regionRef has IDREF as type and AFAIK this should always require the referenced ID to be present in the document according to the XML Schema Definition to make the document valid, doesn't it?
You're right. A dangling IDREF should make the document invalid according to the XML specification. I had based my judgement on the behaviour of the libxml2 implementation, which does not check IDREF.
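Since a validator may not catch dangling references (as noted for libxml2 above), a separate check can help when debugging a file. The following is only a sketch under two assumptions: IDs are stored in an attribute named `id`, and references in an attribute named `regionRef`. The helper name `find_dangling_region_refs` is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_dangling_region_refs(xml_text):
    """Return every regionRef value that matches no id in the document."""
    root = ET.fromstring(xml_text)
    # Assumption: every referencable element carries its ID in @id.
    known_ids = {e.get("id") for e in root.iter() if e.get("id") is not None}
    dangling = []
    for elem in root.iter():
        ref = elem.get("regionRef")
        if ref is not None and ref not in known_ids:
            dangling.append(ref)
    return dangling
```

An empty result means every @regionRef in the file points at an existing @id; it says nothing about the rest of the schema.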
- e.g. OrderedGroup requires minOccurs="1" for either RegionRefIndexed / OrderedGroupIndexed / UnorderedGroupIndexed, so it being completely empty shouldn't be valid according to the schema
Right again, my bad.
- As there isn't any minOccurs value explicitly set for Unicode elements in a TextEquiv, it defaults to minOccurs="1" and therefore should be mandatory to make the document valid
Again, you're spot on. Sorry for my sloppy nonsense! (I carried this misconception with me for quite some time...)
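The two minOccurs cases discussed above (an empty OrderedGroup, and a TextEquiv without a Unicode child) can also be spotted without a full schema validator. This is a rough sketch that matches elements by local name only and checks nothing else in the schema; the function name and the reason strings are invented for this example.

```python
import xml.etree.ElementTree as ET

# Children that can satisfy the OrderedGroup requirement, as listed above.
GROUP_MEMBERS = {"RegionRefIndexed", "OrderedGroupIndexed", "UnorderedGroupIndexed"}


def local(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def find_missing_children(xml_text):
    """Return (element name, reason) for each element lacking a required child."""
    problems = []
    for elem in ET.fromstring(xml_text).iter():
        name = local(elem.tag)
        children = {local(child.tag) for child in elem}
        if name == "OrderedGroup" and not (children & GROUP_MEMBERS):
            problems.append((name, "no indexed members"))
        elif name == "TextEquiv" and "Unicode" not in children:
            problems.append((name, "no Unicode child"))
    return problems
```

Results come back in document order, so they can be matched against the file by scanning from the top.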
I'll close this for now, feel free to reopen this @l0rn0r if the issue still persists and isn't caused by invalid PAGE XML (or if the invalid PAGE XML is produced by OCR4all).
Hello, I'm running the OCR4all Docker container on Ubuntu 20.04. It works quite well, but I get an error when I try to load a PageXML file in LAREX.
With a page open in the LAREX editor that had gone through every OCR4all step up to recognition, I wanted to load an already existing PageXML of this page (to check whether I could load a ground truth text for training). I get the error message: "Couldn't retrieve annotations from file."
And in the console it says "request:/file/upload/annotations - fail 'parsererror'", which is indicated by Larex/resources/js/viewer/communicator.js, line 17 (failed POST request). The write permissions of the data folder on the server should be fine (777). The PageXML file is v2013-07-15.
Any hint for this problem? Or any hint on how to load ground truth from existing PageXMLs to train a new model?