OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

"Load Result" with parsererror #326

Closed l0rn0r closed 1 year ago

l0rn0r commented 1 year ago

Hello, I'm running the OCR4all Docker container on Ubuntu 20.04. It works quite well, but I get an error when I try to load a PageXML file in LAREX.

I had a page open in the LAREX editor that had gone through every OCR4all step up to recognition. When I tried to load an already existing PageXML file for this page (to check whether I could load ground truth text for training), I got the error message: "Couldn't retrieve annotations from file."

The console says "request:/file/upload/annotations - fail 'parsererror'", which is reported by Larex/resources/js/viewer/communicator.js, line 17 (a failed POST request). The write permissions on the server's data folder should be fine (777). The PageXML file is v2013-07-15.
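
When reporting a problem like this, it can help to confirm which PAGE schema version a file actually declares. The following is only a minimal sketch using Python's standard library; the function name `page_schema_version` and the sample document are made up for illustration, and it assumes the version is the last path segment of the root namespace URI, as it is for the 2013-07-15 schema.

```python
import re
import xml.etree.ElementTree as ET


def page_schema_version(xml_text):
    """Return the last path segment of the root namespace URI, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree renders a namespaced tag as "{namespace-uri}LocalName".
    match = re.match(r"\{(.+)\}", root.tag)
    if match is None:
        return None  # root element declares no namespace
    return match.group(1).rsplit("/", 1)[-1]


# Hypothetical, heavily shortened PAGE document used only for this sketch.
sample = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">'
    '<Page imageFilename="0001.png" imageWidth="10" imageHeight="10"/>'
    '</PcGts>'
)

print(page_schema_version(sample))
```

This only reads the declared namespace; it says nothing about whether the file is valid against that schema.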

Any hint for this problem? Or any hint how to load ground truth from existing PageXMLs to train a new model?

bertsky commented 1 year ago

Don't remember anything about OCR4all integration (request API), but I often see this error with valid PAGE-XML files when

(This is due to the parser from PRImA being not very robust, and not exposing the internal cause of error correctly.)

maxnth commented 1 year ago

Excuse the late reply, I somehow totally overlooked this issue. As already mentioned above, this is most likely caused by a PAGE XML file which isn't valid according to the schema. If you could upload the XML file which causes the error, I'll have a look at it.

bertsky commented 1 year ago

Except for the last point (@points format), these are all cases which do not violate the schema. It's only the PRImA parser that fails. This is reproducible with all PRImA tools (editor, converter, viewer, layout evaluation), too.

I don't have examples readily available, but it should be straightforward to construct some from your existing test cases.

maxnth commented 1 year ago

"Except for the last point (@points format), these are all cases which do not violate the schema."

I'm not an XML schema expert, so the following train of thought might be flawed, but I'd be interested to know why the above-mentioned cases wouldn't make the XML invalid?

bertsky commented 1 year ago
  • @regionRef has IDREF as type and AFAIK this should always require the referenced ID to be present in the document according to the XML Schema Definition to make the document valid, doesn't it?

You're right. A dangling IDREF should make the document invalid according to the XML specification. I had based my judgement on the behaviour of the libxml2 implementation, which does not check IDREF.
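
Since a validator may not report dangling references, a quick standalone check can be useful. The following is a rough sketch, not part of LAREX or the PRImA tools: it assumes every referenceable element carries an `id` attribute, and the function name `dangling_region_refs` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def dangling_region_refs(xml_text):
    """Return the sorted regionRef values that match no id in the document."""
    root = ET.fromstring(xml_text)
    # Collect every declared id, then every regionRef that points somewhere.
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    refs = {el.get("regionRef") for el in root.iter()
            if el.get("regionRef") is not None}
    return sorted(refs - ids)


# Hypothetical, shortened PAGE document: "r2" is referenced but never defined.
sample = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="0001.png" imageWidth="10" imageHeight="10">
    <ReadingOrder>
      <OrderedGroup id="g1">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
  </Page>
</PcGts>
"""

print(dangling_region_refs(sample))
```

An empty list means every `regionRef` resolves to an element in the same file.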

  • e.g. OrderedGroup requires minOccurs="1" for either RegionRefIndexed / OrderedGroupIndexed / UnorderedGroupIndexed so it being completely empty shouldn't be valid according to the schema

Right again, my bad.

  • As there isn't any minOccurs value explicitly set for Unicode elements in a TextEquiv it defaults to minOccurs="1" and therefore should be mandatory to make the document valid

Again, you're spot on. Sorry for my sloppy nonsense! (I carried this misconception with me for quite some time...)
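
The two other cases discussed above (an OrderedGroup without any members, and a TextEquiv without a Unicode child) can be spotted with a similar standard-library sketch. This is not a schema validator and covers only these two rules; the names `structural_problems` and `ORDERED_GROUP_MEMBERS` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Child elements of which an OrderedGroup needs at least one (per the discussion above).
ORDERED_GROUP_MEMBERS = {
    "RegionRefIndexed",
    "OrderedGroupIndexed",
    "UnorderedGroupIndexed",
}


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def structural_problems(xml_text):
    """List empty OrderedGroup elements and TextEquiv elements lacking Unicode."""
    problems = []
    root = ET.fromstring(xml_text)
    for el in root.iter():
        name = local_name(el.tag)
        children = {local_name(child.tag) for child in el}
        if name == "OrderedGroup" and not children & ORDERED_GROUP_MEMBERS:
            problems.append(f"OrderedGroup {el.get('id')!r} has no members")
        elif name == "TextEquiv" and "Unicode" not in children:
            problems.append("TextEquiv without Unicode child")
    return problems


# Hypothetical, shortened PAGE document exhibiting both problems.
sample = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="0001.png" imageWidth="10" imageHeight="10">
    <ReadingOrder>
      <OrderedGroup id="g1"/>
    </ReadingOrder>
    <TextRegion id="r1">
      <TextEquiv><PlainText>abc</PlainText></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""

for problem in structural_problems(sample):
    print(problem)
```

Matching on local names keeps the sketch independent of the PAGE namespace version, at the cost of ignoring namespaces entirely.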

maxnth commented 1 year ago

I'll close this for now, feel free to reopen this @l0rn0r if the issue still persists and isn't caused by invalid PAGE XML (or if the invalid PAGE XML is produced by OCR4all).