Closed alexander-winkler closed 2 years ago
The file on the left is processed without problem, the file on the left is overwritten as soon as saved.
That's two times the left side.
The relevant part has been cut off. Please use unified mode for comparisons (diff -u).
I have seen this happen in older versions of LAREX (before it became a true PAGE-XML editor). Which version are you using?
I'm using the current docker version (0.5.0).
The current latest release of OCR4all sadly features a LAREX version <0.6 which is still susceptible to this bug.
You could try upgrading to uniwuezpd/ocr4all:staging, which includes the current staging version of LAREX. We'll also release a new ocr4all:latest in the next few days after ironing out some of the remaining bugs in LAREX.
Thanks! The staging version doesn't read the XML in question (you'd anticipated this in #301). I might have to convert the 'old' PageXML into the version that current LAREX releases understand, to get some backwards compatibility. Is it just the negative coordinates that cause the problem? I can't spot any other major difference between an XML that can be read and one that can't. Sorry for all these questions.
Use xmllint --schema path/to/pagecontent.xsd --noout page.xml to get some diagnostics.
For repairs, have a look at the page XSLT in https://github.com/bertsky/workflow-configuration/
Is it just the negative coordinates that cause the problem?
At least for the file provided in #301, the negative coordinates are indeed the only thing that makes the PAGE XML invalid. I just set all negative coordinate points to 0 and it loaded just fine afterwards.
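For readers who want to try the same repair without editing files by hand, here is a minimal Python sketch of "set all negative coordinate points to 0". It is not the code used in the thread. It assumes that coordinates live in `points` attributes of the form `x1,y1 x2,y2 ...` with integer values, as in recent PAGE versions; the function names `clamp_points` and `clamp_tree` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def clamp_points(points: str) -> str:
    """Replace negative x or y values in a 'points' string with 0.

    Assumes integer coordinates; int() raises ValueError otherwise.
    """
    fixed = []
    for pair in points.split():
        x, y = pair.split(",")
        fixed.append(f"{max(0, int(x))},{max(0, int(y))}")
    return " ".join(fixed)


def clamp_tree(root: ET.Element) -> int:
    """Clamp every 'points' attribute under root; return how many changed."""
    changed = 0
    for elem in root.iter():
        points = elem.get("points")
        if points is None:
            continue
        fixed = clamp_points(points)
        if fixed != points:
            elem.set("points", fixed)
            changed += 1
    return changed


# Hypothetical, heavily shortened PAGE document with two negative values.
sample = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<Page><TextRegion id="r1">'
    '<Coords points="-5,10 120,-3 120,80 -5,80"/>'
    "</TextRegion></Page></PcGts>"
)
root = ET.fromstring(sample)
n_changed = clamp_tree(root)
coords = next(e for e in root.iter() if e.tag.endswith("Coords"))
fixed_points = coords.get("points")
print(n_changed, fixed_points)
```

When writing a repaired tree back to disk with ElementTree, the default namespace may need to be registered first (`ET.register_namespace`), otherwise the output uses generated prefixes such as `ns0`. Work on copies of the files in any case.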
Fantastic help, thank you very much!
Just for the record (and as an aide-memoire for myself), I followed @bertsky's advice:
wget "https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd"
xmllint --schema pagecontent.xsd --noout BOOK/processing/*.xml
Only pages with negative coordinates didn't validate.
Then I used the xslt mentioned above to transform my pagexmls.
wget "https://raw.githubusercontent.com/bertsky/workflow-configuration/master/page-fix-coords.xsl"
mkdir -p tmp_output
for i in BOOK/processing/*.xml; do xsltproc -o "tmp_output/$(basename "$i")" page-fix-coords.xsl "$i"; done
The PageXMLs in tmp_output work perfectly fine.
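As a quick sanity check on a converted directory, a sketch along the following lines could flag files that still contain negative coordinates, as well as files that no longer parse at all (for example, ones reduced to a bare XML declaration). It does not replace validation against the schema with xmllint. The function names `negative_points` and `check_dir` are illustrative, and the check again assumes coordinates are stored in `points` attributes.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def negative_points(xml_data):
    """Return every 'points' value (str) that contains a minus sign.

    xml_data may be str or bytes; raises ET.ParseError if it is not
    well-formed XML (e.g. a declaration without a root element).
    """
    root = ET.fromstring(xml_data)
    return [
        elem.get("points")
        for elem in root.iter()
        if "-" in elem.get("points", "")
    ]


def check_dir(directory):
    """Map file name -> problem description for each suspicious *.xml file."""
    report = {}
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            bad = negative_points(path.read_bytes())
        except ET.ParseError as err:
            report[path.name] = f"not parseable: {err}"
            continue
        if bad:
            report[path.name] = f"{len(bad)} element(s) with negative coordinates"
    return report


if __name__ == "__main__":
    # 'tmp_output' is the output directory used in the commands above.
    for name, problem in check_dir("tmp_output").items():
        print(f"{name}: {problem}")
```

An empty report only means that this particular check found nothing; other schema violations would go unnoticed.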
Many thanks to both of you!
I have a problem that has repeatedly led to considerable data loss (I have the impression that I have seen an issue or discussion on this before, so I apologize in advance):
In a set of legacy PageXMLs, some run through smoothly in v.0.6; others, however, get overwritten. Here is a diff of the first 7 and 8 lines, respectively. The file on the left is processed without problem, the file on the left is overwritten as soon as saved.
Here is a file that gets overwritten when saved with the current OCR4all docker version: https://cloud.uni-halle.de/s/cgzRExPB0xPRP3I
After saving, the right file is reduced to the following:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Do you happen to know how to avoid this behaviour?