OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

content of PageXML overwritten #302

Closed alexander-winkler closed 2 years ago

alexander-winkler commented 2 years ago

I have a problem that has repeatedly led to considerable data loss (I have the impression that I have seen an issue or discussion about this before, so I apologize in advance if this is a duplicate):

In a set of legacy PageXML files, some run through smoothly in v.0.6; others, however, get overwritten. Here is a diff of the first 7 and 8 lines, respectively. The file on the left is processed without problem, the file on the left is overwritten as soon as saved.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>        | <?xml version="1.0"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pageco   <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pageco
  <Metadata>                              <Metadata>
    <Creator/>                            |     <Creator>User123</Creator>
    <Created>2022-01-19T20:02:07</Created>            |     <Created>2021-06-16T20:13:22</Created>
    <LastChange>1970-01-01T00:00:00</LastChange>          |     <LastChange>2021-06-16T20:13:22</LastChange>
    <Comments/>                           <
  </Metadata>                             </Metadata>

Here is a file that gets overwritten when saved with the current OCR4all docker version: https://cloud.uni-halle.de/s/cgzRExPB0xPRP3I

After saving, the right file is reduced to the following:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Do you happen to know how to avoid this behaviour?

bertsky commented 2 years ago

> The file on the left is processed without problem, the file on the left is overwritten as soon as saved.

That's two times the left side.

The relevant part has been cut off. Please use unified mode for comparisons (`diff -u`).

I have seen this happen in older versions of LAREX (before it became a true PAGE-XML editor). Which version are you using?

alexander-winkler commented 2 years ago

I'm using the current docker version (0.5.0).


Here is the `diff -u` output of a file that gets overwritten (first argument) and a file that's OK (second argument):

```
--- /dev/fd/63 2022-01-20 13:04:37.571947384 +0100
+++ /dev/fd/62 2022-01-20 13:04:37.571947384 +0100
@@ -1,277 +1,278 @@
- + - User123 - 2021-06-16T20:13:22 - 2021-06-16T20:13:22 + + 2022-01-19T20:02:07 + 1970-01-01T00:00:00 + - - - - - + + + + + + + de i Regni. Le Donne fanno ſoſpirare e pian- + - le prattica. Sono donna, e ſe non la foſſi, + de i Regni. Le Donne fanno ſoſpirare e pian- - - + + + + gere: ma per che? per amore; e gl' vomini? + - vorrei farmi ſtampare di nuovo, per prendere + gere: ma per che? per amore; e gl' vomini? - - + + + + fanno piangere diſperatamente, con le ingius- + - + fanno piangere diſperatamente, con le ingius- - - + + + + titie, con gl' aſſaſſinj, e con gl' odi perpetuj; + - una forma, che (come queſta) dà forma a + titie, con gl' aſſaſſinj, e con gl' odi perpetuj; - - + + + + mi ſtupiſco che gl' vomini non arronſischino + - tutte le umane contenteʒʒe. + mi ſtupiſco che gl' vomini non arronſischino - - - + + + - - - - + come il Roſto sul fuoco, quando intraprendo- - - + + + + + - perbette le Donne, eſſendo tanto ſuperiori di + no di ſollevarſi, opprimendo le pouere Don- - - - + + + - - - - merito a gl' vomini ſuperbiſſimi. + ne! Si hanno aſſunto (con l' autorità ſpauen- - - - + + + - - - - + tosa d' vomo,) l' impiego de Magiſtrati; do- - - + + + + + - abſenʒa, ſono quegl' iſteſſi che ci ſupplicano di + ue non ſi ſentono, che ſentenʒe appaſſionate, - - + + + + + - pietà, ci adorano come Dee, e per obligarci, + Liti eternate, Proceſſi male eſaminati; e non - - + + + + + - non penſano a ſpendere in un hora quanto i + ſi ſtimano che quegl' Avocati, che ſono abili - - - + + + - - - - loro antenati accumulorno in cent' anni. + a vincere quello, che di giuſtitia farebbe per- - - - + + + - - - - ErT + so. (intendiamoſi che parlo in generale, nè - - + + + + + - Scena VJJ. + pretendo di offendere in particolare.)
Ma per - - - + + + - - - - fefaf + che ſono vomini, ſi tace, ſe foſſe una Donna - - - + + + - - - - Ae + gridarebbero più de i Galli quando preſagiſco- - - - + + + - - - - + no la pioggia. Gl' aſſaſſini da ſtrada, ſono - - - + + + - - - - 6 + vomini; le gouernatrici della Casa ſono Don- - - + + + + + - or + ne; Quelli che fanno le guerre ſono vomini, - - - + + + - - - - Lto + quelle che deſiderano perpetua pace fono Don- - - - + + + - - - - grandi da dirti, & altre maggiori da inſegnarti. + ne; J ridotti doue ſi machinano tradimenti, o s' - - - + + + - - - - ſentire ancora una Donna di ſpirito. + interpretano ſiniſtramente le attioni de Pren- - - - + + + - - - - maeſtramenti, come quelli che vengo di rice- + cipi, ſono formati d' vomini; le converſationi - - - + + + - - - - uere. Sentiamoci per che ſono ſtanca. + doue ſi raccontano favole, e fra giochi inno- - - - + + + - - - - Si ſentano. + centi ſi ride, ſono formati di Donne eh! che - - - + + + + + le Donne ſono i condimenti della ſocietà, le + - - + + + + + - Scena + riſtoratrici della natura, le vere felicità di chi - - + + + + + - M 2 + le prat- + + +
```

maxnth commented 2 years ago

The current latest release of OCR4all sadly features a LAREX version <0.6 which is still susceptible to this bug. You could try upgrading to uniwuezpd/ocr4all:staging which includes the current staging version of LAREX. We'll also release a new ocr4all:latest in the next few days after ironing out some of the remaining bugs in LAREX.

alexander-winkler commented 2 years ago

Thanks! The staging version doesn't read the XML in question (you'd anticipated this in #301). I might have to convert the 'old' PageXML into the version that current LAREX releases can read, to get some backwards compatibility. Is it just the negative coordinates that cause the problem? I can't spot any other major difference between a readable XML file and one that isn't. Sorry for all these questions.

bertsky commented 2 years ago

Use `xmllint --schema path/to/pagecontent.xsd --noout page.xml` to get some diagnostics.

For repairs, have a look at the page XSLT in https://github.com/bertsky/workflow-configuration/

maxnth commented 2 years ago

> Is it just the negative coordinates that cause the problem?

At least for the file provided in #301, the negative coordinates are indeed the only thing that makes the PAGE XML invalid. I just set all negative coordinate points to 0 and it loaded just fine afterwards.
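
Setting negative coordinates to 0 can be scripted. The following is only a minimal sketch of one possible approach, not necessarily how it was done here. It assumes that coordinates are stored as integers in `points="..."` attributes; the function name `clamp_negative_coords` is made up for illustration.

```shell
# clamp_negative_coords (hypothetical helper): read PAGE XML on stdin and
# write it to stdout with every negative integer inside a points="..."
# attribute replaced by 0. Hyphens outside points attributes are untouched.
# Assumption: coordinates are integers stored in points="x,y x,y ..." form.
clamp_negative_coords() {
  # Each substitution fixes the first remaining negative number in a points
  # attribute; "t again" repeats until nothing is left to replace.
  # In the replacement, \10 means backreference \1 followed by a literal 0.
  sed -E -e ':again' \
         -e 's/(points="[^"-]*)-[0-9]+/\10/' \
         -e 't again'
}
```

Usage would be along the lines of `clamp_negative_coords < page.xml > page.fixed.xml`, where both file names are placeholders. Clamping moves the affected points onto the image border, so it is worth checking the result visually.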

alexander-winkler commented 2 years ago

Fantastic help, thank you very much!

Just for the record (and as an aide-memoire for myself), I followed @bertsky's advice:

wget "https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd"
xmllint --schema pagecontent.xsd --noout BOOK/processing/*.xml

Only pages with negative coordinates didn't validate.
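
As a quick cross-check of that observation, the files containing negative coordinates can be listed without the schema. This is a rough sketch that assumes coordinates live in `points="..."` attributes; `list_negative_coords` is a hypothetical helper name, and a plain text search is no substitute for schema validation.

```shell
# list_negative_coords FILE... (hypothetical helper): print the name of
# every given file that has a negative number inside a points="..."
# attribute. Returns non-zero if no file matches (grep's usual behaviour).
list_negative_coords() {
  grep -lE 'points="[^"]*-[0-9]' "$@"
}
```

With the layout used above, `list_negative_coords BOOK/processing/*.xml` should print the same files that `xmllint` reports as invalid, provided negative coordinates are the only schema violation.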

Then I used the XSLT mentioned above to transform my PageXML files.

wget "https://raw.githubusercontent.com/bertsky/workflow-configuration/master/page-fix-coords.xsl"
mkdir tmp_output
for i in BOOK/processing/*.xml; do xsltproc -o "tmp_output/$(basename "$i")" page-fix-coords.xsl "$i"; done

The PageXMLs in tmp_output work perfectly fine.

Many thanks to both of you!