OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

Rewrite reading/writing PAGE XML files #214

Closed: maxnth closed this issue 3 years ago

maxnth commented 4 years ago

Problem

Currently, when a PAGE XML file is read, only the parts LAREX needs for operation are extracted and sent as JSON to the editor. After editing is finished and the user saves the results, a new PAGE XML file is created and the results are written into it.

As a consequence, PAGE XML "features" which LAREX doesn't support or use (e.g. the comments attribute of certain elements) are discarded.

Solution

Reading and writing PAGE XML should be rewritten in such a way that no existing information is lost (where possible).
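A minimal sketch of the proposed behaviour, written in Python for brevity (it is not LAREX code). It assumes a simplified, namespace-free stand-in for a PAGE XML file, and the helper name `merge_coords` is hypothetical. The point is that edits are applied to the parsed tree in place, so anything the editor does not know about (here the comments attribute) survives the save.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a PAGE XML file (assumption for brevity).
EXISTING = """<PcGts>
  <Page imageFilename="0001.png">
    <TextRegion id="r1" type="paragraph" comments="checked by hand">
      <Coords points="0,0 10,0 10,10 0,10"/>
    </TextRegion>
  </Page>
</PcGts>"""


def merge_coords(root, edits):
    """Update Coords/@points for the region IDs in `edits`; leave everything else untouched."""
    for region in root.iter("TextRegion"):
        new_points = edits.get(region.get("id"))
        if new_points is not None:
            region.find("Coords").set("points", new_points)
    return root


# The editor only reports the new outline for region r1.
root = merge_coords(ET.fromstring(EXISTING), {"r1": "0,0 20,0 20,10 0,10"})
region = root.find(".//TextRegion")
print(region.get("comments"))
print(region.find("Coords").get("points"))
```

Writing a fresh file from the editor's JSON would instead require the editor to carry every attribute it does not understand, which is exactly what gets lost today.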

Possible Problems

Cases where existing elements are altered significantly, e.g. splitting a line: it has to be decided which of the newly created lines keeps the "old" information, or whether both of them inherit it.

bertsky commented 4 years ago

I fully agree. Practically, if retaining existing annotation is required, one needs to "align/merge" LAREX output and input PAGE in the current state of affairs.

And I surmise that the loss of information affects both elements and attributes ignored by LAREX. The latter can usually be re-integrated trivially. But the former may need some "accounting" of IDrefs as to whether they have been modified or created by the editor.
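One way to picture this "accounting" is a plain set comparison of IDs before and after editing, followed by a cleanup of references that now dangle. The sketch below is in Python and is only an illustration under assumptions: the function names `classify_ids` and `prune_dangling_refs` are made up, namespaces are omitted, and only RegionRefIndexed entries of a ReadingOrder are handled.

```python
import xml.etree.ElementTree as ET


def classify_ids(input_ids, output_ids):
    """Sort segment IDs into kept, created and deleted by comparing input and editor output."""
    input_ids, output_ids = set(input_ids), set(output_ids)
    return {
        "kept": input_ids & output_ids,
        "created": output_ids - input_ids,
        "deleted": input_ids - output_ids,
    }


def prune_dangling_refs(reading_order, deleted):
    """Drop RegionRefIndexed entries whose regionRef points at a deleted region."""
    for group in reading_order.iter("OrderedGroup"):
        for ref in list(group.findall("RegionRefIndexed")):
            if ref.get("regionRef") in deleted:
                group.remove(ref)


# Namespace-free ReadingOrder fragment (assumption for brevity).
order = ET.fromstring(
    '<ReadingOrder><OrderedGroup id="g1">'
    '<RegionRefIndexed index="0" regionRef="r1"/>'
    '<RegionRefIndexed index="1" regionRef="r2"/>'
    "</OrderedGroup></ReadingOrder>"
)
status = classify_ids(["r1", "r2"], ["r1", "r3"])
prune_dangling_refs(order, status["deleted"])
remaining = [ref.get("regionRef") for ref in order.iter("RegionRefIndexed")]
```

Elements in the "kept" set can have their ignored attributes and children re-attached from the input file; "created" ones have nothing to inherit unless they result from a split or merge.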

Splitting segments is a good corner case IMO: Some of the annotations (e.g. TextStyle, @primaryScript/Language) could usually just be copied, whereas others (e.g. TextEquiv, sub-segments like Words in a TextLine or TextLines in a TextRegion, or reading order updates) almost always need to be (non-trivially) split themselves. Some of these decisions could be captured right away in the UI...
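To make the distinction concrete, here is a rough Python sketch of such a split. Everything in it is an assumption for illustration: the helper `split_line`, the two allow-lists, the namespace-free elements, and the rule that the first half keeps the old ID. Copyable annotations are duplicated into both halves, while the text is only written if the caller (e.g. the user through the UI) supplies an explicit split.

```python
import copy
import xml.etree.ElementTree as ET

# Annotations assumed safe to duplicate into both halves (illustrative, not exhaustive).
COPYABLE_ATTRS = ("primaryScript", "primaryLanguage")
COPYABLE_CHILDREN = ("TextStyle",)


def split_line(line, points_a, points_b, new_id, texts=None):
    """Return two TextLine elements; the first keeps the old ID, the second gets `new_id`."""
    halves = []
    targets = ((line.get("id"), points_a), (new_id, points_b))
    for position, (line_id, points) in enumerate(targets):
        half = ET.Element("TextLine", {"id": line_id})
        for name in COPYABLE_ATTRS:
            if line.get(name) is not None:
                half.set(name, line.get(name))
        ET.SubElement(half, "Coords", {"points": points})
        # Text cannot be split automatically; write it only if a split was provided.
        if texts is not None:
            equiv = ET.SubElement(half, "TextEquiv")
            ET.SubElement(equiv, "Unicode").text = texts[position]
        for tag in COPYABLE_CHILDREN:
            for child in line.findall(tag):
                half.append(copy.deepcopy(child))
        halves.append(half)
    return halves


line = ET.fromstring(
    '<TextLine id="l1" primaryScript="Latn - Latin">'
    '<Coords points="0,0 10,0 10,5 0,5"/>'
    "<TextEquiv><Unicode>foo bar</Unicode></TextEquiv>"
    '<TextStyle fontFamily="Fraktur"/>'
    "</TextLine>"
)
left, right = split_line(line, "0,0 5,0 5,5 0,5", "5,0 10,0 10,5 5,5", "l1_b", ("foo", "bar"))
bare_left, bare_right = split_line(line, "0,0 5,0 5,5 0,5", "5,0 10,0 10,5 5,5", "l1_b")
```

Without the `texts` argument both halves come back without a TextEquiv, which avoids silently attaching the full old text to a line that now covers only part of it.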

maxnth commented 3 years ago

Quoting parts of the description of #264 for an overview of the current state of this issue

What this PR implemented

  • Directly reads from PAGE XML files without skipping validation
  • Fixes a bug which produced invalid PAGE XML (TextEquiv at a non-allowed location) and automatically repairs this error when reading invalid PAGE XML files
  • Reimplements Polygons, Rectangles, Elements, etc. to actually represent their counterparts (e.g. Polygons now solely represent the actual coordinates in the PAGE XML and therefore don't possess an ID). This eases adding Baselines.
  • Adds certain helper functions / methods to reduce duplicate code fragments (e.g. for custom Polygon creation from PAGE XML)
  • Removes TextLines in the viewer in case a TextRegion (or subtype) is transformed to a region type which isn't allowed to directly have TextLines as children (e.g. paragraph -> ImageRegion)
  • Basic reimplementation of writing from the frontend PageAnnotation to existing PAGE XML files. Instead of throwing away the existing PAGE XML and writing a brand new one from the PageAnnotation, LAREX now tries to merge changes into the PAGE XML. This keeps information which currently is neither editable in nor represented by LAREX (e.g. @comments) instead of discarding it. With this base implementation this doesn't apply to all possible transformations (see below).

What still needs to be implemented

  • Account for splitting / merging segments or text lines and keeping the ID / attributes / (text) content if the user selects this behavior. Switches should be added to the frontend for this. Prototypes for this already exist and will get merged ASAP after some additional local testing.
  • Currently changing the type of a Segment (this does not apply to adding a subtype to a TextRegion or changing from one subtype to another) leads to the destruction of the segment and the creation of a new one with the correct type. In the long run we should allow keeping all existing information which is also allowed in the new type (this has to be verified).
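A rough Python sketch of what "keeping all existing information which is also allowed in the new type" could look like. The allow-lists below are hypothetical placeholders and have not been checked against the PAGE schema (as noted above, this still has to be verified); the helper `retype` and the namespace-free elements are likewise assumptions for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical allow-lists; real ones would have to be derived from the PAGE schema.
ALLOWED_ATTRS = {"ImageRegion": {"id", "custom", "comments", "orientation"}}
ALLOWED_CHILDREN = {"ImageRegion": {"Coords", "UserDefined", "Labels"}}


def retype(region, new_tag):
    """Build a region of type `new_tag`, carrying over only what that type allows."""
    converted = ET.Element(new_tag)
    for name, value in region.attrib.items():
        if name in ALLOWED_ATTRS[new_tag]:
            converted.set(name, value)
    for child in region:
        if child.tag in ALLOWED_CHILDREN[new_tag]:
            converted.append(copy.deepcopy(child))
    return converted


region = ET.fromstring(
    '<TextRegion id="r1" type="paragraph" comments="keep me">'
    '<Coords points="0,0 1,0 1,1"/>'
    '<TextLine id="l1"><Coords points="0,0 1,0 1,1"/></TextLine>'
    "</TextRegion>"
)
image = retype(region, "ImageRegion")
```

In this sketch the ID and the comments attribute survive the conversion, while the TextRegion-specific type attribute and the TextLine child are dropped, matching the viewer behaviour described in the first list.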