OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

move edited/GT TextEquiv to front #282

Open bertsky opened 3 years ago

bertsky commented 3 years ago

Unfortunately, since PAGE-XML completely underspecifies what and how TextEquiv (with or without @index) is used, applications have to define their own convention. IIUC (please correct me if I'm wrong):

LAREX convention

Existing TextEquivs are kept unchanged. Existing @index=0 is treated as GT. Anything else is treated as prediction, and only the highest position/index is shown.

When manual edits are done, GT is updated or created.

When saving, GT (if available) will become @index=0 and prediction (if available) @index=1. These two will be appended to any existing TextEquiv.
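To make the reading side of this convention concrete, here is a minimal sketch (not LAREX's actual code) of how GT and prediction could be picked from a line's TextEquivs. The helper name `split_gt_and_prediction` is hypothetical, and "highest position/index" is interpreted here as: highest @index among the non-GT entries, with document position as tie-breaker and entries without @index ranked lowest. That interpretation is an assumption.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}


def text_of(textequiv):
    """Return the Unicode content of a TextEquiv (empty string if absent)."""
    return textequiv.findtext("pc:Unicode", default="", namespaces=NS)


def split_gt_and_prediction(line):
    """Hypothetical helper: return (gt_text, prediction_text), either may be None.

    Assumed reading of the convention: @index=0 is GT (first one wins),
    everything else is a prediction candidate, and the candidate with the
    highest @index (ties broken by document position) is the one shown.
    """
    gt = None
    candidates = []
    for position, te in enumerate(line.findall("pc:TextEquiv", NS)):
        index = te.get("index")
        if index == "0":
            if gt is None:
                gt = te
        else:
            # TextEquivs without @index rank below any indexed one (assumption)
            rank = int(index) if index is not None else -1
            candidates.append((rank, position, te))
    prediction = max(candidates, key=lambda c: c[:2])[2] if candidates else None
    return (
        text_of(gt) if gt is not None else None,
        text_of(prediction) if prediction is not None else None,
    )
```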

PageViewer

PageViewer only shows the first TextEquiv as a tooltip (regardless of @index).

Aletheia

Aletheia only shows the first TextEquiv as a tooltip (regardless of @index). It does allow editing multiple TextEquivs, but does not set (or even show) their @index. (It just calls them Variant1, Variant2 and so on in the GUI.)

@tboenig please correct me if this is not true for the fully licensed version.

OCR-D convention

The current spec says that where multiple TextEquivs are available, @index=1 should be preferred.

However, that's not at all what is currently implemented across OCR-D: processors read the first TextEquiv (regardless of @index) and write starting at @index=0.
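The gap between spec and implementation can be illustrated with two small readers. This is only a sketch under assumptions: `read_first` and `read_preferring_index` are hypothetical names, not OCR-D API, and the fallback to the first TextEquiv in the spec-style reader is my own choice. The sample line mimics what LAREX would save when a TextEquiv without @index already existed.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}


def read_first(line):
    """Implementation-style reader: first TextEquiv in document order, @index ignored."""
    te = line.find("pc:TextEquiv", NS)
    return None if te is None else te.findtext("pc:Unicode", default="", namespaces=NS)


def read_preferring_index(line, preferred="1"):
    """Spec-style reader: prefer the TextEquiv whose @index equals `preferred`.

    Falling back to the first TextEquiv is an assumption made for this sketch.
    """
    for te in line.findall("pc:TextEquiv", NS):
        if te.get("index") == preferred:
            return te.findtext("pc:Unicode", default="", namespaces=NS)
    return read_first(line)


# A pre-existing TextEquiv, followed by the GT and prediction appended on save.
LINE = f"""<TextLine xmlns="{PAGE_NS}" id="l1">
  <TextEquiv><Unicode>old OCR</Unicode></TextEquiv>
  <TextEquiv index="0"><Unicode>corrected text</Unicode></TextEquiv>
  <TextEquiv index="1"><Unicode>new OCR</Unicode></TextEquiv>
</TextLine>"""

line = ET.fromstring(LINE)
print(read_first(line))             # old OCR
print(read_preferring_index(line))  # new OCR
```

Neither reader returns the manually corrected text in this case, which is the interoperability problem described here.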

(The reason for this behaviour is probably that it's easier to implement and "works" with PageViewer, and that the concrete spec language on that matter came too late... So either we change the spec or we fix the implementation now. @kba?)

Solution

To become interoperable with OCR-D, it would currently suffice to just insert the new TextEquiv elements in front of the existing ones (while keeping all the @index rules).
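A rough sketch of what "insert in front" could look like on the XML level, using only Python's standard library. This is not LAREX code; `make_textequiv` and `prepend_textequivs` are hypothetical helpers. "In front" is taken to mean in front of the existing TextEquiv siblings, not as first child of the TextLine, since the PAGE schema (as far as I understand it) expects Coords, Baseline and Word before TextEquiv.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}


def make_textequiv(text, index):
    """Build a TextEquiv element with the given @index and Unicode content."""
    te = ET.Element(f"{{{PAGE_NS}}}TextEquiv", {"index": str(index)})
    ET.SubElement(te, f"{{{PAGE_NS}}}Unicode").text = text
    return te


def prepend_textequivs(line, gt_text=None, prediction_text=None):
    """Insert GT (@index=0) and prediction (@index=1) before existing TextEquivs."""
    te_tag = f"{{{PAGE_NS}}}TextEquiv"
    # Elements that, to my understanding of the schema, must follow TextEquiv.
    trailing = {f"{{{PAGE_NS}}}{name}" for name in ("TextStyle", "UserDefined", "Labels")}
    children = list(line)
    pos = next((i for i, c in enumerate(children) if c.tag == te_tag), None)
    if pos is None:
        # No TextEquiv yet: place the new ones where TextEquiv would belong.
        pos = next((i for i, c in enumerate(children) if c.tag in trailing), len(children))
    new = []
    if gt_text is not None:
        new.append(make_textequiv(gt_text, 0))
    if prediction_text is not None:
        new.append(make_textequiv(prediction_text, 1))
    for offset, te in enumerate(new):
        line.insert(pos + offset, te)
    return line
```

With this ordering, a reader that simply takes the first TextEquiv sees the GT whenever one exists, while the @index values stay exactly as LAREX assigns them today.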

stefanCCS commented 2 years ago

Hi, I would just like to add an additional topic which, in my opinion, should be considered here: whatever is defined, there should be a clearly defined way to generate the corresponding ALTO file from the PAGE file. Ideally, this PAGE -> ALTO transformation would be defined as a one-to-one mapping, so that an ALTO -> PAGE transformation is possible as well. In the ALTO definition you can find, on Word level, a definition for "CONTENT" and "ALTERNATIVE". As you can see here [screenshot], this is not intended for the purpose discussed in this issue (at least as I have understood it, this is about alternative text variants from different OCR engines, including the "special" OCR engine "GroundTruth"). Instead, "VARIANT"s in "GLYPH"s are used to store this information. See [screenshot].

Reference: https://www.loc.gov/standards/alto/v4/alto-4-2.xsd

bertsky commented 2 years ago

@stefanCCS thanks for adding this important aspect!

In the ALTO definition you can find on Word level a definition for "CONTENT" and "ALTERNATIVE". This is not planned to use for the purpose which is discussed in this issue (at least I have understood that it is about alternative text variants due to different OCR engines including the "special" OCR engine "GroundTruth"). Instead "VARIANT"s in "GLYPH"s are used to store this information.

Yes, ALTO only has ALTERNATIVE on word level (StringType), which is indeed meant for other purposes than our concern here, and Variant on glyph level (GlyphType), which does look like our use-case but has no mechanism similar to PAGE's @index.

Since we are mostly concerned with line-level text content here (for which there is no representation at all), not word-level (for which we don't have the proper representation for alternatives), and much less glyph-level (for which we have everything), we must conclude that sadly there simply is no way to get a 1:1 mapping – without severely reinterpreting/abusing the ALTO representation. (For example, one could have one String per TextLine to represent the line level content, and use ALTERNATIVEs for the OCR vs GT. But that convention would always be in danger of being confused with an actual single-word line and actual alternative spellings.)

Our current PAGE→ALTO converter offers an option --dummy-word to preserve line-level text content and options --textequiv-index and --textequiv-fallback-strategy for full control of what TextEquiv to select for @CONTENT in the end.
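To illustrate the idea behind an index selector combined with a fallback strategy, here is a minimal sketch. It is not the converter's implementation: the function `select_textequiv` and the strategy names `"first"`, `"last"` and `"raise"` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}


def select_textequiv(line, index=0, fallback="first"):
    """Hypothetical selector: return the text of the TextEquiv with @index == index.

    If no TextEquiv carries that index, apply the (assumed) fallback strategy:
    "first" or "last" TextEquiv in document order, or "raise" a LookupError.
    Returns None when the line has no TextEquiv at all.
    """
    if fallback not in ("first", "last", "raise"):
        raise ValueError(f"unknown fallback strategy: {fallback!r}")
    tes = line.findall("pc:TextEquiv", NS)
    if not tes:
        return None
    for te in tes:
        if te.get("index") == str(index):
            return te.findtext("pc:Unicode", default="", namespaces=NS)
    if fallback == "raise":
        raise LookupError(f"no TextEquiv with index={index} in line {line.get('id')}")
    chosen = tes[0] if fallback == "first" else tes[-1]
    return chosen.findtext("pc:Unicode", default="", namespaces=NS)
```

The point of such a selector is that the caller, rather than the file's element order, decides which TextEquiv ends up as @CONTENT.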

In the other direction AFAIK we still rely on PRImA's page-converter CLI (which is based on prima-core-libs' parser for ALTO 2.1). That seems to project word-level text to line and region level, so dummy words at least would become TextLine/TextEquiv again. But nothing is done with ALTERNATIVE (or Variant) there.

bertsky commented 2 years ago

… without severely reinterpreting/abusing the ALTO representation. (For example …

(Another possibility, suggested by @stefanCCS privately, would be adding one alto:TextLine/alto:String/alto:Glyph per character of pc:TextLine/pc:TextEquiv/pc:Unicode – with pseudo-coordinates, since we usually do not have word or glyph segmentation available. These Glyphs could then carry Variants naturally. But while Variants of different Glyphs are usually independent of each other, here we would have to give them a special interpretation which prevents mixing/recombining local variants – like first glyph first variant with second glyph second variant. Again, there would always be the danger of being confused with actual Glyphs and actual local variants.)

maxnth commented 2 years ago

To become interoperable with OCR-D, it would currently suffice to just insert the new TextEquiv elements in front of the existing ones (while keeping all the @index rules).

Is the expected behavior that only completely new TextEquiv elements – as in 'no TextEquiv[@index="0"] element existed prior to adding it' – get inserted as first child or should this also apply when users edit the content of already existing TextEquiv[@index="0"]?

bertsky commented 2 years ago

Is the expected behavior that only completely new TextEquiv elements – as in 'no TextEquiv[@index="0"] element existed prior to adding it' – get inserted as first child or should this also apply when users edit the content of already existing TextEquiv[@index="0"]?

The former. (I don't know how LAREX behaves if multiple index0 versions preexist. But whatever index0 it picks up should be the one that OCR-D will see. Therefore, if index0 existed but was not first, LAREX should move it to the fore.)
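For the case where index0 exists but is not first, the move could be sketched as follows. Again this is only an illustration with a hypothetical helper name (`move_index0_to_front`), and it simply takes the first index0 it finds if several preexist, which is an assumption.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}


def move_index0_to_front(line):
    """Move the first TextEquiv[@index="0"] before all other TextEquivs.

    Returns True if the element was moved, False if nothing had to change.
    """
    tes = line.findall("pc:TextEquiv", NS)
    gt = next((te for te in tes if te.get("index") == "0"), None)
    if gt is None or tes[0] is gt:
        return False
    # Position of the current first TextEquiv among all children of the line.
    pos = next(i for i, child in enumerate(line) if child is tes[0])
    # gt sits after that position, so removing it does not shift `pos`.
    line.remove(gt)
    line.insert(pos, gt)
    return True
```

Only the element order changes; the @index attributes and all other TextEquivs are left untouched.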