altoxml / schema

ALTO XML schema - latest and all former versions

Add LANG and ROTATION attributes to Page element #55

Closed urieli closed 1 year ago

urieli commented 5 years ago

In many cases for printed text, all of the TextBlock elements in the page share the same LANG and ROTATION. It would be useful to be able to indicate this once at the Page element level, rather than having to indicate it for each TextBlock.

artunit commented 5 years ago

It might be useful to read through Issue 22. It would also be helpful to better understand the use case. ALTO attempts to capture the rotation and other characteristics of the elements of a page, but the page itself might be better suited for a higher-level encoding like METS or PAGE.

urieli commented 5 years ago

I read through Issue 22, but didn't see how it illuminates the current request.

The use case is: I'm currently writing a browser-based graphical ALTO editor for printed text (not manuscripts). The plan is to release the editor as open source as soon as it's stable. The purpose of the editor is to construct a training corpus for OCR software based on supervised machine learning.

It is far simpler to ask the user to select a default language for the page than to ask them to select a language for each text box. Similarly, given that in printed text the vast majority of pages share the same rotation for all elements, the editor currently only lets the user select a single rotation for the entire page, which rotates the background image.

When loading a page into the editor, it is of course possible to select a candidate default language by a majority vote of elements on the page, but it would be more natural and robust to directly assign the default language to the page, and only store exceptions in text boxes. Similarly, it would be more natural to store and retrieve the rotation at page level, rather than store it in each individual element.
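As an illustration of the majority-vote fallback described above, here is a minimal Python sketch. The sample document is simplified and not schema-valid (required geometry attributes are omitted), and the helper names `local_name` and `majority_language` are made up for this example; they are not part of any ALTO tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified sample: real ALTO blocks also carry geometry attributes.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1" LANG="fr"/>
        <TextBlock ID="b2" LANG="fr"/>
        <TextBlock ID="b3" LANG="en"/>
        <TextBlock ID="b4"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works for any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def majority_language(page):
    """Return the most frequent TextBlock LANG on the page, or None if no block has one."""
    counts = Counter(
        el.get("LANG")
        for el in page.iter()
        if local_name(el.tag) == "TextBlock" and el.get("LANG")
    )
    # Ties are resolved arbitrarily, which is one reason a guess is less robust
    # than an explicit page-level default.
    return counts.most_common(1)[0][0] if counts else None


root = ET.fromstring(SAMPLE_ALTO)
page = next(el for el in root.iter() if local_name(el.tag) == "Page")
default_lang = majority_language(page)
print(default_lang)  # fr
```

The vote has to be recomputed on every load and can flip when blocks are added or corrected, whereas a page-level attribute would store the choice directly.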

Moreover, the current schema already states, under StringType.LANG: The language should be recorded at the highest level possible.

I'm not sure why page-level characteristics would be better suited for another format such as METS or PAGE, when these characteristics are already stored at the block level. On the other hand, I am sure that trying to manage 2 different formats in the same editor would add considerable complexity, especially when these formats contain redundant information that needs to be reconciled, and when the purpose of the editor is to build training material for an OCR engine.

In brief, the proposal is to add:

<xsd:complexType name="PageType">
  <xsd:attribute name="ROTATION" type="xsd:float" use="optional">
    <xsd:annotation>
      <xsd:documentation>Default rotation for text or illustrations on this page. The value is in degree counterclockwise.</xsd:documentation>
    </xsd:annotation>
  </xsd:attribute>
  <xsd:attribute name="LANG" type="xsd:language" use="optional">
    <xsd:annotation>
      <xsd:documentation>Default language for text on this page.</xsd:documentation>
    </xsd:annotation>
  </xsd:attribute>
  ...

artunit commented 5 years ago

Sorry, I should have been clearer. Some of the examples attached to the issue, such as this, show why ROTATION is so important at the lower levels, but I see you are looking for a higher level default. A browser-based graphical ALTO editor as you have described sounds really cool. I will add a Discussion status to the issue, thanks for providing these details.

urieli commented 5 years ago

Follow up on discussion: it would also be useful to have a default Correction Status CS attribute at the page level, when the entire page has been corrected.

bertsky commented 5 years ago

Again, the same happened with PAGE-XML: on PageType level, we have @primaryLanguage (since 2016) and @orientation (since 2019).

I would like to add a use case / reasoning for ROTATION on the page level: skew is not always easy to estimate accurately, especially for small blocks with only (say) one line. Entropy-based algorithms can quickly overestimate, and produce very inconsistent results over a single page.

As to LANG though, I recommend using xsd:list itemType="xsd:language" instead, because in general (and especially on the page level) multiple languages will mix. One use case for that is selecting a mix of OCR models. (For example, one could use some external script detection to select models eng+fra with Tesseract.) This is also precisely what PAGE does (allowing a comma-delimited list).
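To make the model-selection use case concrete, here is a small Python sketch that turns a whitespace-separated list of language tags (the lexical form of an xsd:list) into a Tesseract-style `eng+fra` language argument. The function name `tesseract_language_arg` and the three-entry mapping are assumptions for illustration only; a real tool would need a complete mapping.

```python
# Hypothetical, deliberately tiny mapping from primary language subtags
# to Tesseract model names.
TESSERACT_MODELS = {"en": "eng", "fr": "fra", "de": "deu"}


def tesseract_language_arg(lang_list):
    """Convert e.g. 'en fr-FR' into 'eng+fra', keeping first-seen order."""
    models = []
    for tag in lang_list.split():  # xsd:list items are whitespace-separated
        primary = tag.split("-")[0].lower()  # 'fr-FR' -> 'fr'
        model = TESSERACT_MODELS.get(primary)
        if model is None:
            raise ValueError(f"no model known for language tag {tag!r}")
        if model not in models:  # skip duplicates such as 'en' and 'en-GB'
            models.append(model)
    return "+".join(models)


lang_arg = tesseract_language_arg("en fr-FR en-GB")
print(lang_arg)  # eng+fra
```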

bertsky commented 3 years ago

> This is also precisely what PAGE does (allowing a comma-delimited list).

Unfortunately, though, PAGE does not use ISO 639 codes directly, but a custom mapping of ISO 639. Cf. https://github.com/PRImA-Research-Lab/PAGE-XML/issues/27

cipriandinu commented 2 years ago

It would be important to first clarify the meaning of language at page level, since there are different use cases listed here. In general, in ALTO a specific attribute is set as high as possible in the hierarchy (e.g. at page level) as a default, and is then overridden on lower levels if needed. In this case, LANG on page level should define the default language for that page, and this should be a single language. The other use case is to use LANG as a top-level indicator of the possible languages found on that page (a "union" of all languages detected on the page). Both use cases have pros and cons, but I think we can't use both without creating confusion. For ROTATION, I think it is more meaningful to define the default at page level, but we then have to clarify what ROTATION means on text block level (relative to the default, or to 0).

urieli commented 2 years ago

I feel it's much clearer for Page element attributes to give default attribute values for all enclosed elements.

If any enclosed elements have the same attributes, these override the default values (for rotation, the child element has the new absolute rotation, not the rotation relative to the parent rotation).
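The default/override semantics described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page-level LANG and ROTATION attributes are the ones proposed in this issue, the sample omits the namespace and geometry attributes for brevity, a missing page ROTATION is assumed to mean 0, and `effective_attributes` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Simplified sample using the page-level attributes proposed in this issue.
SAMPLE_PAGE = """<Page ID="p1" LANG="fr" ROTATION="1.5">
  <PrintSpace>
    <TextBlock ID="b1"/>
    <TextBlock ID="b2" LANG="en"/>
    <TextBlock ID="b3" ROTATION="90"/>
  </PrintSpace>
</Page>"""


def effective_attributes(page):
    """Map each TextBlock ID to its effective (LANG, ROTATION) pair."""
    page_lang = page.get("LANG")
    page_rotation = float(page.get("ROTATION", "0"))  # assumption: absent means 0
    resolved = {}
    for block in page.iter("TextBlock"):
        resolved[block.get("ID")] = (
            # A block-level value replaces the page default outright.
            block.get("LANG", page_lang),
            # Absolute override: the block rotation is NOT added to the page rotation.
            float(block.get("ROTATION", page_rotation)),
        )
    return resolved


resolved = effective_attributes(ET.fromstring(SAMPLE_PAGE))
for block_id, (lang, rotation) in resolved.items():
    print(block_id, lang, rotation)
```

Block b3 resolves to a rotation of 90.0 rather than 91.5, which is the difference between absolute and relative semantics.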

bertsky commented 2 years ago

You are right: default/inheritance semantics is more useful (and general) than list/alternative semantics, and more consistent with existing spec.

cipriandinu commented 2 years ago

We might use two different attributes: LANG, meaning the default language of that page, and OTHERLANG, a list of all languages found on that page (this could be a solution for https://github.com/altoxml/schema/issues/66 too).

cipriandinu commented 2 years ago

The proposed solution, as discussed at the last meeting (for this issue and for #66), is at https://github.com/altoxml/schema/pull/77. I will put both topics to a vote as candidates for 4.4, and if the solution is approved, it will be merged into the master version.

cipriandinu commented 2 years ago

ACCEPT

cneud commented 2 years ago

ACCEPT

Haighton commented 2 years ago

ACCEPT

callylaw commented 2 years ago

ACCEPT

JLoitzenbauer-CRKN commented 2 years ago

ACCEPT

ntra00 commented 2 years ago

ACCEPT

Ra1phM commented 2 years ago

ACCEPT

hanyelsawy commented 1 year ago

ACCEPT

cowboyMontana commented 1 year ago

ACCEPT

cowboyMontana commented 1 year ago

ACCEPT

c-sebastien commented 1 year ago

ACCEPT

jukervin commented 1 year ago

ACCEPT

cowboyMontana commented 1 year ago

ACCEPT