altoxml / schema

ALTO XML schema - latest and all former versions

Schema documentation about use of <SP> needs clarification #54

Open cneud opened 5 years ago

cneud commented 5 years ago

See the discussion at https://github.com/UB-Mannheim/ocr-fileformat/issues/78

artunit commented 5 years ago

This issue generated considerable discussion at the 2018-11-29 Board Meeting. Although it is agreed that the schema does not strictly require the SP element, there is ambiguity about whether it is expected. ABBYY FineReader exports SP by default, and docWorks, which makes use of FineReader, produces SP elements. However, there are also many ALTO documents without SP, and ALTO files can balloon in size when SP is included.

One of the concerns identified in the current implementation of SP is the handling of different Unicode sequences for whitespace, like the Chinese ideographic space character. Another, related concern is when OCR is used as the basis of transcriptions. If a Fortran listing were OCRed, ALTO could not directly encode the 6 spaces before each statement. The SP element includes a WIDTH attribute, but a human reader would be expected to infer what it denotes in this case. There are also semantics in some non-Latin scripts where a word changes meaning based on spaces, and some languages have distinctive space variants, for example the use of the Zero-Width Non-Joiner (ZWNJ) in Persian; see this link that outlines the variations in spaces supported by Unicode. There was general agreement that adding an optional CONTENT attribute to the SP element would be useful, and by extension, open up the use of the Glyph element.
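
To make the proposal concrete, here is a rough Python sketch of how a consumer might rebuild the text of a TextLine if SP carried an optional CONTENT attribute. This is only an illustration under assumptions: CONTENT on SP is the proposed attribute and is not part of the current schema, `line_text` and `SAMPLE` are hypothetical names, and falling back to a single U+0020 space (both for an SP without CONTENT and for adjacent String elements with no SP between them) is one possible convention, not something the schema prescribes.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"

# Hypothetical sample: the first SP uses the proposed CONTENT attribute to
# record the six leading spaces of a Fortran statement explicitly.
SAMPLE = f"""<TextLine xmlns="{ALTO_NS}">
  <SP WIDTH="60" CONTENT="      "/>
  <String CONTENT="PRINT"/>
  <SP WIDTH="10"/>
  <String CONTENT="*,"/>
  <String CONTENT="X"/>
</TextLine>"""


def line_text(text_line, default_space=" "):
    """Rebuild the text of one TextLine from its String and SP children."""
    parts = []
    prev_was_string = False
    for child in text_line:
        tag = child.tag.split("}")[-1]  # strip the namespace, if any
        if tag == "String":
            # Assumption: two adjacent Strings with no SP between them
            # are separated by the default space.
            if prev_was_string:
                parts.append(default_space)
            parts.append(child.get("CONTENT", ""))
            prev_was_string = True
        elif tag == "SP":
            # Proposed attribute: use CONTENT when present, otherwise
            # fall back to the default space (WIDTH is not interpreted).
            parts.append(child.get("CONTENT", default_space))
            prev_was_string = False
    return "".join(parts)


print(repr(line_text(ET.fromstring(SAMPLE))))
# '      PRINT *, X'
```

With this convention, an ideographic space or any other Unicode space variant could be carried the same way, for example `<SP CONTENT="&#x3000;"/>`, while existing documents with a bare SP or with no SP at all would still produce readable text.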