altoxml / schema

ALTO XML schema - latest and all former versions
Handwritten documents and ALTO encoding - how to make ALTO more suitable for such documents - ideas #81

Open cipriandinu opened 2 years ago

cipriandinu commented 2 years ago

Handwritten documents are increasingly common in current projects. Although ALTO can already be used to describe the page layout and text of this type of material, I think there is still room for improvement. One recent change concerned the baseline definition, which was changed from a float value (the y coordinate of the line) to PointsType, because the baseline of handwritten text is not a straight line. There are probably many more issues related to this topic that we can discuss and improve.
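
For consumers that need to handle both forms of the baseline, a minimal Python sketch could look like the following. It assumes that a points value is written as coordinate pairs separated by spaces and/or commas; the function name `parse_baseline` is hypothetical and not part of any ALTO tooling.

```python
from typing import List, Tuple, Union

Point = Tuple[float, float]


def parse_baseline(value: str) -> Union[float, List[Point]]:
    """Parse a BASELINE attribute value.

    Returns a float for the older single-value form (y coordinate only)
    and a list of (x, y) points for the points form.
    Assumption: pairs are separated by spaces and/or commas.
    """
    tokens = value.replace(",", " ").split()
    if not tokens:
        raise ValueError("empty BASELINE value")
    if len(tokens) == 1:
        # Older form: one number, the y coordinate of a straight baseline.
        return float(tokens[0])
    if len(tokens) % 2 != 0:
        raise ValueError(f"odd number of coordinates: {value!r}")
    numbers = [float(t) for t in tokens]
    # Group the flat list into (x, y) pairs.
    return list(zip(numbers[0::2], numbers[1::2]))
```

For example, `parse_baseline("10,100 50,98 90,103")` yields three points describing a slightly curved baseline, while `parse_baseline("1234.5")` yields a single float.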

This topic is intended as a place to collect ideas for further discussion; from here we will pick the most important topics and create individual issues.

cipriandinu commented 1 year ago

I asked some people from Transkribus why they chose PAGE instead of ALTO, and what ALTO is missing to be a better format for the handwritten community. Here is the answer:

"As far as I remember, we chose PAGE as

From this I see one topic we may want to think about in the future (some of the features that were missing at one point in time have since been added, such as polyline baselines, polygonal shapes on all levels, etc.):

  1. Allow CONTENT on any level, without the need to go deeper into the structure when that is not needed (e.g. the full text of a line directly below the TextLine). The question is whether we keep the deeper structure mandatory for ALTO producers while making life easier for consumers, or make the details optional on every level (this could lead to a very simple ALTO containing just plain text as part of a single block...). This might be useful from a GT perspective; from the point of view of presentation systems it may not be useful at all.
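
To illustrate what a consumer currently has to do to obtain the full text of a line, here is a minimal Python sketch that rebuilds it from the String children of a TextLine. It assumes the ALTO v4 namespace, treats each SP element as a single space, and ignores HYP elements; the sample fragment and the function name `line_text` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Invented fragment; a real ALTO file carries coordinates and more attributes.
SAMPLE = """<TextLine xmlns="http://www.loc.gov/standards/alto/ns-v4#" ID="l1">
  <String CONTENT="Handwritten"/><SP/><String CONTENT="text"/>
</TextLine>"""


def line_text(text_line: ET.Element) -> str:
    """Rebuild the text of a TextLine from its String and SP children."""
    parts = []
    for child in text_line:
        # Strip the namespace so the sketch works for any ALTO version.
        tag = child.tag.rsplit("}", 1)[-1]
        if tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif tag == "SP":
            parts.append(" ")  # assumption: one SP equals one space
    return "".join(parts)


print(line_text(ET.fromstring(SAMPLE)))
```

With CONTENT allowed directly on the TextLine, a consumer interested only in line-level text could skip this reconstruction step.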

jukervin commented 1 year ago

Could different writers be recorded with Tags?

M3ssman commented 1 year ago

When working with Transkribus-SWT to generate GT, my colleagues and I ran into trouble several times because we forgot to synchronize text line and word contents. The major advantage (IMHO) of ALTO compared to PAGE is the single storage point for OCR content, especially when one aims to create GT at least on word level, as we do. Allowing content on text line level might also introduce problems with reading order when mixing RTL and LTR languages in the same line.
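
The kind of mismatch described above can be detected with a small check. The following Python sketch compares the text stored on each PAGE TextLine with the text joined from its Word children. It assumes that words are joined with a single space, which will not hold for every script or transcription convention; the sample document and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented PAGE-like fragment: line l2 was corrected, its words were not.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
<Page><TextRegion id="r1">
<TextLine id="l1">
  <Word id="w1"><TextEquiv><Unicode>old</Unicode></TextEquiv></Word>
  <Word id="w2"><TextEquiv><Unicode>letter</Unicode></TextEquiv></Word>
  <TextEquiv><Unicode>old letter</Unicode></TextEquiv>
</TextLine>
<TextLine id="l2">
  <Word id="w3"><TextEquiv><Unicode>dear</Unicode></TextEquiv></Word>
  <Word id="w4"><TextEquiv><Unicode>frend</Unicode></TextEquiv></Word>
  <TextEquiv><Unicode>dear friend</Unicode></TextEquiv>
</TextLine>
</TextRegion></Page></PcGts>"""


def local(tag: str) -> str:
    """Return the tag name without its namespace."""
    return tag.rsplit("}", 1)[-1]


def direct_text(elem: ET.Element):
    """Text of the TextEquiv/Unicode directly below elem, or None."""
    for child in elem:
        if local(child.tag) == "TextEquiv":
            for sub in child:
                if local(sub.tag) == "Unicode":
                    return sub.text or ""
    return None


def out_of_sync_lines(root: ET.Element):
    """List (line id, line text, joined word text) for mismatching lines."""
    mismatches = []
    for line in root.iter():
        if local(line.tag) != "TextLine":
            continue
        words = [w for w in line if local(w.tag) == "Word"]
        line_txt = direct_text(line)
        if not words or line_txt is None:
            continue  # nothing to compare
        joined = " ".join(direct_text(w) or "" for w in words)
        if joined != line_txt:
            mismatches.append((line.get("id"), line_txt, joined))
    return mismatches


print(out_of_sync_lines(ET.fromstring(SAMPLE)))
```

In ALTO as it stands today this check is unnecessary, because the text is stored only once, in the String elements.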