Open cipriandinu opened 2 years ago
I asked some people from Transkribus why they chose PAGE instead of ALTO, and what ALTO is missing to be a better format for the handwritten-text community. Here is the answer:
"As far as I remember, we chose PAGE as
From this I see one topic we may want to consider in the future (some of the features that were missing at one point, such as polyline baselines and polygonal shapes on all levels, have since been added):
Could recording different writers be done with Tags?
When working with Transkribus-SWT to generate GT, my colleagues and I ran into trouble several times because we forgot to synchronize text line and word contents. The major advantage (IMHO) of ALTO compared to PAGE is the single point of storage for OCR content, especially when one aims to create GT at least at word level, as we do. Allowing content at text line level might also introduce problems with reading order when mixing RTL and LTR languages in the same line.
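To illustrate the synchronization problem, here is a minimal sketch of a check that flags PAGE text lines whose own text differs from the text of their words. The function name `find_unsynced_lines` is hypothetical, and joining words with a single space is an assumption that will not hold for every script or transcription convention. It uses only the Python standard library (the `{*}` namespace wildcard needs Python 3.8 or later).

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def find_unsynced_lines(page_xml: str) -> List[Tuple[str, str, str]]:
    """Return (line id, line text, joined word text) for every mismatch.

    Hypothetical helper: assumes each TextLine and each Word stores its
    text in a direct TextEquiv/Unicode child, as PAGE files usually do.
    """
    root = ET.fromstring(page_xml)
    mismatches = []
    # "{*}" matches any namespace, so the PAGE schema version does not matter.
    for line in root.findall(".//{*}TextLine"):
        line_text = line.findtext("{*}TextEquiv/{*}Unicode", default="") or ""
        words = [
            word.findtext("{*}TextEquiv/{*}Unicode", default="") or ""
            for word in line.findall("{*}Word")
        ]
        if not words:
            # Nothing to compare when the line has no word-level content.
            continue
        joined = " ".join(words)  # assumption: words are separated by one space
        if joined != line_text:
            mismatches.append((line.get("id", ""), line_text, joined))
    return mismatches
```

With a single point of storage for the text, as in ALTO, this kind of check is not needed, because there is no second copy that can drift.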
Handwritten documents are increasingly common in current projects, and even though ALTO can already be used to describe page layout and text information for this type of material, I think there is still room for improvement. One recent change concerned the baseline definition, which was changed from a float value (the y coordinate of the line) to PointsType, since the baseline of handwritten text is not a straight line. There are probably many more issues related to this topic that we can discuss and improve.
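For consumers that have to read both older and newer files, the baseline change can be handled with something like the sketch below. It assumes the points value is a list of x,y pairs separated by commas and/or whitespace; the exact lexical form should be checked against the ALTO schema version in use. The name `parse_baseline` is hypothetical.

```python
import re
from typing import List, Tuple, Union

Point = Tuple[float, float]


def parse_baseline(value: str) -> Union[float, List[Point]]:
    """Parse a baseline value in either the old or the new form.

    A single number is treated as the old float form (y coordinate of a
    horizontal baseline). Anything else is treated as a polyline given as
    x,y pairs (assumed separators: commas and/or whitespace).
    """
    numbers = [float(tok) for tok in re.split(r"[,\s]+", value.strip()) if tok]
    if len(numbers) == 1:
        return numbers[0]
    # A polyline needs at least two complete points.
    if len(numbers) < 4 or len(numbers) % 2 != 0:
        raise ValueError(f"cannot interpret baseline value: {value!r}")
    return list(zip(numbers[0::2], numbers[1::2]))
```

For example, `parse_baseline("512.5")` gives the float `512.5`, while `parse_baseline("10,100 200,104 400,98")` gives three (x, y) points.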
This topic is intended as a place to collect ideas for further discussion. From here we will pick out the most important topics and create individual issues for them.