Open jbaiter opened 4 years ago
Here is a one-page example for a BSB Hebrew manuscript in ALTO 4.1 with polylines and polygon regions: 1p_export_m117_mekhilta_new_alto_202102111507.zip. IIIF: https://api.digitale-sammlungen.de/iiif/presentation/v2/bsb00084914/manifest
I created a bare-bones manifest with the ALTO here (CORS-enabled): https://rawgit.com/jbaiter/d5fcd5f72349a6ad19ccb8f6e9e7d9db/raw/9ec3c5655d3749a4461103b49f06b674f6a7c440/bsb00084914.json
Looks like right-to-left reading order works out of the box, so one thing less to worry about!
One problem is the rendering of directionally ambiguous Unicode bidi code points (the glyphs in brackets).
fantastic!
ALTO supports a @BASELINE attribute that can define a polyline on which the text rests. hOCR also includes support for this information. These values could be used for a more accurate estimation of the font size and position used for rendering the SVG.
Unfortunately I don't have access to any samples of OCR data with this information at the moment.
The test fixtures now include an hOCR file (generated by Tesseract 4) that has baseline information. Since both hOCR and ALTO define baselines as polynomials, an hOCR-based implementation should work with ALTO with (hopefully) minimal modifications.
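As a rough illustration of how the hOCR baseline could feed into text positioning, here is a minimal Python sketch. It assumes the Tesseract-style `baseline <slope> <offset>` property inside the line's `title` attribute, read as polynomial coefficients (highest degree first) relative to the bottom-left corner of the line's bounding box. The function names `parse_hocr_title` and `baseline_y` are made up for this example and are not part of this project.

```python
def parse_hocr_title(title):
    """Parse an hOCR `title` attribute into (bbox, baseline coefficients).

    Assumes semicolon-separated properties such as
    "bbox 100 200 500 260; baseline 0.01 -18".
    A missing baseline falls back to [0.0], i.e. the bottom of the bbox.
    """
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    bbox = tuple(float(v) for v in props["bbox"])
    coeffs = [float(v) for v in props.get("baseline", ["0"])]
    return bbox, coeffs


def baseline_y(bbox, coeffs, x):
    """Return the absolute y coordinate of the baseline at absolute x.

    The polynomial is evaluated with Horner's scheme on the horizontal
    distance from the left edge of the bbox. The result is an offset from
    the bottom edge (y grows downward, so offsets are usually negative).
    """
    x0, _y0, _x1, y1 = bbox
    dx = x - x0
    rel = 0.0
    for c in coeffs:
        rel = rel * dx + c
    return y1 + rel


bbox, coeffs = parse_hocr_title("bbox 100 200 500 260; baseline 0.01 -18")
left_y = baseline_y(bbox, coeffs, 100)   # baseline at the left edge
right_y = baseline_y(bbox, coeffs, 500)  # baseline at the right edge
```

The two y values could then serve as the start and end of the path the SVG text is placed on. If an ALTO baseline is given as a list of points rather than coefficients, only the parsing step would need to change.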