dbmdz / mirador-textoverlay

Text Overlay plugin for Mirador 3
https://mirador-textoverlay.netlify.com/
MIT License

Use baseline information for improved text rendering #4

Open jbaiter opened 4 years ago

jbaiter commented 4 years ago

ALTO supports a @BASELINE attribute that can define a polyline on which the text rests. hOCR also includes support for this information. These values could be used for a more accurate estimation of the font size and position used for rendering the SVG.
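To make the polyline idea concrete, here is a minimal sketch of how such a value might be read and sampled. It is not the plugin's code: the helper names `parseAltoBaseline` and `altoBaselineYAt` are hypothetical. It also assumes the attribute holds either a single number (a constant y) or a list of x/y pairs separated by whitespace and/or commas, which should be checked against the ALTO schema version in use.

```typescript
type Point = { x: number; y: number };

// Hypothetical helper: parse an ALTO BASELINE attribute value.
// Assumption: the value is either one number (constant y) or a list of
// coordinate pairs separated by whitespace and/or commas.
function parseAltoBaseline(value: string): number | Point[] {
  const trimmed = value.trim();
  if (trimmed === "") {
    throw new Error("Empty BASELINE value");
  }
  const nums = trimmed.split(/[\s,]+/).map(Number);
  if (nums.some((n) => Number.isNaN(n))) {
    throw new Error("Unparseable BASELINE value: " + value);
  }
  if (nums.length === 1) {
    return nums[0];
  }
  if (nums.length % 2 !== 0) {
    throw new Error("Odd number of coordinates in BASELINE: " + value);
  }
  const points: Point[] = [];
  for (let i = 0; i < nums.length; i += 2) {
    points.push({ x: nums[i], y: nums[i + 1] });
  }
  return points;
}

// Hypothetical helper: y position of the baseline at a given x, using linear
// interpolation between polyline points and clamping outside their x range.
function altoBaselineYAt(baseline: number | Point[], x: number): number {
  if (typeof baseline === "number") {
    return baseline;
  }
  // Sort a copy by x so that right-to-left point order is handled as well.
  const pts = baseline.slice().sort((a, b) => a.x - b.x);
  if (x <= pts[0].x) {
    return pts[0].y;
  }
  for (let i = 1; i < pts.length; i++) {
    const a = pts[i - 1];
    const b = pts[i];
    if (x <= b.x) {
      if (b.x === a.x) {
        return b.y;
      }
      const t = (x - a.x) / (b.x - a.x);
      return a.y + t * (b.y - a.y);
    }
  }
  return pts[pts.length - 1].y;
}
```

With the value `"100,200 300,220"`, the interpolated baseline at x = 200 is y = 210. A renderer could sample it at each word's horizontal position to place the SVG text.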

Unfortunately I don't have access to any samples of OCR data with this information at the moment.

The test fixtures now include an hOCR file (generated by Tesseract 4) that has baseline information. Since both hOCR and ALTO define baselines as polynomials, an hOCR-based implementation should work with ALTO with (hopefully) minimal modifications.
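As a rough illustration of the hOCR side, the sketch below reads `bbox` and `baseline` from a line's `title` attribute and evaluates the baseline at a given x. It is not the plugin's implementation: the names `parseHocrLineTitle` and `hocrBaselineYAt` are hypothetical. It assumes the coefficients are listed highest order first and describe an offset relative to the bottom-left corner of the bounding box, which is my reading of the hOCR spec.

```typescript
interface HocrLine {
  bbox: [number, number, number, number]; // x0, y0, x1, y1 in image pixels
  baseline: number[]; // polynomial coefficients, highest order first
}

// Hypothetical helper: extract bbox and baseline from an hOCR title string
// such as "bbox 36 92 580 122; baseline 0.015 -18; x_size 30".
function parseHocrLineTitle(title: string): HocrLine {
  let bbox: [number, number, number, number] | undefined;
  let baseline: number[] = [];
  for (const prop of title.split(";")) {
    const [name, ...args] = prop.trim().split(/\s+/);
    const nums = args.map(Number);
    if (name === "bbox" && nums.length === 4) {
      bbox = [nums[0], nums[1], nums[2], nums[3]];
    } else if (name === "baseline") {
      baseline = nums;
    }
  }
  if (!bbox) {
    throw new Error("hOCR title has no usable bbox: " + title);
  }
  return { bbox, baseline };
}

// Hypothetical helper: absolute y of the baseline at absolute x.
// Assumption: the polynomial is evaluated at (x - x0) and added to the
// bottom edge of the bbox (y1); image y coordinates grow downwards.
function hocrBaselineYAt(line: HocrLine, x: number): number {
  const x0 = line.bbox[0];
  const y1 = line.bbox[3];
  const t = x - x0;
  // Horner's scheme; an empty coefficient list yields 0, i.e. the bbox bottom.
  const offset = line.baseline.reduce((acc, c) => acc * t + c, 0);
  return y1 + offset;
}
```

For a Tesseract-style value like `baseline 0.015 -18`, this gives a line that starts 18 pixels above the bottom of the bounding box and slopes slightly downwards towards the right. A line without baseline information falls back to the bottom edge of its bounding box.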

dstoekl commented 3 years ago

Here is a one-page example for a BSB Hebrew manuscript in ALTO 4.1 with polylines and polygon regions: 1p_export_m117_mekhilta_new_alto_202102111507.zip
IIIF: https://api.digitale-sammlungen.de/iiif/presentation/v2/bsb00084914/manifest

jbaiter commented 3 years ago

I created a bare-bones manifest with the ALTO here (CORS-enabled): https://rawgit.com/jbaiter/d5fcd5f72349a6ad19ccb8f6e9e7d9db/raw/9ec3c5655d3749a4461103b49f06b674f6a7c440/bsb00084914.json

Looks like right-to-left reading order works out of the box, so that's one less thing to worry about!

One remaining problem is the rendering of directionally ambiguous (neutral) Unicode bidi code points, i.e. the glyphs in brackets.

dstoekl commented 3 years ago

fantastic!