UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License
176 stars, 23 forks

Add XSLT for transformation from PAGEXML to text #91

Closed by zuphilip 4 years ago

zuphilip commented 5 years ago

I tested the XSLT file against https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/documentation/example/SimplePage.xml and https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/pagecontent/examples/aletheiaexamplepage.xml, but only outside this repo. Is it enough to copy the file into this directory, or do the Makefile etc. need to be adjusted as well? CC @kba

BTW, I used TextRegion and TextLine instead of TextEquiv, because the result looked better to me (this differs from the approach in https://github.com/cneud/page-to-text/blob/master/page_to_text.py by @cneud).
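For readers who want to see the region-and-line traversal in concrete terms, here is a minimal Python sketch of the same idea. It is not the XSLT from this PR and not the code from page_to_text.py. The function names (`local`, `children`, `line_text`, `page_to_text`) and the `SAMPLE` document are made up for illustration. The sketch assumes each TextLine carries its text in a TextEquiv/Unicode child, and it ignores the XML namespace so that it does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE-like sample (not one of the files named above).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>world</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextLine id="l3"><TextEquiv><Unicode>Second region</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def children(elem, name):
    """Direct children of elem whose local tag name equals name."""
    return [child for child in elem if local(child.tag) == name]


def line_text(line):
    """Text of the first TextEquiv/Unicode directly under a TextLine, or ''."""
    for equiv in children(line, "TextEquiv"):
        for uni in children(equiv, "Unicode"):
            return uni.text or ""
    return ""


def page_to_text(xml_string):
    """One text line per TextLine, with a blank line between TextRegions."""
    root = ET.fromstring(xml_string)
    regions = []
    for elem in root.iter():
        if local(elem.tag) != "TextRegion":
            continue
        # Only TextLine children are read; the region-level TextEquiv is
        # skipped, so the text is not emitted twice.
        lines = [line_text(line) for line in children(elem, "TextLine")]
        regions.append("\n".join(lines))
    return "\n\n".join(regions)
```

Calling `page_to_text(SAMPLE)` yields the two lines of the first region, a blank line, then the line of the second region. Walking TextRegion and TextLine keeps the line and paragraph breaks explicit, whereas collecting every TextEquiv in the document would also pick up the region-level copy of the text.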