UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

TEI support? #12

Open zuphilip opened 8 years ago

zuphilip commented 8 years ago

Question from the workshop: Can we also add transformation to/from TEI?

My first impression was that the TEI format is normally used in different applications. But I learned that it is also possible to add x-y coordinates of boxes in TEI. I haven't looked deeper into whether this is a suitable feature request...
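To illustrate the coordinate point: TEI can record boxes as `<zone>` elements inside `<facsimile>`, using the `ulx`, `uly`, `lrx` and `lry` attributes. Below is a minimal Python sketch that reads such boxes. The sample document and the `read_zones` helper are made up for illustration, and real TEI files may encode coordinates differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample; not taken from any project mentioned in this thread.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface ulx="0" uly="0" lrx="2000" lry="3000">
      <zone xml:id="w1" ulx="100" uly="200" lrx="300" lry="250"/>
      <zone xml:id="w2" ulx="320" uly="200" lrx="500" lry="250"/>
    </surface>
  </facsimile>
</TEI>"""


def read_zones(tei_xml):
    """Return (xml:id, (ulx, uly, lrx, lry)) for every <zone> in document order."""
    root = ET.fromstring(tei_xml)
    zones = []
    for zone in root.iterfind(".//tei:zone", TEI_NS):
        # Assumes all four corner attributes are present and integer-valued.
        box = tuple(int(zone.get(k)) for k in ("ulx", "uly", "lrx", "lry"))
        zones.append((zone.get(XML_ID), box))
    return zones


for zone_id, box in read_zones(SAMPLE):
    print(zone_id, box)
```

Boxes in this form map fairly directly onto the position attributes used by ALTO or hOCR, which is what would make a transformation plausible.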

I found an ALTO2TEI XSLT here: https://github.com/collex/typewright/blob/master/lib/saxon/AltoToTeiA.xsl (some fields are hardcoded for this project, and they mention another stylesheet on which they based theirs).

zuphilip commented 8 years ago

Also http://able.myspecies.info/abbyy-xml-tei-xml (looks a little special at first glance...)

kba commented 8 years ago

TEI is quite a big standard, lots of different flavors, so there are probably a lot of ways to implement it.

stweil commented 8 years ago

It depends on what you want to achieve. If the primary goal is the transformation from TEI to ALTO for use in the DFG viewer, that reduces the complexity a lot because much data can simply be ignored.

zuphilip commented 8 years ago

We don't have any use case for this at the moment. Maybe we can just leave the issue open here and collect more information and any existing implementations whose code we could reuse. BTW, I don't think the technical implementation would be difficult; the hard part is reading and understanding the format descriptions and testing with good examples.

There are a lot of transformation tools for TEI here: https://github.com/TEIC/Stylesheets but neither ALTO nor ABBYY is among them.

kba commented 8 years ago

Yes, let's keep this open and target the DFG Viewer; that seems feasible.

kba commented 8 years ago

Here's another ALTO to TEI XSL: https://github.com/emory-libraries/readux/blob/master/readux/books/ocr_to_teifacsimile.xsl

cneud commented 8 years ago

See also this service which can convert various formats including ALTO to TEI: https://github.com/INL/OpenConvert

kba commented 8 years ago

Of interest: https://github.com/TEIC/Hackathon/blob/master/DH2015/xsl/hocr2tei.xsl

kba commented 5 years ago

PAGE2TEI https://github.com/dariok/page2tei

zuphilip commented 5 years ago

Thank you @kba, that looks interesting as well! Let me know if anyone wants to work on integrating any of these transformations into ocr-fileformat.

stweil commented 5 years ago

zuphilip wrote: "We don't have any use case for this at the moment."

Now we have a use case. We must convert 64833 TEI files (like this one) to ALTO for Kitodo Presentation / DFG Viewer.

jmechnich commented 5 years ago

A first attempt at writing an XSLT can be found here. Although it produces valid hOCR, the subsequent transformation to ALTO is not successful (most likely due to the lack of ocr_line in the hOCR file). I guess it would be possible to extract the document's line structure from jumps in the top-left coordinate of the words in a paragraph, but I don't see an easy way to do this in XSLT. So maybe there will be a Python script eventually...
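The line-detection idea could be sketched in Python roughly as follows. This is only a heuristic under stated assumptions: the words arrive in reading order, each word has the top-left corner of its box, and a new line starts when the top coordinate jumps by more than a tolerance or the left coordinate moves backwards. The `group_words_into_lines` name, the `(text, left, top)` tuple layout and the default `tolerance` are hypothetical choices, not part of any existing tool.

```python
from typing import List, Tuple

Word = Tuple[str, int, int]  # (text, left, top) of the word's bounding box


def group_words_into_lines(words: List[Word], tolerance: int = 10) -> List[List[str]]:
    """Split a paragraph's words into lines based on coordinate jumps.

    Assumes `words` is in reading order. `tolerance` (in pixels) is a
    made-up default and would need tuning for real scans.
    """
    lines: List[List[str]] = []
    prev_left = prev_top = None
    for text, left, top in words:
        starts_new_line = (
            prev_top is None                      # first word of the paragraph
            or abs(top - prev_top) > tolerance    # vertical jump
            or left < prev_left                   # horizontal reset to line start
        )
        if starts_new_line:
            lines.append([])
        lines[-1].append(text)
        prev_left, prev_top = left, top
    return lines


words = [
    ("Hello", 100, 200),
    ("world", 180, 202),
    ("next", 100, 240),
    ("line", 160, 238),
]
print(group_words_into_lines(words))
```

Each resulting group could then be wrapped in an ocr_line element whose box is the union of its word boxes, which is what the hOCR to ALTO step appears to expect.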

zuphilip commented 5 years ago

Nice! @jmechnich Can you create a PR? Then it is easier to discuss this further. But I am quite happy with such an XSLT transformation, even when there are no ocr_lines (AFAIK they are also missing in your TEI file).

jmechnich commented 4 months ago

Several years later... 😏

Hi all, is this still an open issue, given that the PR has been merged without further discussion?