UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

ABBYY2Alto #15

Open zuphilip opened 8 years ago

zuphilip commented 8 years ago

https://github.com/ironymark/AbbyyToAlto: a transformation written in PHP, licensed under GPL v3.

kba commented 8 years ago

Yes, I've seen it but I very much prefer a declarative transformation in XSLT that has no possible side effects and is easier to test. Maybe we can convert it to XSLT?

zuphilip commented 8 years ago

Yes, it would be preferable to use XSLT for the transformation.

zuphilip commented 4 years ago

There is also a newer implementation with Java (+Maven): https://github.com/Mewel/abbyy-to-alto

kba commented 4 years ago

How does that compare with https://github.com/PRImA-Research-Lab/prima-page-converter ? @maxnth

stweil commented 2 years ago

> There is also a newer implementation with Java (+Maven): https://github.com/Mewel/abbyy-to-alto

That source code includes at least one copyrighted ~xsl~ file.

kba commented 2 years ago

> > There is also a newer implementation with Java (+Maven): https://github.com/Mewel/abbyy-to-alto
>
> That source code includes at least one copyrighted xsl file.

It does? I only saw that they include the copyrighted schema for Abbyy 10. We could ask ABBYY for a license to redistribute or omit that file and use the make vendor mechanism.

mikegerber commented 2 years ago

> How does that compare with https://github.com/PRImA-Research-Lab/prima-page-converter @maxnth

I had problems with prima-page-converter (going to open a bug report), while Mewel/abbyy-to-alto worked right away.

stweil commented 2 years ago

> they include the copyrighted schema for Abbyy 10

Yes, sorry, that was the one which I meant.

mikegerber commented 2 years ago

> I had problems with prima-page-converter (going to open a bug report),

https://github.com/PRImA-Research-Lab/prima-page-viewer/issues/24 - I opened the issue against prima-page-viewer as it is affected, too.

mikegerber commented 2 years ago

> while Mewel/abbyy-to-alto worked right away.

Sort of - it does not produce Processing tags (or the ALTO v2 equivalent), so it is lacking too.
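A check for this gap could be sketched roughly as follows. This is a minimal illustration, not code from any of the converters discussed here: the helper names are hypothetical, and it assumes that the ALTO v2 equivalent of `Processing` is the `OCRProcessing` element.

```python
import xml.etree.ElementTree as ET

# Element names assumed to carry processing metadata:
# "Processing" (newer ALTO) and "OCRProcessing" (assumed ALTO v2 equivalent).
PROCESSING_TAGS = {"Processing", "OCRProcessing"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def has_processing_metadata(alto_xml: str) -> bool:
    """Return True if the ALTO document contains a processing element."""
    root = ET.fromstring(alto_xml)
    return any(local_name(el.tag) in PROCESSING_TAGS for el in root.iter())


if __name__ == "__main__":
    # Hand-written sample without any processing element (not converter output).
    sample = (
        '<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">'
        "<Description><MeasurementUnit>pixel</MeasurementUnit></Description>"
        "<Layout/></alto>"
    )
    print(has_processing_metadata(sample))
```

Matching on the local name rather than a fixed namespace keeps the check usable across ALTO versions, at the cost of not validating the document against any schema.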

mikegerber commented 2 years ago

> > > There is also a newer implementation with Java (+Maven): https://github.com/Mewel/abbyy-to-alto
> >
> > That source code includes at least one copyrighted xsl file.
>
> It does? I only saw that they include the copyrighted schema for Abbyy 10. We could ask ABBYY for a license to redistribute or omit that file and use the make vendor mechanism.

I'd also like to point out that prima-page-converter has a similar problem: the PrimaText library is not open source (see https://github.com/PRImA-Research-Lab/prima-page-converter/issues/17#issuecomment-769817720).

stweil commented 2 years ago

Somewhat related: I just found a converter from ABBYY to hOCR made by the Internet Archive. I have not tested it myself yet.

mikegerber commented 2 years ago

> > while Mewel/abbyy-to-alto worked right away.
>
> Sort of - it does not produce Processing tags (or the ALTO v2 equivalent), so it is lacking too.

I've added that in https://github.com/Mewel/abbyy-to-alto/pull/16.