UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Update Saxon-HE #144

stweil closed 2 years ago

stweil commented 2 years ago

The new stable version 10 of Saxon-HE no longer has a patch version and uses a different naming convention for the .jar file. Therefore, ocr-fileformat now always uses a symbolic link, saxon.jar, that points to the installed .jar file.
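The symbolic-link approach can be illustrated with a minimal Python sketch. It is not the code from this pull request: the function name `link_saxon_jar` is hypothetical, and the two jar naming patterns it accepts (`saxon9he.jar` for the 9.x series, `saxon-he-<version>.jar` for 10 and later) are assumptions about how the downloaded archives name their main jar.

```python
import re
from pathlib import Path

# Assumed naming patterns for the main Saxon-HE jar (illustrative only):
# "saxon9he.jar" for 9.x and "saxon-he-10.8.jar" style names for 10+.
MAIN_JAR = re.compile(r"^(saxon9he|saxon-he-\d+(\.\d+)*)\.jar$")


def link_saxon_jar(install_dir: str, link_name: str = "saxon.jar") -> Path:
    """Create a stable symlink that points to the installed Saxon-HE jar."""
    directory = Path(install_dir)
    # Only real files whose names match the assumed patterns are candidates;
    # auxiliary jars (tests, XQJ) and the link itself are skipped.
    candidates = sorted(
        p for p in directory.iterdir()
        if p.is_file() and not p.is_symlink() and MAIN_JAR.match(p.name)
    )
    if not candidates:
        raise FileNotFoundError(f"no Saxon-HE jar found in {directory}")
    # Plain lexicographic choice; this sketch is not version-aware.
    target = candidates[-1]
    link = directory / link_name
    # Replace a stale link left over from an earlier installation.
    if link.is_symlink() or link.exists():
        link.unlink()
    # A relative target keeps the link valid if the directory is moved.
    link.symlink_to(target.name)
    return link
```

Callers can then always run `java -jar <install_dir>/saxon.jar` regardless of which Saxon-HE release is installed, which is the point of the indirection.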

Also make the round-trip test a little more stable by running tesseract with an explicit path. The round-trip test shows some differences between the initial and the final ALTO file, but those differences look acceptable.
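The two ideas in this comment, calling a tool by an explicit path and inspecting the differences between the initial and the final file, could be sketched in Python as follows. This is an illustration under stated assumptions, not the project's test script: the helper names `resolve_executable` and `changed_lines` are hypothetical, and the comparison is a plain line-based diff rather than an ALTO-aware one.

```python
import difflib
import shutil
from pathlib import Path
from typing import List, Optional


def resolve_executable(name: str, explicit_path: Optional[str] = None) -> str:
    """Return an absolute path to an executable, preferring an explicit path."""
    if explicit_path is not None:
        path = Path(explicit_path)
        if not path.is_file():
            raise FileNotFoundError(f"{explicit_path} does not exist")
        return str(path.resolve())
    # Fall back to a PATH lookup, but still return an absolute path
    # so that later calls do not depend on the environment.
    found = shutil.which(name)
    if found is None:
        raise FileNotFoundError(f"{name} not found on PATH")
    return str(Path(found).resolve())


def changed_lines(before: str, after: str) -> List[str]:
    """List the removed ('- ') and added ('+ ') lines between two texts."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]
```

A round-trip check could pass the result of `resolve_executable("tesseract", ...)` to `subprocess.run` and then review the output of `changed_lines` for the initial and final ALTO files to decide whether the remaining differences are acceptable.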

stweil commented 2 years ago

See related issue #124.

stweil commented 2 years ago

Meanwhile, a newer version 10.8 has been released (2022-03-15), so I updated the pull request. Can we merge?

There is also a new major release 11:

"Saxon 11.3 is the latest release for production use; however for critical applications using SaxonJ we advise that the Saxon 10 branch remains the most stable release for the time being."

Should we skip release 10 and go directly from 9 to 11?