UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Update SaxonHE to version 11.2 #149

stweil closed this 1 year ago

stweil commented 2 years ago

SaxonHE 11 needs more JAR files from the zip archive than earlier versions did, so simply extract every file the archive contains.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil commented 2 years ago

SaxonHE 11 can also be used from Python through the saxonc module, as in this example:

#!/usr/bin/python3

from saxonc import PySaxonProcessor

stylesheet_file = 'alto__hocr.xsl'
source_file = 'wetzel_reisebegleiter_1901_0021.alto'

# license=False selects the free Home Edition (HE) processor.
with PySaxonProcessor(license=False) as proc:
    xsltproc = proc.new_xslt30_processor()
    # Apply the stylesheet to the ALTO file and get the result as a string.
    output = xsltproc.transform_to_string(source_file=source_file, stylesheet_file=stylesheet_file)
    print(output)
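
To keep the result instead of printing it, the example can be wrapped in a small helper. This is only a sketch under assumptions: the names `output_path_for` and `convert_file` are hypothetical, the output extension is guessed from the `from__to` pattern in the stylesheet name, and the `None` check is a defensive guess about how a failed transformation shows up in saxonc 11.

```python
from pathlib import Path


def output_path_for(source_file: str, stylesheet_file: str) -> Path:
    """Derive the output path from a 'from__to.xsl' stylesheet name (assumed convention)."""
    target_format = Path(stylesheet_file).stem.split("__")[-1]
    return Path(source_file).with_suffix("." + target_format)


def convert_file(source_file: str, stylesheet_file: str) -> Path:
    """Transform source_file with stylesheet_file and write the result next to the source."""
    # Imported lazily so output_path_for() stays usable without saxonc installed.
    from saxonc import PySaxonProcessor

    with PySaxonProcessor(license=False) as proc:
        xsltproc = proc.new_xslt30_processor()
        result = xsltproc.transform_to_string(
            source_file=source_file, stylesheet_file=stylesheet_file
        )
    # Assumption: a failed transformation yields None rather than a string.
    if result is None:
        raise RuntimeError(f"Transformation of {source_file} produced no output")
    out = output_path_for(source_file, stylesheet_file)
    out.write_text(result, encoding="utf-8")
    return out
```

With the files from the example above, `convert_file('wetzel_reisebegleiter_1901_0021.alto', 'alto__hocr.xsl')` would write its result to `wetzel_reisebegleiter_1901_0021.hocr`.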

saxonc was installed using instructions from https://saxonica.com/html/download/c.html and from https://www.saxonica.com/saxon-c/documentation11/index.html#!starting/installingpython like this:

wget https://saxonica.com/download/libsaxon-HEC-setup64-v11.3.zip
unzip libsaxon-HEC-setup64-v11.3.zip
cd libsaxon-HEC-11.3/
mkdir lib
cp -al libsaxonhec.so rt saxon-data lib
export SAXONC_HOME=$PWD/lib
python3.9 -m venv ../venv3.9
source ../venv3.9/bin/activate
pip install -U pip setuptools wheel
pip install cython
cd Saxon.C.API/python-saxon/
python3 saxon-setup.py build_ext -if
export PYTHONPATH=$PWD
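
A quick way to check that the steps above took effect is a small script like the following. It is a rough sketch, not part of SaxonC: `check_saxonc_setup` is a hypothetical helper, and the list of expected entries is simply taken from the `cp -al` command above.

```python
import importlib.util
import os
from pathlib import Path

# Entries that the `cp -al` step above places into the lib directory.
EXPECTED_ENTRIES = ("libsaxonhec.so", "rt", "saxon-data")


def check_saxonc_setup(environ=os.environ):
    """Return a list of problems found; an empty list means the setup looks fine."""
    problems = []
    home = environ.get("SAXONC_HOME")
    if not home:
        problems.append("SAXONC_HOME is not set")
    else:
        for name in EXPECTED_ENTRIES:
            if not (Path(home) / name).exists():
                problems.append(f"{name} is missing from {home}")
    # find_spec only locates the module; it does not load the native library.
    if importlib.util.find_spec("saxonc") is None:
        problems.append("saxonc is not importable; check PYTHONPATH")
    return problems


if __name__ == "__main__":
    for problem in check_saxonc_setup():
        print("WARNING:", problem)
```

Run it in the same shell session in which `SAXONC_HOME` and `PYTHONPATH` were exported; it prints nothing if no problem is found.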