UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Add textract2page #160

Closed bertsky closed 1 year ago

bertsky commented 1 year ago

In contrast to all existing transformations, https://github.com/slub/textract2page MUST know the image file, so I also tried to make it easier for the user to know what script-args are possible/expected:

example calls for `--help-args`

```
> ocr-transform hocr page --help-args
Usage: see http://www.saxonica.com/documentation/index.html#!using-xsl/commandline
Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -jit -l -lib -license -m -nogo -now -ns -o -opt -or -outval -p -quit -r -relocate -repeat -s -sa -scmin -strip -t -T -target -TB -threads -TJ -Tlevel -Tout -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -y --?
Use -XYZ:? for details of option XYZ
Params:
  param=value           Set stylesheet string parameter
  +param=filename       Set stylesheet document parameter
  ?param=expression     Set stylesheet parameter using XPath
  !param=value          Set serialization parameter

> ocr-transform gcv hocr --help-args
Extra arguments:

> ocr-transform page alto --help-args
page-to-alto options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
        Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
        Choose version of ALTO-XML schema to produce (older versions may not
        preserve all features)
  --check-words / --no-check-words
        Check whether PAGE-XML contains any Words and fail if not
  --check-border / --no-check-border
        Check whether PAGE-XML contains Border or PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
        Whether to omit or keep empty lines in PAGE-XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
        Whether to add a <HYP/> element if the last word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
        Whether to create a TextLine for regions that have TextEquiv/Unicode
        but no TextLine
  --dummy-word / --no-dummy-word
        Whether to create a Word for TextLine that have TextEquiv/Unicode but
        no Word
  --textequiv-index INTEGER
        If multiple textequiv, use the n-th TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
        What to do if selected TextEquiv @index is not available: 'raise' will
        lead to a runtime error, 'first' will use the first TextEquiv, 'last'
        will use the last TextEquiv on the element
  --region-order [document|reading-order|reading-order-only]
        Order in which to iterate over the regions
  --textline-order [document|index|textline-order]
        Order in which to iterate over the textlines

> ocr-transform textract page --help-args
textract2page arguments:
textract2page options:
```

bertsky commented 1 year ago

> You need the image as an argument because the AWS Textract JSON does not contain the image (dimensions)?

Exactly. Textract uses floating point ratios (0..1) for all coordinates. So even if we could live with empty or bogus @imageFilename, we need width and height to calculate the absolute coordinates everywhere.
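To illustrate why width and height are indispensable, here is a minimal sketch of the scaling step, assuming the usual Textract `BoundingBox` layout with `Left`, `Top`, `Width` and `Height` given as ratios of the page size. The helper `to_pixels` and the `PixelBox` type are made up for this example and are not taken from textract2page.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    """Absolute box in pixels (hypothetical helper type for this sketch)."""
    left: int
    top: int
    right: int
    bottom: int


def to_pixels(bbox: dict, image_width: int, image_height: int) -> PixelBox:
    """Scale a relative (0..1) bounding box to absolute pixel coordinates.

    Horizontal ratios are multiplied by the image width, vertical ratios by
    the image height. Without both values the result cannot be computed.
    """
    left = bbox["Left"] * image_width
    top = bbox["Top"] * image_height
    right = (bbox["Left"] + bbox["Width"]) * image_width
    bottom = (bbox["Top"] + bbox["Height"]) * image_height
    return PixelBox(round(left), round(top), round(right), round(bottom))


# Example values chosen for illustration only.
box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}
print(to_pixels(box, image_width=2000, image_height=1000))
# PixelBox(left=500, top=500, right=1500, bottom=750)
```

The same ratios yield different pixel values for every image size, which is why the JSON alone is not enough to write absolute PAGE coordinates.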

(BTW, gcv__hocr is another case which needs width and height, but apparently it cannot derive these from the image file, so I just added width and height as script-args there.)

stweil commented 1 year ago

Thank you!

stweil commented 1 year ago

I just noticed that this PR and also a previous commit ff11c354 require a virtual environment because of pip3. That's currently neither documented nor handled automatically in the Makefile.

bertsky commented 1 year ago

> I just noticed that this PR and also a previous commit ff11c35 require a virtual environment because of pip3. That's currently neither documented nor handled automatically in the Makefile.

Indeed. I did not notice either. I would leave it to the user to set up a venv or virtualenv or conda environment though. So we would only need a few remarks in the readme IMO.

bertsky commented 1 year ago

On the other hand, we already make users set up a $HOME/.local/bin installation. It would be nice if that would suffice even for Python. For example, we could detect whether VIRTUAL_ENV is already defined, and if not, then create one under the same PREFIX at install-time, and activate it within ocr-transform at run-time.
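A rough sketch of that idea, written in Python for brevity even though the actual logic would probably live in the Makefile and the `ocr-transform` shell script. The location `PREFIX/share/ocr-fileformat/venv` and the function names `resolve_venv` and `ensure_venv` are assumptions made for this example, not anything the project defines.

```python
import os
import venv
from pathlib import Path
from typing import Mapping, Tuple


def resolve_venv(prefix: str, environ: Mapping[str, str]) -> Tuple[Path, bool]:
    """Pick the environment to use and report whether we own it.

    If VIRTUAL_ENV is set and non-empty, respect the user's environment.
    Otherwise fall back to a private one under the install prefix
    (hypothetical path, chosen only for this sketch).
    """
    active = environ.get("VIRTUAL_ENV")
    if active:
        return Path(active), False
    return Path(prefix) / "share" / "ocr-fileformat" / "venv", True


def ensure_venv(prefix: str, environ: Mapping[str, str] = os.environ) -> Path:
    """Install-time step: create the private environment if it is missing."""
    path, ours = resolve_venv(prefix, environ)
    if ours and not (path / "pyvenv.cfg").exists():
        # with_pip=True bootstraps pip so that dependencies can be installed
        venv.create(path, with_pip=True)
    return path


print(resolve_venv("/home/user/.local", {}))
print(resolve_venv("/home/user/.local", {"VIRTUAL_ENV": "/opt/myenv"}))
```

At run-time the wrapper could then make the same decision and, when the private environment is in use, put its `bin` directory at the front of `PATH` before calling the Python-based converters.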

bertsky commented 1 year ago

https://github.com/UB-Mannheim/ocr-fileformat/pull/162