Closed by bertsky 1 year ago
You need the image as an argument because the AWS Textract JSON does not contain the image (dimensions)?
Exactly. Textract uses floating point ratios (0..1) for all coordinates. So even if we could live with an empty or bogus `@imageFilename`, we need width and height to calculate the absolute coordinates everywhere.
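To illustrate why the image dimensions are needed, here is a minimal sketch of the scaling step. It assumes the `Left`/`Top`/`Width`/`Height` keys of a Textract `BoundingBox`; the `PixelBox` type and the `to_pixels` helper are hypothetical names for illustration, not the actual textract2page code.

```python
from dataclasses import dataclass


@dataclass
class PixelBox:
    """Absolute pixel coordinates of a rectangle (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def to_pixels(bbox: dict, width: int, height: int) -> PixelBox:
    """Scale a relative (0..1) bounding box to absolute pixel coordinates.

    `bbox` is assumed to carry the Textract BoundingBox keys
    Left, Top, Width and Height, all given as ratios of the page size.
    """
    left = bbox["Left"] * width
    top = bbox["Top"] * height
    return PixelBox(
        left=round(left),
        top=round(top),
        right=round(left + bbox["Width"] * width),
        bottom=round(top + bbox["Height"] * height),
    )
```

Without `width` and `height` from the image, none of these values can be computed, which is why the image has to be passed as an argument.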
(BTW, gcv__hocr is another case which needs width and height, but apparently it cannot derive these from the image file, so I just added width and height as script-args there.)
Thank you!
I just noticed that this PR and also a previous commit ff11c354 require a virtual environment because of `pip3`. That's currently neither documented nor handled automatically in the Makefile.
Indeed. I did not notice either. I would leave it to the user to set up a venv or virtualenv or conda environment though. So we would only need a few remarks in the readme IMO.
On the other hand, we already make users set up a `$HOME/.local/bin` installation. It would be nice if that would suffice even for Python. For example, we could detect whether `VIRTUAL_ENV` is already defined, and if not, then create one under the same `PREFIX` at install-time, and activate it within `ocr-transform` at run-time.
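A rough sketch of that detection logic, using only the Python standard library. The `ensure_venv` helper and the `share/ocr-transform/venv` location under `PREFIX` are assumptions made for illustration; the real layout would be up to the Makefile.

```python
import os
import venv
from pathlib import Path


def ensure_venv(prefix, environ=None, with_pip=True):
    """Return the active virtual environment, or create one under `prefix`.

    If VIRTUAL_ENV is set, the user already manages an environment and it
    is returned unchanged. Otherwise an environment is created once under
    a hypothetical PREFIX/share/ocr-transform/venv directory and reused.
    """
    environ = os.environ if environ is None else environ
    active = environ.get("VIRTUAL_ENV")
    if active:
        return Path(active)
    # Assumed location, not an existing convention of the project.
    target = Path(prefix) / "share" / "ocr-transform" / "venv"
    if not (target / "pyvenv.cfg").exists():
        venv.create(target, with_pip=with_pip)
    return target
```

At install-time this could be called with the same `PREFIX` as the rest of the installation. At run-time, `ocr-transform` would then only need to put the returned directory's `bin` in front of `PATH` before calling the Python-based transformations.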
In contrast to all existing transformations, https://github.com/slub/textract2page MUST know the image file, so I also tried to make it easier for the user to know what `script-args` are possible/expected:

Example calls for `--help-args`:
```
> ocr-transform hocr page --help-args
Usage: see http://www.saxonica.com/documentation/index.html#!using-xsl/commandline
Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -jit -l -lib -license -m -nogo -now -ns -o -opt -or -outval -p -quit -r -relocate -repeat -s -sa -scmin -strip -t -T -target -TB -threads -TJ -Tlevel -Tout -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -y --?
Use -XYZ:? for details of option XYZ
Params:
  param=value           Set stylesheet string parameter
  +param=filename       Set stylesheet document parameter
  ?param=expression     Set stylesheet parameter using XPath
  !param=value          Set serialization parameter
> ocr-transform gcv hocr --help-args
Extra arguments:
> ocr-transform page alto --help-args
page-to-alto options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if selected TextEquiv @index is
                                  not available: 'raise' will lead to a
                                  runtime error, 'first' will use the first
                                  TextEquiv, 'last' will use the last
                                  TextEquiv on the element
  --region-order [document|reading-order|reading-order-only]
                                  Order in which to iterate over the regions
  --textline-order [document|index|textline-order]
                                  Order in which to iterate over the textlines
> ocr-transform textract page --help-args
textract2page arguments:
textract2page options:
```