UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Challenges processing textract #187

Open joewiz opened 2 weeks ago

joewiz commented 2 weeks ago

Thank you for providing this great tool! I'm writing with two questions about transforming Textract output into PAGE XML.

First, when I use the digi.bib.uni-mannheim.de hosted version of the ocr-fileformat application and try to transform textract JSON to page, the resulting XML file is empty.

Here's a sample textract input and resulting page output.

No errors - just an empty file.

For context, the AWS CLI command I used to produce this input was (with Bucket, Name, and Region obfuscated here):

$ aws textract start-document-analysis --document '{"S3Object":{"Bucket":"my-bucket","Name":"my-path/T172-09-0003.tif"}}' --feature-types '["LAYOUT"]' --region my-region

Does anyone have tips for ways to ensure ocr-fileformat can process textract output? (If not, I could ask at the upstream project https://github.com/slub/textract2page/ - but I figured I'd start here.)
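One thing that may be worth ruling out before digging into the converter: `start-document-analysis` is an asynchronous call whose response contains only a `JobId`, while the analysis itself (JSON with a top-level `Blocks` array) comes from a later `get-document-analysis` call. The post does not say which of the two was uploaded, so this is only a sanity-check sketch. The function name `classify_textract_json` is made up for illustration, and the check is a plain `grep`, not a real JSON parse.

```shell
# Hypothetical helper: guess which Textract call produced a JSON file,
# based on the top-level keys each response is expected to contain.
# This is a text match only; it does not validate the JSON.
classify_textract_json() {
    if grep -q '"Blocks"' "$1"; then
        echo "analysis-result"        # looks like get-document-analysis output
    elif grep -q '"JobId"' "$1"; then
        echo "job-start-response"     # looks like start-document-analysis output
    else
        echo "unknown"
    fi
}
```

Calling `classify_textract_json path/to/file.json` on the input file would print one of the three labels; only a file labelled `analysis-result` is likely to be something a converter can work with.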

Second, when I try to process the output locally via Docker, I don't see the textract option. From https://digi.bib.uni-mannheim.de/ocr-fileformat/:

Screenshot 2024-08-25 at 14 52 05

From version running locally in Docker:

Screenshot 2024-08-25 at 14 52 31

Does anyone have suggestions for getting the textract input option to appear when running locally via Docker?

Thank you!

bertsky commented 2 weeks ago

I don't know why the local Docker installation does not pick up textract2page (assuming you ran make docker – the version on Docker Hub is hopelessly outdated), but have you tried doing a local native installation? (You need Python and a venv for that...) Also, textract2page can be installed directly – if that gives you issues, please report.

If you do install textract2page, consider updating to https://github.com/slub/textract2page/pull/23 – it should give best results, but is still not 100% tested.

joewiz commented 2 weeks ago

@bertsky Thanks for your reply! Indeed, I had been using the version on Docker Hub. Since I'm currently most interested in the textract2page conversion, I think I'll focus first on trying to get output from the textract2page utility, and once that works, I'll work on ocr-fileformat.

For both the master branch and your https://github.com/slub/textract2page/pull/23 PR, I got the same results as I did above - a 0-byte XML file. To troubleshoot, I tried running textract2page on one of the included test images and got the same result:

textract2page on  toplevel-reading-order [?] via 🐍 v3.11.7 
❯ textract2page tests/workspace/textract_responses/18xx-Missio-EMU-0042.json tests/workspace/images/18xx-Missio-EMU-0042.jpg > test.xml

textract2page on  toplevel-reading-order [?] via 🐍 v3.11.7 
❯ ls -l test.xml               
-rw-r--r--@ 1 joe  staff  0 Aug 26 08:55 test.xml

Do you have any suggestions for coaxing textract2page to return results? I'd be happy to provide any info about my environment that would help. For starters, I'm on an Intel Mac running macOS 14.6.1, with python 3.11 (miniconda3-3.11-24.1.2-0, installed via asdf).
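A note on reading the 0-byte result: the shell creates (or truncates) `test.xml` for the `>` redirection before the command starts, so an empty file only says that nothing was written to standard output, not whether the command succeeded. A minimal sketch of a guard that makes the silent failure explicit; the helper name `require_nonempty` is invented for this example.

```shell
# Hypothetical helper: succeed only if the given file exists and is non-empty.
require_nonempty() {
    if [ -s "$1" ]; then
        echo "ok: $1"
    else
        echo "empty or missing: $1" >&2
        return 1
    fi
}
```

Running `require_nonempty test.xml` right after the conversion, together with a look at the converter's exit status (`echo $?`), would help tell "ran but produced nothing" apart from "failed outright".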

joewiz commented 2 weeks ago

With an eye toward getting the current version of ocr-fileformat working locally, I went ahead and installed it using the instructions in the README. While the build steps appeared to work perfectly, I am getting an error when calling the utility.

❯ ~/.local/bin/ocr-transform -h    
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 39: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 41: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 43: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 57: alto4.2__alto2.1: syntax error: invalid arithmetic operator (error token is ".2__alto2.1")
Usage:
ocr-transform [OPTIONS] <from> <to> [<infile> [<outfile>]] [-- <script-args>]
ocr-transform [OPTIONS] <from> <to> --help-args Show script-args, and exit
ocr-transform [OPTIONS] -h|--help               Show this help, and exit
ocr-transform [OPTIONS] -v|--version            Show version, and exit
ocr-transform [OPTIONS] -L|--list               List available from/to, and exit

    Options:
        --debug   -d     Increase debug level by 1, can be repeated

    Transformations:
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 23: show_transformations: command not found

With the debug flag set:

❯ ~/.local/bin/ocr-transform -d -d   
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 39: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 41: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 43: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 57: alto4.2__alto2.1: syntax error: invalid arithmetic operator (error token is ".2__alto2.1")
+ local from= to= infile=- outfile=- transformer
+ shift 2
+ [[ -z '' ]]
+ show_usage 'Must set '\''from'\'' parameter'
+ [[ 1 -gt 0 ]]
+ logerr 'Must set '\''from'\'' parameter'
+ local 'IFS=
'
+ for line in '$*'
+ echo -e '\033[0m[\033[1;31mERROR\033[0m] Must set '\''from'\'' parameter'
[ERROR] Must set 'from' parameter
+ echo 'Usage:
ocr-transform [OPTIONS] <from> <to> [<infile> [<outfile>]] [-- <script-args>]
ocr-transform [OPTIONS] <from> <to> --help-args Show script-args, and exit
ocr-transform [OPTIONS] -h|--help               Show this help, and exit
ocr-transform [OPTIONS] -v|--version            Show version, and exit
ocr-transform [OPTIONS] -L|--list               List available from/to, and exit

    Options:
        --debug   -d     Increase debug level by 1, can be repeated

'
Usage:
ocr-transform [OPTIONS] <from> <to> [<infile> [<outfile>]] [-- <script-args>]
ocr-transform [OPTIONS] <from> <to> --help-args Show script-args, and exit
ocr-transform [OPTIONS] -h|--help               Show this help, and exit
ocr-transform [OPTIONS] -v|--version            Show version, and exit
ocr-transform [OPTIONS] -L|--list               List available from/to, and exit

    Options:
        --debug   -d     Increase debug level by 1, can be repeated

+ echo -e '\n    Transformations:'

    Transformations:
+ show_transformations
/Users/joe/.local/share/ocr-fileformat/lib.sh: line 23: show_transformations: command not found
+ sed 's/^/        /'
+ [[ 1 -gt 0 ]]
+ exit 1

I'd be happy to provide any info or perform any steps that could help shed light.

bertsky commented 2 weeks ago

How did you install textract2page – in a local venv (as advised by the readme)?

Also, if you did pull the PR: you probably have to specify pip install ./textract2page instead of just pip install textract2page (which might pull from PyPI). Still, a zero-byte output (esp. without warnings) should not happen.

What do pip show textract2page and python -c "import textract2page; print(textract2page.__version__)" say?

Regarding the installation of ocr-fileformat on macOS – I believe you first need to install bash (and probably other packages) via brew.
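For context, the `declare: -A: invalid option` errors in the transcript are what bash prints when it predates associative arrays, which were introduced in bash 4.0; the `/bin/bash` that ships with macOS is version 3.2. A minimal sketch for checking this is below. The helper names `bash_major` and `supports_assoc_arrays` are invented here and are not part of ocr-fileformat.

```shell
# Print the major version from a bash version string.
# Defaults to the running shell's BASH_VERSION (or 0 if it is not bash).
bash_major() {
    v="${1:-${BASH_VERSION:-0}}"
    echo "${v%%.*}"
}

# Succeed if the given (or running) bash version is 4 or newer,
# i.e. new enough for `declare -A`.
supports_assoc_arrays() {
    [ "$(bash_major "$1")" -ge 4 ]
}
```

For example, `supports_assoc_arrays "$(/bin/bash -c 'echo $BASH_VERSION')" || echo "bash too old"` would flag the stock macOS shell. Whether a newer bash installed via brew is actually picked up depends on the script's shebang line and on your PATH.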

bertsky commented 2 weeks ago

Oops, sorry – turns out textract2page recently introduced a packaging bug, which affected editable installs. Please pull from the PR again (tip should be at bf89b08)!

joewiz commented 2 weeks ago

@bertsky Thank you! And to make a long story short, your suggestion for building from your branch worked, and I was able to successfully generate Page XML for both the included sample files and my own! Here's the final output, using the input file I mentioned in the original posting above. The only possible issue is that the HEAD of your branch is 4eb96ab, not bf89b08 as you just mentioned.

But I've left my full answers to your questions, in case there is any useful info there.

First, as to how I first installed textract2page, I used python 3.11 (miniconda3-3.11-24.1.2-0, which I'd installed via asdf), with the following command:

pip install textract2page

After installing, I got a PackageNotFoundError about a missing dependency (ocrd_modelfactory) when calling textract2page --help. As a Python illiterate, I asked ChatGPT how to fix this, and it suggested the following:

pip install ocrd_modelfactory

... which allowed the utility to run without that error, but it still yielded 0-byte XML output.

Second, when installing from your branch, I had cloned the repo, switched to your branch, and built using the following (also suggested by ChatGPT in the same thread linked above):

pip install -r requirements.txt

Here's the output of the commands you asked about, under the original attempted installation of your branch:

❯ pip show textract2page
DEPRECATION: Loading egg at /Users/joe/.asdf/installs/python/miniconda3-3.11-24.1.2-0/lib/python3.11/site-packages/textract2page-0.0.0-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: textract2page
Version: 0.0.0
Summary: Convert AWS Textract JSON to PRImA PAGE XML
Home-page: 
Author: Arne Rümmler
Author-email: arne.ruemmler@gmail.com
License: Apache Software License
Location: /Users/joe/.asdf/installs/python/miniconda3-3.11-24.1.2-0/lib/python3.11/site-packages/textract2page-0.0.0-py3.11.egg
Requires: click, ocrd, pillow
Required-by: 

❯ python -c "import textract2page; print(textract2page.__version__)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/joe/workspace/textract2page/textract2page/__init__.py", line 4, in <module>
    from ._version import version
ModuleNotFoundError: No module named 'textract2page._version'

After installing as you suggested (via pip install . in the project directory where I had switched to your branch), I now get the following:

❯ pip show textract2page                                            
DEPRECATION: Loading egg at /Users/joe/.asdf/installs/python/miniconda3-3.11-24.1.2-0/lib/python3.11/site-packages/textract2page-0.0.0-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: textract2page
Version: 0.0.0
Summary: Convert AWS Textract JSON to PRImA PAGE XML
Home-page: 
Author: Arne Rümmler
Author-email: arne.ruemmler@gmail.com
License: Apache Software License
Location: /Users/joe/.asdf/installs/python/miniconda3-3.11-24.1.2-0/lib/python3.11/site-packages/textract2page-0.0.0-py3.11.egg
Requires: click, ocrd, pillow
Required-by: 

❯ python -c "import textract2page; print(textract2page.__version__)"
0.3.dev15+g4eb96ab

And now, once I run textract2page on both the included test files and my own, it works! Thank you!
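One detail in the output above that may be worth a follow-up: even after the working install, `pip show` still reports version 0.0.0 from a `.egg` location, while the import reports 0.3.dev15+g4eb96ab. Since `python -c` puts the current directory first on the module search path, the import may be resolving to the checkout rather than to site-packages, which could mean a stale egg is still lying around. This is a guess from the transcript, not a confirmed diagnosis. A small sketch for summarizing `pip show` output follows; the helper name `summarize_pip_show` is hypothetical.

```shell
# Hypothetical helper: read `pip show <package>` output on stdin and print
# the reported version plus whether the install location is a legacy .egg.
summarize_pip_show() {
    awk -F': ' '
        /^Version: /  { version = $2 }
        /^Location: / { location = $2 }
        END {
            printf "version=%s\n", version
            if (location ~ /\.egg$/) print "legacy-egg=yes"
            else print "legacy-egg=no"
        }'
}
```

Usage would be along the lines of `pip show textract2page 2>/dev/null | summarize_pip_show`, ideally run from outside the checkout directory so that the comparison with the imported version is not skewed by the current directory.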