Closed jbarth-ubhd closed 7 months ago
Oops! The CI did not catch this, since it instantiates the processor differently. This broke recently when OCR-D/core changed the Processor initialization in run_processor
(get_processor, then assigning workspace post-hoc).
I had already adapted this locally but did not push yet. Can you please try again with current master?
I.e. in ocrd/all, just do:
git -C /build/ocrd_keraslm checkout master
git -C /build/ocrd_keraslm pull origin master
make -C /build -W ocrd_keraslm ocrd-keraslm-rate
Tried building ocrd, but ... uh, dependency hell:
In file included from /home/jb/ocrd_all/ocrd_olena/repo/olena/milena/mln/io/magick/all.hh:44,
from /home/jb/ocrd_all/ocrd_olena/repo/olena/scribo/src/binarization/global_threshold.cc:29:
/home/jb/ocrd_all/ocrd_olena/repo/olena/milena/mln/io/magick/load.hh: In function ‘void mln::io::magick::load(mln::Image<I>&, const string&)’:
/home/jb/ocrd_all/ocrd_olena/repo/olena/milena/mln/io/magick/load.hh:191:10: error: ‘PixelPacket’ is not a member of ‘Magick’; did you mean ‘MagickCore::PixelPacket’?
191 | Magick::PixelPacket* pixels = view.get(0, 0, ima.ncols(), ima.nrows());
| ^~~~~~~~~~~
In file included from /usr/local/include/ImageMagick-7/MagickCore/stream.h:25,
...
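The include path /usr/local/include/ImageMagick-7 in the log suggests that the compiler picked up Magick++ 7, whose API no longer provides Magick::PixelPacket. A minimal sketch for checking which major version is installed, assuming Magick++-config is on the PATH (magick_major and check_magick are hypothetical helper names introduced here, not part of ocrd_all):

```shell
#!/bin/bash
# Sketch only: report the major version of the installed Magick++.
# magick_major and check_magick are hypothetical helpers for this example.

magick_major () {
    # Keep everything before the first dot, e.g. "7.1.1-15" -> "7".
    printf '%s\n' "${1%%.*}"
}

check_magick () {
    # Magick++-config ships with the ImageMagick development files.
    if command -v Magick++-config >/dev/null 2>&1; then
        version="$(Magick++-config --version)"
        echo "Magick++ major version: $(magick_major "$version")"
    else
        echo "Magick++-config not found"
    fi
}
```

Calling check_magick on the build host prints the major version, or a note if the tool is missing. A result of 7 would match the error above.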
Wait, that looks like a native build of ocrd_all from scratch – I thought you were in Docker?
For a native installation, just follow the Setup Guide, with the difference that you need the ocrd_keraslm update:
cd ocrd_all
git pull
make modules
sudo make deps-ubuntu
git -C ocrd_keraslm checkout master
git -C ocrd_keraslm pull origin master
make all NO_UPDATE=1
Or, if you already had the other ocrd_all modules, just do the equivalent of the above Docker recipe:
git -C ocrd_keraslm checkout master
git -C ocrd_keraslm pull origin master
make -W ocrd_keraslm ocrd-keraslm-rate NO_UPDATE=1
Just saw git and thought "from source".
Let's try again...
jb@pers16:~> docker-ocrd git -C /build/ocrd_keraslm checkout master
Previous HEAD position was 472197f update assets
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
jb@pers16:~> docker-ocrd git -C /build/ocrd_keraslm pull origin master
From https://github.com/OCR-D/ocrd_keraslm
* branch master -> FETCH_HEAD
b996c82..ea79b2a master -> origin/master
Updating 472197f..ea79b2a
Fast-forward
.circleci/config.yml | 10 +-
CHANGELOG.md | 12 +++
Makefile | 14 ++-
README.md | 183 +++++++++++++++++++++++++-----------
ocrd_keraslm/lib/rating.py | 56 ++++++++---
ocrd_keraslm/scripts/run.py | 51 +++++++---
ocrd_keraslm/wrapper/ocrd-tool.json | 12 ++-
ocrd_keraslm/wrapper/rate.py | 89 +++++++++++-------
setup.py | 1 +
test/test_wrapper.py | 9 +-
10 files changed, 308 insertions(+), 129 deletions(-)
jb@pers16:~> docker-ocrd make -C /build -W ocrd_keraslm ocrd-keraslm-rate
make: Entering directory '/build'
make -o ocrd_keraslm ocrd-keraslm-rate keraslm-rate VIRTUAL_ENV=/usr/local/sub-venv/headless-tf1
make[1]: Entering directory '/build'
make[1]: Nothing to be done for 'ocrd-keraslm-rate'.
make[1]: Nothing to be done for 'keraslm-rate'.
make[1]: Leaving directory '/build'
chmod +x /usr/local/bin/ocrd-keraslm-rate /usr/local/bin/keraslm-rate
make: Leaving directory '/build'
What exactly does your docker-ocrd do?
perhaps not the right thing for persistency:
jb@pers16:~/workspace/ocrd-keras> cat /usr/local/bin/docker-ocrd
#!/bin/bash
docker_ocrd () {
models_in_container="/models"
if echo "$@" | grep -q ocrd-tesser
then
models_in_container="/usr/local/share" # https://github.com/OCR-D/ocrd_all/issues/394#issue-1950168885
fi
# $time singularity exec --bind $TMPDIR:/tmp --bind .:/data --bind $HOME/ocrd_models:$models_in_container -e --env-file $HOME/ocrd.env $HOME/ocrd.sif "$@"
docker run --rm -u 0 -v $PWD:/data -v /home/jb/ocrd-models:$models_in_container -w /data -- ocrd/all:maximum "$@"
}
docker_ocrd "$@"
jb@pers16:~> docker run -u 0 -it --name "kerasxx" -- ocrd/all:maximum bash
/data$ git -C /build/ocrd_keraslm checkout master
Previous HEAD position was 472197f update assets
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
/data$ git -C /build/ocrd_keraslm pull origin master
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 41 (delta 25), reused 41 (delta 25), pack-reused 0
Unpacking objects: 100% (41/41), 8.66 KiB | 554.00 KiB/s, done.
From https://github.com/OCR-D/ocrd_keraslm
* branch master -> FETCH_HEAD
b996c82..ea79b2a master -> origin/master
Updating b996c82..ea79b2a
Fast-forward
.circleci/config.yml | 10 +++---
Makefile | 14 ++++++--
README.md | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
ocrd_keraslm/lib/rating.py | 29 ++++++++++++++---
ocrd_keraslm/scripts/run.py | 23 ++++++++++---
ocrd_keraslm/wrapper/rate.py | 89 ++++++++++++++++++++++++++++++++------------------
test/test_wrapper.py | 9 +++---
7 files changed, 216 insertions(+), 123 deletions(-)
/data$ make -C /build -W ocrd_keraslm ocrd-keraslm-rate
make: Entering directory '/build'
make -o ocrd_keraslm ocrd-keraslm-rate keraslm-rate VIRTUAL_ENV=/usr/local/sub-venv/headless-tf1
make[1]: Entering directory '/build'
make[1]: Nothing to be done for 'ocrd-keraslm-rate'.
make[1]: Nothing to be done for 'keraslm-rate'.
make[1]: Leaving directory '/build'
chmod +x /usr/local/bin/ocrd-keraslm-rate /usr/local/bin/keraslm-rate
make: Leaving directory '/build'
/data$
jb@pers16:~> docker commit -p -a "Jochen" -m "keras.." 2d1f96764b44 ocrd_kerasxx
sha256:174de6aac01422a69e3cc74238fdc00bdea9d30d647914d5bd3b1001b0d11444
Ah, ok, updating in sub-venvs has become more difficult now. Simplest way…
# in Docker
rm /usr/local/sub-venv/headless-tf1/bin/ocrd-keraslm-rate
# native venv
rm venv/sub-venv/headless-tf1/bin/ocrd-keraslm-rate
…then the above make call
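The earlier make output shows the nested call running with -o ocrd_keraslm, which presumably is why -W alone led to "Nothing to be done". Removing the installed entry point leaves the target missing, so make has to rebuild it. The whole sequence can be sketched as one function. force_reinstall and RUN are hypothetical names introduced for this example, and by default the commands are only printed:

```shell
#!/bin/bash
# Sketch only: force a reinstall of ocrd_keraslm into its sub-venv.
# force_reinstall and RUN are hypothetical names for this example.
# Docker image layout:  force_reinstall /usr/local /build
# Native ocrd_all:      force_reinstall venv . NO_UPDATE=1

force_reinstall () (
    # The ( ) body runs in a subshell, so these variables stay local.
    venv_root="${1:-/usr/local}"
    build_dir="${2:-/build}"
    if [ "$#" -ge 2 ]; then shift 2; else shift "$#"; fi   # rest goes to make
    run=""
    [ "${RUN:-0}" = 1 ] || run="echo"   # default: only print the commands
    $run git -C "$build_dir/ocrd_keraslm" checkout master
    $run git -C "$build_dir/ocrd_keraslm" pull origin master
    # With the entry point gone, the make target is missing and gets rebuilt.
    $run rm -f "$venv_root/sub-venv/headless-tf1/bin/ocrd-keraslm-rate"
    $run make -C "$build_dir" -W ocrd_keraslm ocrd-keraslm-rate "$@"
)
```

Running force_reinstall without arguments prints the four commands for the Docker layout. Setting RUN=1 executes them instead.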
/data$ rm /usr/local/sub-venv/headless-tf1/bin/ocrd-keraslm-rate
/data$ git -C /build/ocrd_keraslm checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
/data$ git -C /build/ocrd_keraslm pull origin master
From https://github.com/OCR-D/ocrd_keraslm
* branch master -> FETCH_HEAD
Already up to date.
/data$ make -C /build -W ocrd_keraslm ocrd-keraslm-rate
make: Entering directory '/build'
make -o ocrd_keraslm ocrd-keraslm-rate keraslm-rate VIRTUAL_ENV=/usr/local/sub-venv/headless-tf1
make[1]: Entering directory '/build'
. /usr/local/sub-venv/headless-tf1/bin/activate && if test 3.8 = 3.8 && ! pip show -q tensorflow-gpu; then sem -q --will-cite --fg --id ocrd_all_pipheadless-tf1 pip install nvidia-pyindex && pushd $(mktemp -d) && sem -q --will-cite --fg --id ocrd_all_pipheadless-tf1 pip download --no-deps "nvidia-tensorflow==1.15.5+nv22.12" && for name in nvidia_tensorflow-*.whl; do name=${name%.whl}; done && python3 -m wheel unpack $name.whl && for name in nvidia_tensorflow-*/; do name=${name%/}; done && newname=${name/nvidia_tensorflow/tensorflow_gpu} && sed -i s/nvidia_tensorflow/tensorflow_gpu/g $name/$name.dist-info/METADATA && sed -i s/nvidia_tensorflow/tensorflow_gpu/g $name/$name.dist-info/RECORD && sed -i s/nvidia_tensorflow/tensorflow_gpu/g $name/tensorflow_core/tools/pip_package/setup.py && pushd $name && for path in $name*; do mv $path ${path/$name/$newname}; done && popd && python3 -m wheel pack $name && sem -q --will-cite --fg --id ocrd_all_pipheadless-tf1 pip install --no-cache-dir $newname*.whl && popd && rm -fr $OLDPWD; fi
# - preempt conflict over numpy between scikit-image and tensorflow
# - preempt conflict over numpy between tifffile and tensorflow (and allow py36)
. /usr/local/sub-venv/headless-tf1/bin/activate && sem -q --will-cite --fg --id ocrd_all_pipheadless-tf1 pip install imageio==2.14.1 "tifffile<2022"
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting imageio==2.14.1
Downloading imageio-2.14.1-py3-none-any.whl.metadata (4.0 kB)
Collecting tifffile<2022
Downloading tifffile-2021.11.2-py3-none-any.whl.metadata (29 kB)
Requirement already satisfied: numpy in /usr/local/sub-venv/headless-tf1/lib/python3.8/site-packages (from imageio==2.14.1) (1.23.5)
Requirement already satisfied: pillow>=8.3.2 in /usr/local/sub-venv/headless-tf1/lib/python3.8/site-packages (from imageio==2.14.1) (10.2.0)
Downloading imageio-2.14.1-py3-none-any.whl (3.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 53.1 MB/s eta 0:00:00
Downloading tifffile-2021.11.2-py3-none-any.whl (178 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 178.9/178.9 kB 147.8 MB/s eta 0:00:00
Installing collected packages: tifffile, imageio
Attempting uninstall: tifffile
Found existing installation: tifffile 2023.7.10
Uninstalling tifffile-2023.7.10:
Successfully uninstalled tifffile-2023.7.10
Attempting uninstall: imageio
Found existing installation: imageio 2.34.0
Uninstalling imageio-2.34.0:
Successfully uninstalled imageio-2.34.0
Successfully installed imageio-2.14.1 tifffile-2021.11.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scikit-image 0.21.0 requires imageio>=2.27, but you have imageio 2.14.1 which is incompatible.
scikit-image 0.21.0 requires tifffile>=2022.8.12, but you have tifffile 2021.11.2 which is incompatible.
# - preempt conflict over numpy between h5py and tensorflow
. /usr/local/sub-venv/headless-tf1/bin/activate && sem -q --will-cite --fg --id ocrd_all_pipheadless-tf1 pip install "numpy<1.24"
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: numpy<1.24 in /usr/local/sub-venv/headless-tf1/lib/python3.8/site-packages (1.23.5)
. /usr/local/sub-venv/headless-tf1/bin/activate && cd ocrd_keraslm && sem -q --will-cite --fg --id ocrd_all_pipheadless-tf1 pip install --timeout=3000 -e . && touch -c /usr/local/sub-venv/headless-tf1/bin/ocrd-keraslm-rate
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///build/ocrd_keraslm
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: ocrd>=2.13.1 in /usr/local/sub-venv/headless-tf1/lib/python3.8/site-packages (from ocrd_keraslm==0.4.3) (2.63.3)
...
...
Installing collected packages: ocrd_keraslm
Attempting uninstall: ocrd_keraslm
Found existing installation: ocrd_keraslm 0.4.2
Uninstalling ocrd_keraslm-0.4.2:
Successfully uninstalled ocrd_keraslm-0.4.2
Running setup.py develop for ocrd_keraslm
Successfully installed ocrd_keraslm-0.4.3
make[1]: Nothing to be done for 'keraslm-rate'.
make[1]: Leaving directory '/build'
chmod +x /usr/local/bin/ocrd-keraslm-rate /usr/local/bin/keraslm-rate
make: Leaving directory '/build'
/data$
perhaps not the right thing for persistency:
That instantiates a new container with each invocation, so nothing will be shared or persisted, and none of the above recipes would work.
If you really want this behaviour, then use docker exec instead of docker run and try to just reuse the same container each time.
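The docker exec suggestion can be sketched as a variant of the wrapper above that starts one long-lived container and reuses it. This is a minimal sketch, not a tested replacement: CONTAINER, MODELS and DOCKER are hypothetical variables introduced here, and keeping the container alive with sleep infinity is an assumption about the image.

```shell
#!/bin/bash
# Sketch only: reuse one long-lived container instead of `docker run --rm` per call.
# CONTAINER, MODELS and DOCKER are hypothetical names for this example.
CONTAINER="${CONTAINER:-ocrd-persistent}"
MODELS="${MODELS:-$HOME/ocrd-models}"
DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to preview the final command

docker_ocrd () {
    if [ -z "$($DOCKER ps -aq -f "name=^${CONTAINER}\$")" ]; then
        # First call: create the container; `sleep infinity` just keeps it alive.
        $DOCKER run -d --name "$CONTAINER" -u 0 \
            -v "$PWD:/data" -v "$MODELS:/models" -w /data \
            ocrd/all:maximum sleep infinity >/dev/null
    else
        # Later calls: make sure the existing container is running.
        $DOCKER start "$CONTAINER" >/dev/null
    fi
    # Run the requested command inside the same container every time.
    $DOCKER exec -u 0 -w /data "$CONTAINER" "$@"
}
```

With this, docker_ocrd git -C /build/ocrd_keraslm pull origin master followed by docker_ocrd make ... would act on the same filesystem. Note that the /data mount is fixed to the directory from which the container was first created.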
EDIT: also, there should be no need for the /models workaround anymore for tessdata.
So, does it work now?
works:
jb@pers16:~/workspace/ocrd-keras> ./run.sh
+ set -e
+ docker-ocrd ocrd-keraslm-rate -I OCR-D-OCR -O OCR-D-KERAS -P model_file model_dta_full.h5 -P textequiv_level word -P alternative_decoding false
Using TensorFlow backend.
11:53:51.700 WARNING root - Limited tf.compat.v2.summary API due to missing TensorBoard installation.
11:53:51.852 INFO processor.KerasRate - using CPU LSTM implementation to compile stateful contiguous model of depth 2 width 128 length 256 size 1273
11:53:52.407 INFO processor.KerasRate - INPUT FILE 0 / p0002
11:53:52.440 INFO processor.KerasRate - Scoring text in page 'OCR-D-OCR_test-fouche10_5' at the word level
11:53:52.441 INFO ocrd.page_validator.validate - Validating input file 'OCR-D-OCR_test-fouche10_5'
11:53:52.677 INFO processor.KerasRate - Rating 1003 elements with a total of 3383 characters
1/14 [=>............................] - ETA: 2s
2/14 [===>..........................] - ETA: 1s
3/14 [=====>........................] - ETA: 1s
5/14 [=========>....................] - ETA: 0s
6/14 [===========>..................] - ETA: 0s
7/14 [==============>...............] - ETA: 0s
8/14 [================>.............] - ETA: 0s
9/14 [==================>...........] - ETA: 0s
10/14 [====================>.........] - ETA: 0s
11/14 [======================>.......] - ETA: 0s
12/14 [========================>.....] - ETA: 0s
13/14 [==========================>...] - ETA: 0s
14/14 [==============================] - 1s 70ms/step
11:53:53.703 INFO processor.KerasRate - avg: 0.334, char ppl: 7.185, word ppl: 773.807
11:53:53.719 INFO ocrd.process.profile - Executing processor 'ocrd-keraslm-rate' took 1.312577s (wall) 2.464611s (CPU)( [--input-file-grp='OCR-D-OCR' --output-file-grp='OCR-D-KERAS' --parameter='{"model_file": "model_dta_full.h5", "textequiv_level": "word", "alternative_decoding": false, "beam_width": 10, "lm_weight": 0.5}' --page-id='']
So the conf= attributes are overwritten?
35a36,48
> <pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-keraslm-rate">
> <pc:Labels externalModel="ocrd-tool" externalId="parameters">
> <pc:Label value="model_dta_full.h5" type="model_file"/>
> <pc:Label value="word" type="textequiv_level"/>
> <pc:Label value="False" type="alternative_decoding"/>
> <pc:Label value="10" type="beam_width"/>
> <pc:Label value="0.5" type="lm_weight"/>
> </pc:Labels>
> <pc:Labels externalModel="ocrd-tool" externalId="version">
> <pc:Label value="0.4.3" type="ocrd-keraslm-rate"/>
> <pc:Label value="2.63.3" type="ocrd/core"/>
> </pc:Labels>
> </pc:MetadataItem>
53c66
< <pc:TextEquiv conf="0.966855773925781">
---
> <pc:TextEquiv conf="0.736842163484543">
59c72
< <pc:TextEquiv conf="0.962821807861328">
---
> <pc:TextEquiv conf="0.62280951615423">
65c78
< <pc:TextEquiv conf="0.966287384033203">
---
> <pc:TextEquiv conf="0.677824027637641">
so conf= -Attributes are overwritten?
Yes, see docstring or --help or readme.
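To see the overwritten values without the surrounding diff noise, the conf attributes of the input and output PAGE files can be listed side by side. This is a rough sketch using grep and sed rather than a real XML parser. conf_values and compare_conf are hypothetical helper names, and the pattern assumes TextEquiv elements written as in the diff above:

```shell
#!/bin/bash
# Sketch only: list TextEquiv conf values of two PAGE-XML files side by side.
# conf_values and compare_conf are hypothetical helpers for this example.

conf_values () {
    # Print one conf value per line, in document order.
    grep -o 'TextEquiv[^>]*conf="[^"]*"' "$1" | sed 's/.*conf="\([^"]*\)"/\1/'
}

compare_conf () {
    # Column 1: values before rating, column 2: values after rating.
    tmp_before="$(mktemp)"
    tmp_after="$(mktemp)"
    conf_values "$1" > "$tmp_before"
    conf_values "$2" > "$tmp_after"
    paste "$tmp_before" "$tmp_after"
    rm -f "$tmp_before" "$tmp_after"
}
```

Usage would be compare_conf with one file from OCR-D-OCR and the matching file from OCR-D-KERAS. The rows only line up if both files contain the same TextEquiv elements in the same order.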
In docker, ocrd/all:maximum