OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

possibly an invalid tessdata path: AT_SYMLINK_NOFOLLOW ? #195

Closed jbarth-ubhd closed 6 months ago

jbarth-ubhd commented 11 months ago

I created a Singularity container from the OCR-D Docker image today, like this: singularity build ocrd.sif docker://ocrd/all:maximum

and then I started this command:

+ /home/hd/hd_hd/xx_xxxxx/local/bin/time singularity exec --bind /home/hd/hd_hd/xx_xxxxx/ocrd_models/tessdata:/usr/local/share/tessdata --bind /home/hd/hd_hd/xx_xxxxx/ocrd_models:/usr/local/share/ocrd-resources -e --env-file /home/hd/hd_hd/xx_xxxxx/ocrd.env --env MAGICK_TEMPORARY_PATH=/scratch/xx_xxxxx_job_1646883_m03n17 --env TMPDIR=/scratch/xx_xxxxx_job_1646883_m03n17 /home/hd/hd_hd/xx_xxxxx/ocrd.sif ocrd-tesserocr-crop -I OCR-D-001 -O OCR-D-002

here the output:

GID: readonly variable
UID: readonly variable
12:15:45.718 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-tesserocr-crop'
Traceback (most recent call last):
  File "/build/core/ocrd/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
  File "/build/ocrd_tesserocr/ocrd_tesserocr/crop.py", line 59, in process
    with tesserocr.PyTessBaseAPI() as tessapi:
  File "tesserocr.pyx", line 1219, in tesserocr.PyTessBaseAPI.__cinit__
  File "tesserocr.pyx", line 1233, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/share/tessdata/
Traceback (most recent call last):
  File "/usr/local/bin/ocrd-tesserocr-crop", line 33, in <module>
    sys.exit(load_entry_point('ocrd-tesserocr', 'console_scripts', 'ocrd-tesserocr-crop')())
...

I have never seen this error message from ocrd-tesserocr-crop before, and I did not change /home/hd/hd_hd/xx_xxxxx/ocrd_models (bound to /usr/local/share/ocrd-resources).

So I ran strace -f; this is the only line containing the string tessdata:

[pid 1404221] newfstatat(AT_FDCWD, "/gpfs/bwfor/home/hd/hd_hd/xx_xxxxx/ocrd_models/tessdata", {st_mode=S_IFDIR|0755, st_size=8192, ...}, AT_SYMLINK_NOFOLLOW) = 0

Why AT_SYMLINK_NOFOLLOW?
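
For reference, AT_SYMLINK_NOFOLLOW only asks the kernel to report on the path itself instead of following a trailing symlink, which is what lstat() does. The call above returned 0 and reported a directory (S_IFDIR), so this particular stat succeeded. A minimal Python sketch of the difference, using a throwaway temporary directory (the names `describe`, `real_dir` and `link` are illustrative only and not part of OCR-D or tesserocr):

```python
import os
import stat
import tempfile


def describe(path):
    """Return (type seen without following symlinks, type seen when following)."""
    def kind(mode):
        if stat.S_ISLNK(mode):
            return "symlink"
        if stat.S_ISDIR(mode):
            return "directory"
        return "other"

    # follow_symlinks=False gives lstat semantics, i.e. what AT_SYMLINK_NOFOLLOW requests
    no_follow = kind(os.stat(path, follow_symlinks=False).st_mode)
    # the default follows the symlink and reports on its target
    follow = kind(os.stat(path).st_mode)
    return no_follow, follow


with tempfile.TemporaryDirectory() as tmp:
    real_dir = os.path.join(tmp, "tessdata")
    link = os.path.join(tmp, "tessdata-link")
    os.mkdir(real_dir)
    os.symlink(real_dir, link)
    result_dir = describe(real_dir)
    result_link = describe(link)
    print(result_dir)   # ('directory', 'directory')
    print(result_link)  # ('symlink', 'directory')
```

For a real directory both views agree, which matches the S_IFDIR result in the strace line; the flag only makes a difference when the last path component is itself a symlink.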

Content of tessdata dir:

[xx_xxxxx@o05i14 tessdata]$ find . -type f -printf "%-50p %10s\n"|sort
./configs/alto                                             23
./configs/ambigs.train                                    146
./configs/api_config                                       26
./configs/bazaar                                          113
./configs/bigram                                          129
./configs/box.train                                       311
./configs/box.train.stderr                                311
./configs/digits                                           37
./configs/get.images                                       24
./configs/hocr                                             40
./configs/inter                                            59
./configs/kannada                                         101
./configs/linebox                                          70
./configs/logfile                                          25
./configs/lstmbox                                          26
./configs/lstmdebug                                        98
./configs/lstm.train                                      282
./configs/makebox                                          26
./configs/Makefile.am                                     365
./configs/pdf                                              22
./configs/quiet                                            21
./configs/rebox                                            65
./configs/strokewidth                                     377
./configs/tsv                                              22
./configs/txt                                             166
./configs/unlv                                             45
./configs/wordstrbox                                       29
./deu.traineddata                                     8628461
./eng.traineddata                                    15400601
./frak2021_1.069.traineddata                          5060763
./frak2021.traineddata                                3421140
./fra.traineddata                                     3972885
./GT4HistOCR_50000000.997_191951.traineddata          4591424
./osd.traineddata                                    10562727
./pdf.ttf                                                 572
./script/Latin.traineddata                          101402885
./tessconfigs/batch                                        49
./tessconfigs/batch.nochop                                 37
./tessconfigs/matdemo                                     243
./tessconfigs/msdemo                                      368
./tessconfigs/nobatch                                       1
./tessconfigs/segdemo                                     295
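
A listing like the one above can be cross-checked from inside the container with a short script. The following is only a sketch under assumptions: the helper name `check_tessdata` is hypothetical, `eng` is assumed to be the model that gets loaded when no language is given, and the checks cover only a missing directory, a missing model file, and an unreadable or empty model file. It cannot tell why Tesseract itself failed to initialize.

```python
import os
import tempfile


def check_tessdata(path, lang="eng"):
    """Return a list of problems found in a tessdata directory (empty if none)."""
    if not os.path.isdir(path):
        return [f"not a directory: {path}"]
    problems = []
    model = os.path.join(path, f"{lang}.traineddata")
    if not os.path.isfile(model):
        problems.append(f"missing model file: {lang}.traineddata")
    elif not os.access(model, os.R_OK):
        problems.append(f"model file not readable: {lang}.traineddata")
    elif os.path.getsize(model) == 0:
        problems.append(f"model file is empty: {lang}.traineddata")
    return problems


# Demonstration on a temporary directory (assumption: no real models involved)
with tempfile.TemporaryDirectory() as tmp:
    before = check_tessdata(tmp)
    with open(os.path.join(tmp, "eng.traineddata"), "wb") as fh:
        fh.write(b"placeholder")  # stand-in bytes, not a valid model
    after = check_tessdata(tmp)
    missing_dir = check_tessdata(os.path.join(tmp, "does-not-exist"))

print(before)  # ['missing model file: eng.traineddata']
print(after)   # []
```

In the setup from this report, the path to check inside the container would be /usr/local/share/tessdata, the directory named in the error message.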

jbarth-ubhd commented 8 months ago

Still a problem: I can't run the Docker→Singularity image on the cluster. Older versions of OCR-D did work in Singularity.

jbarth-ubhd commented 6 months ago

I rebuilt ocrd.sif from the latest Docker image today (2024-Feb-21) and successfully processed a complete workflow with ocrd-tesserocr-recognize through Singularity.

bertsky commented 6 months ago

@jbarth-ubhd you need to use a named volume now (because we need to mix both the user-downloaded and the pre-installed models). In Docker, this would be -v ocrd-models:/models (or some other name).

Could you please try with a recent (last week) ocrd/tesserocr image?

jbarth-ubhd commented 6 months ago

Yes, see https://github.com/OCR-D/ocrd_tesserocr/issues/195#issuecomment-1957230110. I read the current documentation and changed the Singularity startup parameters accordingly.

bertsky commented 6 months ago

Oh, ok, so this can be closed (is what you are saying)?

jbarth-ubhd commented 6 months ago

Yes