OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Drop custom tessdata location, properly support moduledir #187

Closed. kba closed this 1 year ago

kba commented 2 years ago

This is the companion to OCR-D/core#904: it stops overriding the path to tessdata and instead uses the self.moduledir mechanism to tell core where the processor expects models to be stored.

Draft because OCR-D/core#904 needs to be finished before properly testing this.

codecov[bot] commented 2 years ago

Codecov Report

Merging #187 (1370666) into master (0427f6f) will decrease coverage by 1.42%. The diff coverage is 60.00%.

@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
- Coverage   28.91%   27.49%   -1.43%     
==========================================
  Files          12       11       -1     
  Lines        1404     1375      -29     
  Branches      331      326       -5     
==========================================
- Hits          406      378      -28     
- Misses        942      943       +1     
+ Partials       56       54       -2     
Impacted Files                      Coverage Δ
ocrd_tesserocr/config.py            100.00% <ø> (+18.18%)
ocrd_tesserocr/segment.py           36.00% <0.00%> (-1.50%)
ocrd_tesserocr/segment_table.py     34.61% <0.00%> (-1.39%)
ocrd_tesserocr/binarize.py          18.57% <50.00%> (ø)
ocrd_tesserocr/crop.py              14.28% <50.00%> (+0.61%)
ocrd_tesserocr/fontshape.py         17.64% <50.00%> (ø)
ocrd_tesserocr/recognize.py         25.71% <75.00%> (-1.51%)
ocrd_tesserocr/deskew.py            13.39% <100.00%> (ø)
ocrd_tesserocr/segment_line.py      96.15% <100.00%> (+0.15%)
ocrd_tesserocr/segment_region.py    96.42% <100.00%> (+0.13%)
... and 2 more


bertsky commented 1 year ago

So the situation with the CI failure is this: despite the recent changes to test_cli, we still cannot properly simulate overriding the TESSDATA_PREFIX envvar. The override does work on the CLI, but within pytest, it seems the in-line module initialization somehow runs before the monkeypatching takes effect.

(Running with --import-mode=append or --import-mode=importlib does not help, either.)
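For illustration, here is a minimal Python sketch of the suspected ordering problem. It assumes a module that reads TESSDATA_PREFIX once, while it is being imported. The module name `fake_tess` and its `PREFIX` attribute are hypothetical stand-ins, not tesserocr's actual code:

```python
import os
import types

# Hypothetical stand-in for a module that reads TESSDATA_PREFIX once,
# at import time (this is an assumption, not tesserocr's real source).
FAKE_MODULE_SOURCE = '''
import os
PREFIX = os.environ.get("TESSDATA_PREFIX", "/usr/share/tessdata")
'''


def import_fake_module():
    """Execute the module body once, as the first `import` would."""
    module = types.ModuleType("fake_tess")
    exec(FAKE_MODULE_SOURCE, module.__dict__)
    return module


def demo():
    """Return (value seen by the module, value in the environment)."""
    saved = os.environ.get("TESSDATA_PREFIX")
    try:
        os.environ["TESSDATA_PREFIX"] = "/models/original"
        module = import_fake_module()  # initialization happens here
        # A later override (what monkeypatch.setenv would do) comes too late.
        os.environ["TESSDATA_PREFIX"] = "/tmp/override"
        return module.PREFIX, os.environ["TESSDATA_PREFIX"]
    finally:
        # Restore the caller's environment.
        if saved is None:
            os.environ.pop("TESSDATA_PREFIX", None)
        else:
            os.environ["TESSDATA_PREFIX"] = saved
```

If the real initialization behaves like this, any override applied after the first import has no effect on the value the module already captured.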

So what do we do? Skip this kind of test? Use some other mechanism for running the CLI?

kba commented 1 year ago

> So what do we do? Skip this kind of test? Use some other mechanism for running the CLI?

We could set the envvar in the make test call, like we do for isolated logging tests in core - not the most elegant solution but probably easier to debug than the module-loading/pytest/monkeypatch mechanisms.

bertsky commented 1 year ago

Note: I found out that it's the other test modules which drag in tesserocr initialization before any test gets run. (To verify, copy test_cli.py somewhere else and run it alone.) Sadly, knowing this, I have still found no way to de-initialize this module (probably because it's Cython). For example,

    monkeypatch.delitem(sys.modules, 'tesserocr')
    monkeypatch.delitem(sys.modules, 'ocrd_tesserocr')

does not help. Neither does using importlib.reload.
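For contrast, this sketch shows what removing the `sys.modules` entry does for a pure-Python module: the next import re-executes the module body and picks up the new environment. The module `fake_tess_reimport` is a hypothetical throwaway written to a temporary directory. As reported above, tesserocr does not behave this way.

```python
import importlib
import os
import sys
import tempfile

MODULE_NAME = "fake_tess_reimport"  # hypothetical pure-Python module
MODULE_SOURCE = (
    "import os\n"
    "PREFIX = os.environ.get('TESSDATA_PREFIX', 'unset')\n"
)


def demo():
    """Return PREFIX after first import, cached import, and forced re-import."""
    saved = os.environ.get("TESSDATA_PREFIX")
    with tempfile.TemporaryDirectory() as tmpdir:
        with open(os.path.join(tmpdir, MODULE_NAME + ".py"), "w") as handle:
            handle.write(MODULE_SOURCE)
        sys.path.insert(0, tmpdir)
        importlib.invalidate_caches()
        try:
            os.environ["TESSDATA_PREFIX"] = "first"
            first = importlib.import_module(MODULE_NAME).PREFIX

            os.environ["TESSDATA_PREFIX"] = "second"
            # Still served from sys.modules: the module body is not re-run.
            cached = importlib.import_module(MODULE_NAME).PREFIX

            # Dropping the cache entry forces the body to run again.
            del sys.modules[MODULE_NAME]
            fresh = importlib.import_module(MODULE_NAME).PREFIX
            return first, cached, fresh
        finally:
            sys.path.remove(tmpdir)
            sys.modules.pop(MODULE_NAME, None)
            if saved is None:
                os.environ.pop("TESSDATA_PREFIX", None)
            else:
                os.environ["TESSDATA_PREFIX"] = saved
```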

bertsky commented 1 year ago

> We could set the envvar in the make test call, like we do for isolated logging tests in core - not the most elegant solution but probably easier to debug than the module-loading/pytest/monkeypatch mechanisms.

That would not allow setting isolated temporary directories, though. (Also, by setting TESSDATA_PREFIX for all tests, the other tests would not find our true model files.)

In the hope of fixing this properly some day, I just used the following workaround: running the tmpdir-based test_cli separately from the other tests.
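One possible alternative mechanism for running the CLI in isolation, sketched under assumptions: launch a fresh interpreter per test with its own TESSDATA_PREFIX, so no earlier import can interfere. This is not the workaround used in this PR, and the child command here is only a placeholder that echoes the variable. A real test would invoke the actual CLI instead.

```python
import os
import subprocess
import sys
import tempfile

# Placeholder child program: it only reports the variable it was started with.
CHILD_CODE = 'import os; print(os.environ["TESSDATA_PREFIX"])'


def run_with_prefix(prefix):
    """Run a fresh Python process with TESSDATA_PREFIX set to `prefix`."""
    env = dict(os.environ, TESSDATA_PREFIX=prefix)  # copy, do not mutate parent
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo():
    """Check that the child sees an isolated temporary directory."""
    with tempfile.TemporaryDirectory() as tmpdir:
        return run_with_prefix(tmpdir) == tmpdir
```

Because the environment is passed per call, each test can use its own temporary directory while the parent process, and therefore the other tests, keep the original TESSDATA_PREFIX.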