OCR-D / ocrd_tesserocr

Run Tesseract via the tesserocr bindings with @OCR-D's interfaces
MIT License

move to AlternativeImage feature selectors in OCR-D/core#294: #75

Closed bertsky closed 4 years ago

bertsky commented 5 years ago

Also fixes #61.

codecov[bot] commented 5 years ago

Codecov Report

Merging #75 into master will decrease coverage by 0.82%. The diff coverage is 22.72%.


@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
- Coverage   47.81%   46.99%   -0.83%     
==========================================
  Files           8        8              
  Lines         688      715      +27     
  Branches      130      134       +4     
==========================================
+ Hits          329      336       +7     
- Misses        326      346      +20     
  Partials       33       33
| Impacted Files | Coverage Δ |
|---|---|
| ocrd_tesserocr/binarize.py | 22.05% <0%> (-0.67%) ↓ |
| ocrd_tesserocr/deskew.py | 15.84% <3.44%> (-1.94%) ↓ |
| ocrd_tesserocr/crop.py | 13.76% <4%> (-1.78%) ↓ |
| ocrd_tesserocr/segment_word.py | 81.13% <66.66%> (ø) ↑ |
| ocrd_tesserocr/recognize.py | 53.5% <66.66%> (+1%) ↑ |
| ocrd_tesserocr/segment_line.py | 80.39% <66.66%> (ø) ↑ |
| ocrd_tesserocr/segment_region.py | 73.22% <75%> (+0.11%) ↑ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 9e5407d...4176747.

bertsky commented 5 years ago

> Technically, the changes proposed here carry over to OCRopus and anybaseocr. Does it make sense to add abstract wrappers to core for the single processing steps (i.e. ProcessorCrop) from which the module project implementations could derive?

Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the process method, with custom code scattered across different conditionals and loop bodies.

kba commented 5 years ago

> Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the process method, with custom code scattered across different conditionals and loop bodies.

Yeah, that would be tricky without changing the API and breaking existing code. It's a neat idea, though; I wish we had a wrapper around the process method that could accommodate specialization and do startup/cleanup work, but we don't.

(E.g. requiring processors to implement a protected _process method that the base class's process method would call. But it's too late for that now.)
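For illustration, a minimal sketch of the template-method pattern described here: the public `process` does the fixed startup/cleanup work and delegates the variable part to a protected `_process`. All names (`Processor`, `CropProcessor`, `_setup`, `_cleanup`) are hypothetical, not OCR-D core API:

```python
class Processor:
    """Hypothetical base class: process() owns the fixed workflow,
    subclasses only implement the protected _process() hook."""

    def process(self):
        self._setup()
        try:
            return self._process()   # the specialized step
        finally:
            self._cleanup()          # always runs, even on errors

    def _setup(self):
        self.log = ['setup']

    def _cleanup(self):
        self.log.append('cleanup')

    def _process(self):
        raise NotImplementedError

class CropProcessor(Processor):
    """A module project would derive and fill in only the custom step."""
    def _process(self):
        self.log.append('crop')
        return 'cropped'

p = CropProcessor()
p.process()   # returns 'cropped'; p.log == ['setup', 'crop', 'cleanup']
```

The catch mentioned above remains: this only works if the fixed parts really can be factored out of every processor's loop structure.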

bertsky commented 5 years ago

The last two commits, explained:

I noticed that Tesseract internally uses 8-bit grayscale as input for the LSTM models instead of its own Otsu binarization.

So I gathered that its binarization is only needed for the non-LSTM models and for layout analysis, and therefore added feature_filter='binarized' (i.e. requested raw images from the API). But it then turned out that existing models (in fact, existing training procedures) also feed on binarized data, so the network would be quite perplexed to see grayscale input at runtime. This is what I could (tentatively) validate by measurements on GT: CER shoots up when using raw images for SetImage. And it shoots up even more when cropped/rotated/masked areas are filled not with the background colour but with white or transparency (since transparency is reduced to white internally). So the network is more confused by white background re-appearing in a grayscale image than by white disappearing completely at binarization time.

Anyway, the revert is necessary to meet the expectations of current models, but the original commit could be re-activated once we have a different training procedure!
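To illustrate the feature-selector mechanics under discussion, here is a toy version of the selection logic, not the actual OCR-D/core#294 implementation (`select_image` and the `derived` list are made up for the example): pick the most recent derived image that carries all features in `feature_selector` and none in `feature_filter`.

```python
def select_image(alternatives, feature_selector='', feature_filter=''):
    """Toy selector: alternatives is a list of (features, image) pairs,
    oldest first; features is a comma-separated string as in PAGE
    AlternativeImage/@comments. Returns the newest matching image."""
    wanted = {f for f in feature_selector.split(',') if f}
    unwanted = {f for f in feature_filter.split(',') if f}
    for features, image in reversed(alternatives):
        have = {f for f in features.split(',') if f}
        if wanted <= have and not (unwanted & have):
            return image
    return None

derived = [('cropped', 'img_crop.png'),
           ('cropped,binarized', 'img_bin.png'),
           ('cropped,binarized,deskewed', 'img_desk.png')]

select_image(derived, feature_selector='binarized')  # newest binarized variant
select_image(derived, feature_filter='binarized')    # newest raw (non-binarized) variant
```

This is exactly the trade-off described above: with feature_filter='binarized' the processor gets raw grayscale, which current models were never trained on.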

bertsky commented 4 years ago

Here are my CER measurements (in percent) on two GT bags with textual annotation, in a workflow configuration similar to this (i.e. including Olena Wolf or Ocropy nlbin-nrm binarization, Ocropy page-level deskewing, clipping and resegmentation, and dewarping).

| Tesseract model | source | trained on |
|---|---|---|
| Fraktur | tessdata_best (Google) | binarized artificial images |
| GT4HistOCR | tesstrain (UB Mannheim) | greyscale-normalized scanned images |

On euler_rechenkunst01_1738 (a bag with JPEG artifacts and lower resolution):

| input | Fraktur | GT4HistOCR |
|---|---|---|
| binarized (Wolf) | 9.7 | 11.9 |
| raw with transparent/white background | 19.7 | 9.9 |
| raw with estimated background color | 12.9 | 61.3 |
| greyscale-normalized (ocropy-nlbin) | 9.4 | 9.8 |

Results are similar in tendency for weigel_gnothi02_1618 (which is cleaner):

| input | Fraktur | GT4HistOCR |
|---|---|---|
| binarized (Wolf) | 14.0 | 7.3 |
| raw with transparent/white background | 36.5 | 8.3 |
| raw with estimated background color | 13.8 | 58.6 |
| greyscale-normalized (ocropy-nlbin) | 13.5 | 7.0 |

So Tesseract gets perplexed …
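For context on the "estimated background color" rows above: one simple way to estimate a page background is the median of the border pixels. A minimal pure-Python sketch, where this heuristic is an assumption for illustration, not necessarily what the processors actually do:

```python
from statistics import median

def estimate_background(img, border=2):
    """Crude page-background estimate: median grey value of the pixels
    in a `border`-wide frame around the image (list-of-lists greyscale)."""
    h, w = len(img), len(img[0])
    samples = [img[y][x]
               for y in range(h) for x in range(w)
               if y < border or y >= h - border
               or x < border or x >= w - border]
    return median(samples)

# toy greyscale page: light-grey background with a dark text block
page = [[200] * 80 for _ in range(100)]
for y in range(40, 60):
    for x in range(30, 50):
        page[y][x] = 20

estimate_background(page)  # 200
```

Filling masked areas with such an estimate keeps them close to the surrounding paper tone, whereas white or transparent fill produces the out-of-distribution input the measurements above penalize.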

wrznr commented 4 years ago

It is striking that both the stock and the GT4HistOCR model perform so poorly! A CER between 7 and 9% is a) simply not good enough, and b) far worse than the numbers @stweil reported at the OCR-D developer workshop.

stweil commented 4 years ago

@wrznr, CER often needs a closer look, especially at what kinds of errors occurred (maybe missing normalization?).

@bertsky, how can I reproduce your setting?

bertsky commented 4 years ago

> It is striking that both the stock and the GT4HistOCR model perform so poorly! A CER between 7 and 9% is a) simply not good enough, and b) far worse than the numbers @stweil reported at the OCR-D developer workshop.

True! That was also one of my messages at the workshop. Generally, I am quite certain this is due to the relatively bad quality of our GT:

Despite all the preprocessing and resegmentation efforts, we cannot get below 11% CER on the whole dataset with the stock models. And I don't believe you would get much better results if you trained/fine-tuned a dedicated OCR model on our GT. But maybe @stweil wants to disprove that?

GT4HistOCR also looks much cleaner. If all they really did for preprocessing was run ocropus-nlbin and ocropus-gpageseg, then their raw data was already a lot easier to begin with. (And as already argued, this might simply be the result of filtering out the lesser-quality lines.)

So, I think the situation demands:

bertsky commented 4 years ago

Example from our preprocessing/resegmentation pipeline: cleanly separated from neighbours, but still quite noisy

[image: OCR-D-IMG-DEWARP_0001_r_3_1_tl_10]

Example from GT4HistOCR: very clean, but skewed/warped

[image: 00008 nrm]

wrznr commented 4 years ago

@bertsky @stweil So our impression

> And as already argued, this might simply be the result of filtering out the lesser-quality lines.

that GT4HistOCR is suboptimal for OCR training is real.

> And I don't believe you would get much better results if you trained/fine-tuned a dedicated OCR model on our GT.

I am with you. But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

bertsky commented 4 years ago

@stweil The above link should work with the current versions. But it's probably best to use the PRs I have used: OCR-D/core#311, OCR-D/ocrd_tesserocr#75 and cisocrgroup/ocrd_cis#16, plus master cor-asv-ann (for evaluation).

CER measurement in cor-asv-ann-evaluate works as documented: vanilla Levenshtein, no normalization. This is best for comparability; results might look better (and be fairer) with different metrics and normalizations. The package offers some, but results look quite similar with other settings, even NFC.
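For reference, vanilla Levenshtein CER as used above can be sketched in a few lines (a generic textbook implementation, not the cor-asv-ann-evaluate code). Note how, without Unicode normalization, a long s (ſ) versus round s counts as a full character error:

```python
def levenshtein(a, b):
    """Plain (unweighted) Levenshtein edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(gt, ocr):
    """Character error rate: edit distance over ground-truth length."""
    return levenshtein(gt, ocr) / len(gt)

# without normalization, ſ vs s is one error in ten characters:
cer("Grundſätze", "Grundsätze")  # 0.1
```

With NFC or historical-spelling normalization applied first, such pairs would no longer count against the model, which is why the metric choice matters for comparability.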

bertsky commented 4 years ago

> But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

I agree, both degradation and binarization should be employed to make GT4HistOCR models robust.
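As an illustration of the binarization half: global Otsu thresholding (which Tesseract also uses internally) can be computed on a greyscale histogram in pure Python. This is a generic textbook version for illustration, not tesstrain's augmentation code:

```python
def otsu_threshold(hist):
    """Otsu's method on a 256-bin greyscale histogram: return the
    threshold t that maximizes between-class variance."""
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w_bg = sum_bg = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]                    # weight of class <= t
        if w_bg == 0:
            continue
        w_fg = total - w_bg                # weight of class > t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        m_bg = sum_bg / w_bg               # mean of class <= t
        m_fg = (sum_all - sum_bg) / w_fg   # mean of class > t
        var = w_bg * w_fg * (m_bg - m_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# bimodal toy histogram: ink around 50, paper around 200
hist = [0] * 256
hist[50] = hist[51] = 300
hist[200] = hist[201] = 700
t = otsu_threshold(hist)   # lands between the ink and paper modes
```

Binarizing (and degrading) GT4HistOCR line images with something like this during training would make the resulting models robust to the binarized input they currently never see.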

wrznr commented 4 years ago

@bertsky Is there a PR in core which blocks merging here?

bertsky commented 4 years ago

> Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

@stweil Is tesseract-ocr/tesstrain#73 the right place for this? Or better open a new issue strictly about binarization/augmentation (not specific to GT4HistOCR)?

BTW, according to my measurements, Fraktur performs similarly on GT4HistOCR data, though not quite as badly as on our GT. But of course, it really depends on the subcorpus/century:

bertsky commented 4 years ago

> Is there a PR in core which blocks merging here?

No, not really. OCR-D/core#311 is related, but not a dependency. Thanks, I will merge this for now.