Closed: bertsky closed this 4 years ago
Also fixes #61.
Merging #75 into master will decrease coverage by 0.82%. The diff coverage is 22.72%.
```
@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
- Coverage   47.81%   46.99%    -0.83%
==========================================
  Files           8        8
  Lines         688      715      +27
  Branches      130      134       +4
==========================================
+ Hits          329      336       +7
- Misses        326      346      +20
  Partials       33       33
```
Impacted Files | Coverage Δ |
---|---|
ocrd_tesserocr/binarize.py | 22.05% <0%> (-0.67%) :arrow_down: |
ocrd_tesserocr/deskew.py | 15.84% <3.44%> (-1.94%) :arrow_down: |
ocrd_tesserocr/crop.py | 13.76% <4%> (-1.78%) :arrow_down: |
ocrd_tesserocr/segment_word.py | 81.13% <66.66%> (ø) :arrow_up: |
ocrd_tesserocr/recognize.py | 53.5% <66.66%> (+1%) :arrow_up: |
ocrd_tesserocr/segment_line.py | 80.39% <66.66%> (ø) :arrow_up: |
ocrd_tesserocr/segment_region.py | 73.22% <75%> (+0.11%) :arrow_up: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e5407d...4176747. Read the comment docs.
Technically, the changes proposed here carry over to OCRopus and `anybaseocr`. Does it make sense to add abstract wrappers to `core` for the single processing steps (i.e. `ProcessorCrop`) from which the module project implementations could derive?
Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the `process` method, with custom code scattered across different conditionals and loop bodies.
> Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the `process` method, with custom code scattered across different conditionals and loop bodies.
Yeah, that would be tricky without changing the API and breaking existing code. It's a neat idea, though, and I wish we had a wrapper around the `process` method that could accommodate specialization and do startup/cleanup work, but we haven't (e.g. requiring processors to implement a protected `_process` method that the base class's `process` method would call — but it's too late for that now).
The last 2 commits explained:
I noticed that Tesseract internally uses 8-bit grayscale as feed for the LSTM models instead of its own Otsu binarization. So I gathered its binarization is only needed for the non-LSTM models and layout analysis, and therefore added `feature_filter='binarized'` (i.e. requested raw images from the API). But then it turned out that existing models (in fact, existing training procedures) also feed binarized data. So the network would be quite perplexed to see grayscale input at runtime. This is also what I was able to validate (tentatively) by measurements on GT: CER shoots up when using raw images for `SetImage`. And it shoots up even more when cropped/rotated/masked areas are filled not with the background colour but with white or transparent (since transparency is reduced to white internally). So the network is more confused by white background re-appearing in a grayscale image than by white disappearing completely at runtime.
Anyway, the revert is necessary to meet the expectations of current models, but the original commit could be re-activated once we have a different training procedure!
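For readers unfamiliar with the `feature_filter` mechanism: OCR-D annotates each derived image with a comma-separated feature string (e.g. `cropped,binarized`), and a filter requests the most derived image that does *not* carry any of the filtered features. The following is a toy illustration of that selection logic, not core's actual implementation:

```python
# Toy illustration of OCR-D's feature_filter semantics (not the actual
# core implementation). Each derived image carries a comma-separated
# feature string; the filter picks the most derived image that does NOT
# carry any of the filtered features.

def pick_image(images, feature_filter=''):
    """images: list of (name, features) in derivation order."""
    filtered = set(f for f in feature_filter.split(',') if f)
    for name, features in reversed(images):
        if not filtered & set(features.split(',')):
            return name
    return None

derived = [
    ('raw.png', ''),                    # original page image
    ('crop.png', 'cropped'),            # after cropping
    ('bin.png', 'cropped,binarized'),   # after binarization
]

pick_image(derived)                # → 'bin.png' (most derived image)
pick_image(derived, 'binarized')   # → 'crop.png' (skips binarized result)
```

So `feature_filter='binarized'` is how the commit requested raw (non-binarized) images for `SetImage`.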
Here are my CER measurements (in percent) on 2 GT bags with textual annotation, in a workflow configuration similar to this (i.e. including Olena Wolf or Ocropy nlbin-nrm binarization, Ocropy page-level deskewing, clipping and resegmentation, and dewarping).
Tesseract model | source | trained on |
---|---|---|
Fraktur | tessdata_best (Google) | binarized artificial images |
GT4HistOCR | tesstrain (UB Mannheim) | greyscale-normalized scanned images |
On `euler_rechenkunst01_1738` (a bag with JPEG artifacts and lower resolution):
input | Fraktur | GT4HistOCR |
---|---|---|
binarized (Wolf) | 9.7 | 11.9 |
raw with transparent/white background | 19.7 | 9.9 |
raw with estimated background color | 12.9 | 61.3 |
greyscale-normalized (ocropy-nlbin) | 9.4 | 9.8 |
Results are similar in tendency for `weigel_gnothi02_1618` (which is cleaner):
input | Fraktur | GT4HistOCR |
---|---|---|
binarized (Wolf) | 14.0 | 7.3 |
raw with transparent/white background | 36.5 | 8.3 |
raw with estimated background color | 13.8 | 58.6 |
greyscale-normalized (ocropy-nlbin) | 13.5 | 7.0 |
So Tesseract gets perplexed …

(Screenshots omitted: example recognition results with `binarized` and with `grayscale_normalized` input.)
It is striking that both the stock and the GT4HistOCR models perform so poorly! A CER between 7 and 9% is (a) simply not good enough and (b) way below the numbers @stweil reported at the OCR-D developer workshop.
@wrznr, CER often needs a closer look, especially at which kinds of errors occurred (maybe missing normalization?).
@bertsky, how can I reproduce your setting?
> It is striking that both the stock and the GT4HistOCR models perform so poorly! A CER between 7 and 9% is (a) simply not good enough and (b) way below the numbers @stweil reported at the OCR-D developer workshop.
True! That was also one of my messages at the workshop. Generally, I am quite certain this is due to the relatively bad quality of our GT:
Despite all the preprocessing and resegmentation efforts, we are not able to squeeze less than 11% CER out of the whole dataset with the stock models. And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT. But maybe @stweil wants to disprove that?
GT4HistOCR also looks much cleaner. If really all they did for preprocessing was running `ocropus-nlbin` and `ocropus-gpageseg`, then their raw data were already a lot easier to begin with. (And, as already argued, this might simply be the result of filtering out the lesser-quality lines.)
So, I think the situation demands:

- better despeckling: at the moment, we only have ocrd-cis-ocropy-denoise (which wraps `ocrolib.remove_noise` for black noise), so maybe we should at least make it work for white noise as well (@Doreenruirui, I think I might just do that). But really it's time we finally got hold of some data-driven despeckling: @n00blet @mjenckel, can you offer something?
- better deskewing: we currently rely on ocrd-cis-ocropy-deskew and ocrd-tesserocr-deskew, both of which improve overall results a little, but often enough fail miserably for no good (visually convincing) reason. Again, it's time we finally got some data-driven deskewing: I was struggling to get ocrorot running, but I am sure there must be more tools out there. @n00blet @mjenckel, can you offer something?
- even better binarization
- semi-interactively improving GT segmentation (aided by our existing clipping/resegmentation/repair tools)
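On the black-noise/white-noise point: a despeckler that removes small black connected components handles white noise for free if you invert the image first. A minimal sketch of that idea (this is not `ocrolib.remove_noise`, just an illustration on nested-list binary images):

```python
# Toy despeckler (not ocrolib.remove_noise): drops black connected
# components of at most `maxsize` pixels from a binary image
# (1 = black ink, 0 = white background). White noise can be removed
# with the same routine by inverting the image first.

def despeckle(img, maxsize=2):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                # flood-fill the 4-connected black component
                stack, comp = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(comp) <= maxsize:
                    for cy, cx in comp:
                        out[cy][cx] = 0   # erase the speckle
    return out

def invert(img):
    return [[1 - v for v in row] for row in img]

def despeckle_white(img, maxsize=2):
    # white-noise removal = invert, despeckle, invert back
    return invert(despeckle(invert(img), maxsize))
```

Of course a real despeckler would work on actual raster images and pick `maxsize` relative to resolution; this only shows why extending the existing wrapper to white noise is cheap.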
Example from our preprocessing/resegmentation pipeline (image omitted): cleanly separated from neighbours, but still quite noisy.
Example from GT4HistOCR (image omitted): very clean, but skewed/warped.
@bertsky @stweil So our impression

> And as already argued, this might be simply the result of filtering out the lesser-quality lines.

that GT4HistOCR is suboptimal for OCR training is real.
> And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT.

I am with you. But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?
@stweil The above link should be workable on the current versions. But it's probably best to use the PRs I have used: OCR-D/core#311 and OCR-D/ocrd_tesserocr#75 and cisocrgroup/ocrd_cis#16 and master cor-asv-ann (for evaluation).
CER measurement in cor-asv-ann-evaluate works as documented: vanilla Levenshtein, no normalization. This is best for comparability; results might look better (and be more fair) with different metrics and normalizations. The package offers some, but results look quite similar with other settings, even NFC.
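For reference, the "vanilla Levenshtein, no normalization" metric described here amounts to the following (a plain re-implementation of the metric, not cor-asv-ann code):

```python
# Plain character error rate: unweighted Levenshtein edit distance
# between OCR output and ground truth, divided by ground-truth length.
# No Unicode normalization, matching the setting described above.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ocr, gt):
    return levenshtein(ocr, gt) / len(gt)

cer("Fraktvr", "Fraktur")   # one substitution in seven characters ≈ 14.3%
```

Note that without normalization, e.g. a precomposed `ä` vs `a` + combining diaeresis counts as an error, which is exactly why NFC can shift the numbers slightly.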
> But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?
I agree, both degradation and binarization should be employed to make GT4HistOCR models robust.
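The simplest form of the degradation idea is synthetic salt-and-pepper noise on the clean training lines. A minimal, purely illustrative sketch (a real augmentation pipeline would also add blur, skew, warping, etc., e.g. via ocrodeg):

```python
import random

# Toy degradation for augmenting clean binary training images
# (1 = black, 0 = white) with salt-and-pepper noise. Illustrative only;
# real pipelines would also vary blur, skew, warping, contrast, etc.

def salt_and_pepper(img, prob=0.05, rng=None):
    """Flip each pixel independently with probability `prob`."""
    rng = rng or random.Random(0)   # fixed seed for reproducible augmentation
    return [[1 - v if rng.random() < prob else v for v in row]
            for row in img]
```

Training on a mix of clean and degraded copies of each line is one cheap way to make the models less brittle against the speckle and bleed-through seen in our GT.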
@bertsky Is there a PR in core which blocks merging here?
> Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?
@stweil Is tesseract-ocr/tesstrain#73 the right place for this? Or better open a new issue strictly about binarization/augmentation (not specific to GT4HistOCR)?
BTW, according to my measurements, Fraktur performs similarly on GT4HistOCR data, although not quite as badly as on our GT. But of course, it really depends on the subcorpus/century:

- `dta19` and `RefCorpus-ENHG-Incunabula` and `Kallimachos`: 7.0% CER
- `dta19` only: 4.9% CER
Since our GT is more diverse / has less 19th-century share, worse results can be expected (independent of bad segmentation / preprocessing).

> Is there a PR in core which blocks merging here?

No, not really. OCR-D/core#311 is related, but not a dependency. Thanks, I will merge this for now.