Closed: bertsky closed this 4 years ago
Also fixes #61.
Merging #75 into master will decrease coverage by 0.82%. The diff coverage is 22.72%.
```
@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
- Coverage   47.81%   46.99%    -0.83%
==========================================
  Files           8        8
  Lines         688      715      +27
  Branches      130      134       +4
==========================================
+ Hits          329      336       +7
- Misses        326      346      +20
  Partials       33       33
```
Impacted Files | Coverage Δ |
---|---|
ocrd_tesserocr/binarize.py | 22.05% <0%> (-0.67%) :arrow_down: |
ocrd_tesserocr/deskew.py | 15.84% <3.44%> (-1.94%) :arrow_down: |
ocrd_tesserocr/crop.py | 13.76% <4%> (-1.78%) :arrow_down: |
ocrd_tesserocr/segment_word.py | 81.13% <66.66%> (ø) :arrow_up: |
ocrd_tesserocr/recognize.py | 53.5% <66.66%> (+1%) :arrow_up: |
ocrd_tesserocr/segment_line.py | 80.39% <66.66%> (ø) :arrow_up: |
ocrd_tesserocr/segment_region.py | 73.22% <75%> (+0.11%) :arrow_up: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e5407d...4176747. Read the comment docs.
Technically, the changes proposed here carry over to OCRopus and `anybaseocr`. Does it make sense to add abstract wrappers to `core` for the single processing steps (i.e. `ProcessorCrop`) from which the module project implementations could derive?
Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the `process` method, with custom code scattered across different conditionals and loop bodies.
> Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the `process` method, with custom code scattered across different conditionals and loop bodies.
Yeah, that would be tricky without changing the API and breaking existing code. It's a neat idea, though, and I wish we had a wrapper around the `process` method that could accommodate specialization and do startup/cleanup work, but we haven't (e.g. requiring processors to implement a protected `_process` method that the base class's `process` method would call — but it's too late for that now).
The last 2 commits explained:
I noticed that Tesseract internally uses 8-bit grayscale as feed for the LSTM models instead of its own Otsu binarization. So I gathered its binarization is only needed for the non-LSTM models and layout analysis, and therefore added `feature_filter='binarized'` (i.e. requested raw images from the API). But then it turned out that existing models (in fact, existing training procedures) also feed binarized data. So the network would be quite perplexed to see grayscale input at runtime. This is also what I was able to validate (tentatively) by measurements on GT: CER shoots up when using raw images for `SetImage`. And it shoots up even more when cropped/rotated/masked areas are filled not with the background colour but with white or transparent (since transparency is reduced to white internally). So the network is more confused by white background re-appearing in a grayscale image than by white disappearing completely at runtime.
Anyway, the revert is necessary to meet the expectations of current models, but the original commit could be re-activated once we have a different training procedure!
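For readers unfamiliar with the `feature_filter` mechanism: OCR-D annotates each derived image with a comma-separated feature string (e.g. `cropped,binarized`), and a filter requests the most derived image that does *not* carry any of the filtered features. The following is a toy illustration of that selection logic, not core's actual implementation:

```python
# Toy illustration of OCR-D's feature_filter semantics (not the actual
# core implementation). Each derived image carries a comma-separated
# feature string; the filter picks the most derived image that does NOT
# carry any of the filtered features.

def pick_image(images, feature_filter=''):
    """images: list of (name, features) in derivation order."""
    filtered = set(f for f in feature_filter.split(',') if f)
    for name, features in reversed(images):
        if not filtered & set(features.split(',')):
            return name
    return None

derived = [
    ('raw.png', ''),                    # original page image
    ('crop.png', 'cropped'),            # after cropping
    ('bin.png', 'cropped,binarized'),   # after binarization
]

pick_image(derived)                # → 'bin.png' (most derived image)
pick_image(derived, 'binarized')   # → 'crop.png' (skips binarized result)
```

So `feature_filter='binarized'` is how the commit requested raw (non-binarized) images for `SetImage`.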
Here are my CER measurements (in percent) on 2 GT bags with textual annotation, in a workflow configuration similar to this (i.e. including Olena Wolf or Ocropy nlbin-nrm binarization, Ocropy page-level deskewing, clipping and resegmentation, and dewarping).
Tesseract model | source | trained on |
---|---|---|
Fraktur | tessdata_best (Google) | binarized artificial images |
GT4HistOCR | tesstrain (UB Mannheim) | greyscale-normalized scanned images |
On `euler_rechenkunst01_1738` (a bag with JPEG artifacts and lower resolution):
input | Fraktur | GT4HistOCR |
---|---|---|
binarized (Wolf) | 9.7 | 11.9 |
raw with transparent/white background | 19.7 | 9.9 |
raw with estimated background color | 12.9 | 61.3 |
greyscale-normalized (ocropy-nlbin) | 9.4 | 9.8 |
Results are similar in tendency for `weigel_gnothi02_1618` (which is cleaner):
input | Fraktur | GT4HistOCR |
---|---|---|
binarized (Wolf) | 14.0 | 7.3 |
raw with transparent/white background | 36.5 | 8.3 |
raw with estimated background color | 13.8 | 58.6 |
greyscale-normalized (ocropy-nlbin) | 13.5 | 7.0 |
So Tesseract gets perplexed …

(Screenshots omitted: example recognition results with `binarized` and with `grayscale_normalized` input.)
It is striking that both the stock and the GT4HistOCR models perform so poorly! A CER between 7 and 9% is (a) simply not good enough and (b) way below the numbers @stweil reported at the OCR-D developer workshop.
@wrznr, CER often needs a closer look, especially at which kinds of errors occurred (maybe missing normalization?).
@bertsky, how can I reproduce your setting?
> It is striking that both the stock and the GT4HistOCR models perform so poorly! A CER between 7 and 9% is (a) simply not good enough and (b) way below the numbers @stweil reported at the OCR-D developer workshop.
True! That was also one of my messages at the workshop. Generally, I am quite certain this is due to the relatively bad quality of our GT:
Despite all the preprocessing and resegmentation efforts, we are not able to squeeze less than 11% CER out of the whole dataset with the stock models. And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT. But maybe @stweil wants to disprove that?
GT4HistOCR also looks much cleaner. If really all they did for preprocessing was running `ocropus-nlbin` and `ocropus-gpageseg`, then their raw data were already a lot easier to begin with. (And, as already argued, this might simply be the result of filtering out the lesser-quality lines.)
So, I think the situation demands:

- better despeckling: at the moment, we only have ocrd-cis-ocropy-denoise (which wraps `ocrolib.remove_noise` for black noise), so maybe we should at least make it work for white noise as well (@Doreenruirui, I think I might just do that). But really it's time we finally got hold of some data-driven despeckling: @n00blet @mjenckel, can you offer something?
- better deskewing: we currently rely on ocrd-cis-ocropy-deskew and ocrd-tesserocr-deskew, both of which improve overall results a little, but often enough fail miserably for no good (visually convincing) reason. Again, it's time we finally got some data-driven deskewing: I was struggling to get ocrorot running, but I am sure there must be more tools out there. @n00blet @mjenckel, can you offer something?
- even better binarization
- semi-interactively improving GT segmentation (aided by our existing clipping/resegmentation/repair tools)
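On the black-noise/white-noise point: a despeckler that removes small black connected components handles white noise for free if you invert the image first. A minimal sketch of that idea (this is not `ocrolib.remove_noise`, just an illustration on nested-list binary images):

```python
# Toy despeckler (not ocrolib.remove_noise): drops black connected
# components of at most `maxsize` pixels from a binary image
# (1 = black ink, 0 = white background). White noise can be removed
# with the same routine by inverting the image first.

def despeckle(img, maxsize=2):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                # flood-fill the 4-connected black component
                stack, comp = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(comp) <= maxsize:
                    for cy, cx in comp:
                        out[cy][cx] = 0   # erase the speckle
    return out

def invert(img):
    return [[1 - v for v in row] for row in img]

def despeckle_white(img, maxsize=2):
    # white-noise removal = invert, despeckle, invert back
    return invert(despeckle(invert(img), maxsize))
```

Of course a real despeckler would work on actual raster images and pick `maxsize` relative to resolution; this only shows why extending the existing wrapper to white noise is cheap.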
Example from our preprocessing/resegmentation pipeline (image omitted): cleanly separated from neighbours, but still quite noisy.
Example from GT4HistOCR (image omitted): very clean, but skewed/warped.
@bertsky @stweil So our impression

> And as already argued, this might be simply the result of filtering out the lesser-quality lines.

that GT4HistOCR is suboptimal for OCR training is real.
> And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT.

I am with you. But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?
@stweil The above link should be workable on the current versions. But it's probably best to use the PRs I have used: OCR-D/core#311 and OCR-D/ocrd_tesserocr#75 and cisocrgroup/ocrd_cis#16 and master cor-asv-ann (for evaluation).
CER measurement in cor-asv-ann-evaluate works as documented: vanilla Levenshtein, no normalization. This is best for comparability; results might look better (and be more fair) with different metrics and normalizations. The package offers some, but results look quite similar with other settings, even NFC.
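For reference, the "vanilla Levenshtein, no normalization" metric described here amounts to the following (a plain re-implementation of the metric, not cor-asv-ann code):

```python
# Plain character error rate: unweighted Levenshtein edit distance
# between OCR output and ground truth, divided by ground-truth length.
# No Unicode normalization, matching the setting described above.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ocr, gt):
    return levenshtein(ocr, gt) / len(gt)

cer("Fraktvr", "Fraktur")   # one substitution in seven characters ≈ 14.3%
```

Note that without normalization, e.g. a precomposed `ä` vs `a` + combining diaeresis counts as an error, which is exactly why NFC can shift the numbers slightly.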
> But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?
I agree, both degradation and binarization should be employed to make GT4HistOCR models robust.
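The simplest form of the degradation idea is synthetic salt-and-pepper noise on the clean training lines. A minimal, purely illustrative sketch (a real augmentation pipeline would also add blur, skew, warping, etc., e.g. via ocrodeg):

```python
import random

# Toy degradation for augmenting clean binary training images
# (1 = black, 0 = white) with salt-and-pepper noise. Illustrative only;
# real pipelines would also vary blur, skew, warping, contrast, etc.

def salt_and_pepper(img, prob=0.05, rng=None):
    """Flip each pixel independently with probability `prob`."""
    rng = rng or random.Random(0)   # fixed seed for reproducible augmentation
    return [[1 - v if rng.random() < prob else v for v in row]
            for row in img]
```

Training on a mix of clean and degraded copies of each line is one cheap way to make the models less brittle against the speckle and bleed-through seen in our GT.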
@bertsky Is there a PR in core which blocks merging here?
> Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?
@stweil Is tesseract-ocr/tesstrain#73 the right place for this? Or better open a new issue strictly about binarization/augmentation (not specific to GT4HistOCR)?
BTW, according to my measurements, Fraktur performs similarly on GT4HistOCR data, although not quite as badly as on our GT. But of course, it really depends on the subcorpus/century:

- `dta19` and `RefCorpus-ENHG-Incunabula` and `Kallimachos`: 7.0% CER
- `dta19` only: 4.9% CER
Since our GT is more diverse / has less 19th-century share, worse results can be expected (independent of bad segmentation / preprocessing).

> Is there a PR in core which blocks merging here?

No, not really. OCR-D/core#311 is related, but not a dependency. Thanks, I will merge this for now.