OCR-D / ocrd_kraken

Wrapper for the kraken OCR engine
Apache License 2.0

Rewrite #33

Closed kba closed 2 years ago

kba commented 3 years ago

Rewrite the segmentation and add recognition with support for the upcoming kraken 3.0

TODO

bertsky commented 3 years ago

Here's an example:

Another example:

bertsky commented 3 years ago

And here's a crop from the first example again, this time after 232a055 (enlarging regions to avoid extruding lines) and using PageViewer with https://github.com/PRImA-Research-Lab/prima-page-viewer/pull/18 to show the baselines:

FILE_0002_LINES_KRAKEN_pageviewer-baselines

Perhaps @mittagessen should comment on what we are seeing here. Is this what we should expect from the blla segmenter, or are these caused by bugs in kraken or bad wrapping in ocrd_kraken?

mittagessen commented 3 years ago

I guess it's just (very) crappy output, caused by a combination of the default model being trained only on handwritten text and the high line count on the page, which tends to cause the line merging you're seeing. We changed the postprocessing a few weeks ago; it is a bit more sensitive to low-confidence detections but solved a number of other rather annoying problems. Retraining the default model at a larger resolution than now should largely resolve the problem and is fairly high up on my todo list.

Nothing wrong on your part. Try manuscripts next time ;). I'd also gladly take high line-count (>30) print datasets to incorporate that into the default model training data.

bertsky commented 3 years ago

I guess it's just (very) crappy output, caused by a combination of the default model being trained only on handwritten text and the high line count on the page, which tends to cause the line merging you're seeing. We changed the postprocessing a few weeks ago; it is a bit more sensitive to low-confidence detections but solved a number of other rather annoying problems. Retraining the default model at a larger resolution than now should largely resolve the problem and is fairly high up on my todo list. Nothing wrong on your part. Try manuscripts next time ;). I'd also gladly take high line-count (>30) print datasets to incorporate that into the default model training data.

Thanks @mittagessen for these explanations. That makes me wonder whether you train on (fixed-size) crops/tiles or on full images (see here for a study of which tiling options work best), and whether you account for different pixel densities (see here for how the Qurator team deals with this). (Using Mask-RCNN I used to get problems when mixing books and newspapers, especially when DPI varied...) Could you please elaborate on your take regarding these aspects?

As to datasets, how about PubLayNet (very large, but modern/synthetic), and datasets listed here under metric: pages (mid-size, historic)?

(EDIT: Some of these would need additional effort to get to text lines; or could you train the regions exclusively with blla?)

@kba maybe in light of this, it makes most sense to also wrap both the blla and legacy line detectors in a region2line mode (or level-of-operation=region)?

mittagessen commented 3 years ago

Could you please elaborate on your take regarding these aspects?

The net is trained on full pages with a normalized page height, by default 1200px, to keep the memory consumption of the method below 5GB. As mentioned, the line merging disappears for all but the craziest material (rotuli, maps, some inscriptions, newspapers) when increasing this to 1600px (~33% line separation at net scale). That's been a design decision since the first iteration of the method (U-Net), as even with the standard tiling techniques we never got completely rid of border effects. The current method is more or less a ReNet, which might perform even worse with the reduced context tiling provides, but I haven't evaluated that extensively. It is on our radar though, as being able to process the crazy stuff is our shtick.

We haven't encountered any issues relating to scale as described by the Qurator people. Anything between 75dpi and 600dpi+ seems to work reasonably well with the same model, even if it wasn't trained on that resolution and with different input heights of the model. I'd guess that's largely because even low-resolution scans are at worst only ~50% smaller than the 'native' input size, so resizing effects are rather modest.
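
For concreteness, the height normalization described above amounts to something like the following (a minimal Pillow sketch, not kraken's actual preprocessing code):

from PIL import Image

def normalize_height(img, target_height=1200):
    # scale the page isotropically to the fixed net input height
    scale = target_height / img.height
    return img.resize((max(1, round(img.width * scale)), target_height),
                      Image.LANCZOS)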

As to datasets, how about PubLayNet (very large, but modern/synthetic)

Unfortunately, that one doesn't have the baselines necessary to train the segmenter.

datasets listed here under metric: pages

I'll sift through these in a bit, thx.

bertsky commented 3 years ago

As to datasets, how about PubLayNet (very large, but modern/synthetic) Unfortunately, that one doesn't have the baselines necessary to train the segmenter.

Yes, and it has more issues. Unfortunately, they did not publish the method or data of their PDF-XML alignment, so all we could do is post-process.

But...

(EDIT: Some of these would need additional effort to get to text lines; or could you train the regions exclusively with blla?)

...is that even an option?

I'll sift through these in a bit, thx.

Also, there are quite a few more (not yet properly listed) under https://github.com/cneud/ocr-gt/issues

mittagessen commented 3 years ago

...is that even an option?

Yes, you can train only regions or only lines (or a subset of types of either). The code actually supports multi-model inference for segmentation as well so you'd be able to mix and match models to your particular use-case. Of course with great flexibility comes great potential for blowing one's foot off.

mittagessen commented 3 years ago

BTW, about the reading order question above. It's a bit complicated, as the segmenter is designed to allow detection of non-textual regions such as stamps. Thus, regions are by definition unordered, but textual regions (anything that contains lines) are treated as dummy lines for the purpose of determining the reading order, e.g. (L = line, R = region): L0 - L1 - R0 (L5 L4 L3) - L2 - R1 (L7 L8), with the actual lines in R0 and R1 being ordered separately and substituted afterwards, so the final output is L0-L1-L5-L4-L3-L2-L7-L8.
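
The substitution step can be sketched as follows (illustrative Python, not kraken's actual code):

def flatten_reading_order(top_level, region_lines):
    # top_level: ordered ids of lines and (dummy-line) textual regions;
    # region_lines: region id -> its separately ordered line ids
    order = []
    for item in top_level:
        order.extend(region_lines.get(item, [item]))
    return order

# L0 - L1 - R0 (L5 L4 L3) - L2 - R1 (L7 L8)
print(flatten_reading_order(['L0', 'L1', 'R0', 'L2', 'R1'],
                            {'R0': ['L5', 'L4', 'L3'], 'R1': ['L7', 'L8']}))
# ['L0', 'L1', 'L5', 'L4', 'L3', 'L2', 'L7', 'L8']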

That will change mid-term though as we need a more capable reading order thingy for the semitic abjads and parallel texts and such. That one will have a more explicit ordering with additional semantics attached.

bertsky commented 3 years ago

BTW, about the reading order question above. It's a bit complicated, as the segmenter is designed to allow detection of non-textual regions such as stamps. Thus, regions are by definition unordered, but textual regions (anything that contains lines) are treated as dummy lines for the purpose of determining the reading order, e.g. (L = line, R = region): L0 - L1 - R0 (L5 L4 L3) - L2 - R1 (L7 L8), with the actual lines in R0 and R1 being ordered separately and substituted afterwards, so the final output is L0-L1-L5-L4-L3-L2-L7-L8.

Ok, that's what I do as well in my Ocropy fork – except that I use recursive X-Y cut for region segmentation/grouping.

So, @kba we should try to wrap this functionality for PAGE here, too.

That will change mid-term though as we need a more capable reading order thingy for the semitic abjads and parallel texts and such. That one will have a more explicit ordering with additional semantics attached.

Am I right to assume you plan to do that with some neural modelling, @mittagessen?

mittagessen commented 3 years ago

Am I right to assume you plan to do that with some neural modelling, @mittagessen?

When you have a hammer everything looks like a nail, so yes. I have some basic code for a graph NN orderer, but which features to actually use is quite an open question.

bertsky commented 2 years ago

Not worth a separate issue before merging: the current decoding of kraken.blla.vec_regions output seems to go wrong sometimes:

01:17:04.552 INFO processor.KrakenSegment - Finished segmentation, serializing
Traceback (most recent call last):
  File "/home/jaakko/ocr/Uusvenv/bin/ocrd-kraken-segment", line 33, in <module>
    sys.exit(load_entry_point('ocrd-kraken', 'console_scripts', 'ocrd-kraken-segment')())
  File "/home/jaakko/ocr/Uusvenv/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/jaakko/ocr/Uusvenv/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/jaakko/ocr/Uusvenv/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jaakko/ocr/Uusvenv/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/jaakko/ocr/ocrd_kraken/ocrd_kraken/cli/segment.py", line 8, in cli
    return ocrd_cli_wrap_processor(KrakenSegment, *args, **kwargs)
  File "/home/jaakko/ocr/Uusvenv/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 88, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/jaakko/ocr/Uusvenv/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 88, in run_processor
    processor.process()
  File "/home/jaakko/ocr/ocrd_kraken/ocrd_kraken/segment.py", line 89, in process
    for idx_region, region_poly in enumerate(res['regions']['text']):
KeyError: 'text'

(Reported by @helkejaa.)

mittagessen commented 2 years ago

What is happening is that it crashes anytime there isn't a region of type text (it also ignores all regions not of type text). The format of the res['regions'] dict is {'type_0': [reg_0, reg_1], 'type_1': [reg_1, reg_2, reg_3], ....}.
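
Decoding that shape defensively avoids the crash, e.g. (a sketch against the dict layout above, not the actual ocrd_kraken fix):

# 'res' stands for the kraken.blla.segment() result described above
res = {'regions': {'text': [[(0, 0), (100, 0), (100, 20)]],
                   'stamp': [[(10, 50), (60, 50), (60, 90)]]}}
# iterate over every detected type instead of assuming 'text' is present
for region_type, polys in res.get('regions', {}).items():
    for idx_region, region_poly in enumerate(polys):
        print(region_type, idx_region, region_poly)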

bertsky commented 2 years ago

@mittagessen,

What is happening is that it crashes anytime there isn't a region of type text (it also ignores all regions not of type text). The format of the res['regions'] dict is {'type_0': [reg_0, reg_1], 'type_1': [reg_1, reg_2, reg_3], ....}.

thanks. It would help to have an exhaustive list of possible region types so we can map them to PAGE-XML. You seem to have such a thing in your parse_page; for the opposite direction, it would help if you could expose kraken.lib.xml.page_regions itself.

mittagessen commented 2 years ago

The region types are free text fields defined during segmentation model training so it isn't really possible to automatically map them to something semantically meaningful without knowing the data source. In our serializer we just output everything as a TextRegion for this reason.

The mapping in parse_page is a fallback that is used if more specific classes are absent.

bertsky commented 2 years ago

The region types are free text fields defined during segmentation model training so it isn't really possible to automatically map them to something semantically meaningful without knowing the data source. In our serializer we just output everything as a TextRegion for this reason.

The mapping in parse_page is a fallback that is used if more specific classes are absent.

Understood. So for a generic decoder, I guess one would need the trained mapping passed in as a parameter (which the model provider must ship along), as in ocrd-detectron2-segment, right?
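
Such a parameter could be as simple as a type-to-PAGE mapping applied at decode time; a hypothetical sketch (the parameter name and defaults are made up for illustration):

# hypothetical mapping parameter shipped alongside a segmentation model
DEFAULT_MAPPING = {'text': 'TextRegion'}

def page_region_type(kraken_type, mapping=None):
    mapping = mapping or DEFAULT_MAPPING
    # unknown free-text classes fall back to TextRegion
    return mapping.get(kraken_type, 'TextRegion')

assert page_region_type('stamp', {'stamp': 'ImageRegion'}) == 'ImageRegion'
assert page_region_type('text') == 'TextRegion'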

bertsky commented 2 years ago

Oh, and what mapping have you trained your default blla.mlmodel with, btw? Is it safe to assume it will work with the parse_page fallback (as a default mapping)?

mittagessen commented 2 years ago

Yes, defining the semantics externally would be the way to deal with that.

The default model just has a default text region that does not have any deeper meaning as it is trained from a bunch of datasets without a shared taxonomy.

The mapping that parse_page produces is only for the training side, i.e. to create types de novo from element types without custom strings. They are not applied when running the segmenter or serializer.

bertsky commented 2 years ago

@kba I tried to address most of the concerns/issues we had above. There's much more we could do (see the added FIXMEs), but I think we should make a first release for now. The CI failures seem to be a permission problem.

bertsky commented 2 years ago

I had to revert the conditional binary input for the segmenter, because it looks like binarization always produces better results for blla.mlmodel. @mittagessen could you please comment?

mittagessen commented 2 years ago

Do you have an example of the material you're testing on? We don't really do binarization anymore as it breaks most degraded manuscripts so the model wasn't even evaluated (nor trained) on it. Would be good to see what exactly is happening.

bertsky commented 2 years ago

I had to revert the conditional binary input for the segmenter, because it looks like binarization always produces better results for blla.mlmodel. @mittagessen could you please comment?

Do you have an example of the material you're testing on? We don't really do binarization anymore as it breaks most degraded manuscripts so the model wasn't even evaluated (nor trained) on it. Would be good to see what exactly is happening.

Oh, in that case... I have compared with/without SBB binarization (using the latest model) on this material.

mittagessen commented 2 years ago

Huh, interesting. Those look fairly similar to the stuff in cBAD but if the binarization is excellent it could give a boost in accuracy. In any case, I wouldn't force inputs to be binarized but if there's a good one available and you get better results there's no reason not to use it.

bertsky commented 2 years ago

Huh, interesting. Those look fairly similar to the stuff in cBAD but if the binarization is excellent it could give a boost in accuracy. In any case, I wouldn't force inputs to be binarized but if there's a good one available and you get better results there's no reason not to use it.

In that case, we should make the choice dependent on the workflow – by passing an empty feature selector/filter when blla is used. I'll revert once again.
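
In code, the idea would look roughly like this (a sketch using OCR-D's image feature mechanism; the use_legacy flag is illustrative):

def get_segmenter_input(workspace, page, page_id, use_legacy=False):
    # request binarized input only for the legacy segmenter; an empty
    # selector for blla leaves the choice to the workflow
    features = 'binarized' if use_legacy else ''
    return workspace.image_from_page(page, page_id,
                                     feature_selector=features)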

@kba, if you could look into the CI permissions problem?

kba commented 2 years ago

@kba, if you could look into the CI permissions problem?

CI is working again; there was an issue with the deployment key, plus some minor typos and missing models.

bertsky commented 2 years ago

BTW, one thing we could also add is model URLs in the ocrd-tool.json, for segmentation and for recognition (especially with the new models from UB Mannheim). (We could even make the suffix .mlmodel disappear.)

bertsky commented 2 years ago

@kba I did all of the above and fixed the CI again (with a workaround for this new problem in core). Now ready for merging AFAICS.