Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
GNU General Public License v3.0

Exchanging preprocessor #297

Open twerkmeister opened 2 years ago

twerkmeister commented 2 years ago

Hi @ChWick @andbue!

Thanks for this amazing project. I am using calamari as part of a data extraction task for tables in mid-20th-century documents. Specifically, I run calamari on the (single-line) cells of the tables. I have had a really satisfying experience so far and made good progress training and tuning on my growing dataset (12500 cell images by now). I've been digging a bit deeper into the documentation and code, and one of the experiments I want to try is turning off the center normalizer preprocessor, as it seems to be doing a few things that I am not sure are necessary or helpful for my data. First, image quality is already low at times, so additional blurring might be hurting. Second, my lines aren't skewed: I deal with skewing before overlaying the table grid onto the page.

I saw this parameter in the docs:

  --data.pre_proc.processors [DATA.PRE_PROC.PROCESSORS [DATA.PRE_PROC.PROCESSORS ...]]

and also this code in another issue:

>>> from calamari_ocr.ocr.dataset.data import Data
>>> params = Data.default_params()
>>> list(enumerate(params.pre_proc.processors))
[
(0, CenterNormalizerProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, extra_params=(4, 1.0, 0.3), line_height=-1)), 
(1, FinalPreparationProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, normalize=True, invert=True, transpose=True, pad=16, pad_value=0)), 
(2, BidiTextProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.TARGETS: 'targets'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, bidi_direction=<BidiDirection.AUTO: 'auto'>)), 
(3, StripTextProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.TARGETS: 'targets'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>})), 
(4, TextNormalizerProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.TARGETS: 'targets'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, unicode_normalization='NFC')), 
(5, TextRegularizerProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.TARGETS: 'targets'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, replacement_groups=[<ReplacementGroup.Spaces: 'spaces'>], replacements=None)), 
(6, AugmentationProcessorParams(modes={<PipelineMode.TRAINING: 'training'>}, augmenter=DefaultDataAugmenterParams(), n_augmentations=0)), 
(7, PrepareSampleProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}))
]

What I would like to try is to simply exchange the center normalizer preprocessor for the basic scale-to-height preprocessor, but I am not sure how to achieve this. Do I need to define all the preprocessors and their parameters using --data.pre_proc.processors? Would you mind giving me an idea of how to reference the classes and their parameters properly? And does my reasoning for exchanging the preprocessor have some merit?

Best, Thomas

andbue commented 2 years ago

Hi Thomas, it is not possible to swap in another processor on its own: you have to pass the full list of processors, giving the module path and ProcessorParams class for each of them. It's quite a mouthful:

--data.pre_proc.processors \
calamari_ocr.ocr.dataset.imageprocessors.scale_to_height_processor:ScaleToHeightProcessorParams \
calamari_ocr.ocr.dataset.imageprocessors.final_preparation:FinalPreparationProcessorParams \
calamari_ocr.ocr.dataset.textprocessors:BidiTextProcessorParams \
calamari_ocr.ocr.dataset.textprocessors:StripTextProcessorParams \
calamari_ocr.ocr.dataset.textprocessors:TextNormalizerProcessorParams \
calamari_ocr.ocr.dataset.textprocessors:TextRegularizerProcessorParams \
calamari_ocr.ocr.dataset.imageprocessors:AugmentationProcessorParams \
calamari_ocr.ocr.dataset.imageprocessors:PrepareSampleProcessorParams
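
If the call is scripted, the same list can be assembled in Python rather than typed by hand. The following is only a minimal sketch and not part of calamari: `PROCESSORS` and `processor_args` are hypothetical names, the specs are copied from the command above, and the check only looks at the `module.path:ClassName` shape, not at whether the class can actually be imported.

```python
import shlex

# Processor specs copied from the command above, in pipeline order.
# The first entry is the scale-to-height processor that takes the place
# of the default center normalizer.
PROCESSORS = [
    "calamari_ocr.ocr.dataset.imageprocessors.scale_to_height_processor:ScaleToHeightProcessorParams",
    "calamari_ocr.ocr.dataset.imageprocessors.final_preparation:FinalPreparationProcessorParams",
    "calamari_ocr.ocr.dataset.textprocessors:BidiTextProcessorParams",
    "calamari_ocr.ocr.dataset.textprocessors:StripTextProcessorParams",
    "calamari_ocr.ocr.dataset.textprocessors:TextNormalizerProcessorParams",
    "calamari_ocr.ocr.dataset.textprocessors:TextRegularizerProcessorParams",
    "calamari_ocr.ocr.dataset.imageprocessors:AugmentationProcessorParams",
    "calamari_ocr.ocr.dataset.imageprocessors:PrepareSampleProcessorParams",
]


def processor_args(specs):
    """Return the argv fragment for --data.pre_proc.processors (hypothetical helper)."""
    for spec in specs:
        module, sep, cls = spec.partition(":")
        # Only a shape check: 'module.path:SomethingProcessorParams'.
        if not (module and sep and cls.endswith("ProcessorParams")):
            raise ValueError(f"expected 'module.path:SomeProcessorParams', got {spec!r}")
    return ["--data.pre_proc.processors", *specs]


if __name__ == "__main__":
    # Print a shell-quoted fragment to append to the training command.
    print(shlex.join(processor_args(PROCESSORS)))
```

The returned list can be appended to the argument list of whatever training call is already in use; the order of the entries matters because it is the order in which the processors run.
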

twerkmeister commented 2 years ago

Thank you so much, @andbue! I think that's the starting point I needed 👍 Will try it out tomorrow.

bertsky commented 1 month ago

This should be way easier to find. The CLI (tfaip's subcommand self-documentation) is not helpful.

Also, this should be the default if passing --train.channels 3...