Warning: Skipping line with longer outputs than inputs

abhikatoldtrafford commented 3 years ago

Hi @ChWick @andbue, while training calamari-ocr I consistently get 'Warning: Skipping line with longer outputs than inputs' for wider images. Is there any way to fix this problem? I want these lines to be included in the training dataset, not just to suppress the warning. Thanks

andbue commented 3 years ago

The problem is not the size of the image, but the size of the text. The network can at most handle as many characters per line as there are horizontal pixels in the image (+1 pixel for every repeated letter). Could it be the case that some of your line images are rotated by 90 degrees? This would explain why your images are too short for your texts. Another possibility could be that you have set a --text-normalization that breaks up diacritics or other signs that are written in combination with each other, resulting in a larger codec and too many chars per line for the network.
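To make the length constraint concrete, here is a minimal sketch in plain Python (not Calamari code; the function names are made up for illustration). It assumes the usual CTC rule that each character needs one output position and each pair of identical neighbouring characters needs one extra position in between.

```python
def min_positions_needed(text):
    """Smallest number of output positions that can encode `text`:
    one per character, plus one separator for every pair of identical
    neighbouring characters (e.g. the "ll" in "hello")."""
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    return len(text) + repeats


def line_fits(text, available_positions):
    """True if the text is short enough for the given number of positions."""
    return min_positions_needed(text) <= available_positions


print(min_positions_needed("hello"))  # 6
print(line_fits("hello", 5))          # False
```

A line for which `line_fits` returns False is the kind of line the warning is about.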

ChWick commented 3 years ago

@abhikatoldtrafford I would also assume (but please verify this) that the lines that raise this warning are corrupt or not usable for training. If you are uncertain, please share some of your lines.

abhikatoldtrafford commented 3 years ago

Hi @ChWick @andbue, my images are not rotated by 90 degrees. Just check one example: joined_9_th_dec_5870_tilda_23_th_nov_701_23_th_nov_97.gt.txt

ChWick commented 3 years ago

This line seems fine. Is this one of the lines that are skipped? The ID of the error message indicates the file path. (Rescaling to a height of 48px yields a width of 637px. Subsampling (factor 4) yields 160 maximum characters, but the gt only comprises 86.)
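The arithmetic in the parentheses can be written out as a short sketch. The function name is hypothetical, and rounding up is an assumption chosen because it reproduces the figure of 160; the actual rounding inside the network may differ.

```python
import math


def max_output_positions(width_px, subsampling=4):
    """Upper bound on characters per line after horizontal subsampling.
    Rounding up is an assumption, not taken from the Calamari source."""
    return math.ceil(width_px / subsampling)


positions = max_output_positions(637)
print(positions)        # 160
print(86 <= positions)  # True, so this line should not be skipped
```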

abhikatoldtrafford commented 3 years ago

@ChWick yes this is one of the images skipped. I copied the path from the warning and downloaded this image. I have lots of similar images which are getting skipped.

andbue commented 3 years ago

I think the inverted part is a problem here. The center normalizer fails to find a center line and ends up padding large amounts of white at the top and at the bottom. After scaling to 48px, it ends up at less than 350px width. Commenting out the dewarping part of the CenterNormalizer would avoid this. The cleanest way to handle this kind of line would probably be to apply some preprocessing that transforms partially inverted text correctly.
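A rough sketch of why the padding matters, with made-up numbers: the 20px of padding at the top and bottom is a hypothetical value picked for illustration, not measured from the image, and the function name is not part of Calamari.

```python
import math


def width_after_padding(width, height, pad_top, pad_bottom, target_height=48):
    """Width of a line after vertical padding is added and the result is
    scaled back down to target_height with the aspect ratio preserved."""
    padded_height = height + pad_top + pad_bottom
    return round(width * target_height / padded_height)


# A 637x48 line with 20px of white added above and below (hypothetical).
width = width_after_padding(637, 48, pad_top=20, pad_bottom=20)
positions = math.ceil(width / 4)  # subsampling factor 4, rounding assumed
print(width, positions)  # 347 87
```

With these numbers only 87 positions remain for a ground truth of 86 characters, which leaves almost no room for the extra positions that repeated letters need.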

abhikatoldtrafford commented 3 years ago

@andbue the inverted image is part of the augmentation I use. I have lots of images which I join to create these kinds of images. You can see that from the filename.

abhikatoldtrafford commented 3 years ago

@andbue where should I comment out the dewarping part?

ChWick commented 3 years ago

This was my current guess, too. Thanks for verifying this @andbue. The preproc fails. For now you could simply try to skip the dewarper (since your lines are flat anyway): add the --data_preprocessing DataRangeNormalizer ScaleToHeightProcessor FinalPreparation parameter (by default it's --data_preprocessing DataRangeNormalizer CenterNormalizer FinalPreparation), or disable inverting as augmentation.

abhikatoldtrafford commented 3 years ago

Should I do calamari-train...--data_preprocessing DataRangeNormalizer ScaleToHeightProcessor FinalPreparation...?

ChWick commented 3 years ago

Correct

ChWick commented 3 years ago

@abhikatoldtrafford Is it working?

abhikatoldtrafford commented 3 years ago

@ChWick I was getting this error: 'DataParams' object has no attribute 'line_height', which was not present previously. For now I edited scale_to_height_processor.py, changing self.height=self.param.line_height to self.height=48.

I am no longer getting the 'Skipping line' warning after disabling CenterNormalizer, thanks!
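A slightly less invasive variant of this workaround might fall back to 48 only when the attribute is missing. The sketch below uses stand-in classes that are purely hypothetical; it does not reproduce the actual Calamari classes and is not the fix that was applied upstream.

```python
class ParamsWithoutHeight:
    """Hypothetical stand-in for a params object lacking line_height."""


class ParamsWithHeight:
    """Hypothetical stand-in for a params object that defines line_height."""
    line_height = 64


def resolve_height(params, default=48):
    """Use params.line_height when it exists, otherwise the default."""
    return getattr(params, "line_height", default)


print(resolve_height(ParamsWithoutHeight()))  # 48
print(resolve_height(ParamsWithHeight()))     # 64
```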

ChWick commented 3 years ago

Thanks for pointing out this bug!

abhikatoldtrafford commented 3 years ago

@ChWick @andbue I tried center normalization using the code from Calamari-OCR/calamari/blob/master/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py for the image. Here is what I found:

[Images: a) dewarped, b) normalized]

As @andbue said, the result has large padding. So, is there any fix for this large amount of padding? I need help! Thanks

ChWick commented 3 years ago

@abhikatoldtrafford The CenterNormalizer relies on a white background and a black foreground. It is hard to redesign the processing to also deal with mixed lines. Therefore, you have to stick with the ScaleToHeight processor for your case, or drop these kinds of augmentations. But based on your lines, the scale to height processor should work just fine, since dewarping should not be required.

andbue commented 3 years ago

You could also try to set the first value of the "extra_parameters" argument in CenterNormalizer to some smaller value than the default 4. When there are heavily warped lines, this could lead to some parts being cut off, but in your case it should be fine.

I'm not sure, however, that the partly inverted images do not impair the overall performance of the model. If you have any results on that, it would be interesting if you could share your evaluation results with and without inverting with us here!

abhikatoldtrafford commented 3 years ago

@andbue thanks, a value of 2 does fix the issue a bit. However, I don't know why it should degrade the model performance. Is there any particular reason behind your intuition?

andbue commented 3 years ago

I guess the model would have to either learn two versions for every character (normal and inverted) or learn to invert characters or come up with some efficient kind of edge detection. When your data includes a lot of lines that look like this the question would be if it produces more errors to force the OCR model to cope with these problems or to split and invert the lines before throwing them into calamari. But I'm just guessing here, I'm really interested in your empirical results!
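As a very rough illustration of the "invert before calamari" idea, the sketch below flips every pixel column whose median value is dark. This is a toy heuristic in plain Python, not an existing Calamari preprocessor; it assumes grayscale values in 0..255 and that background pixels outnumber ink pixels in every column, which real scans will often violate without smoothing over neighbouring columns.

```python
from statistics import median


def fix_partial_inversion(img, threshold=128):
    """Return a copy of `img` (a list of rows of 0..255 values) in which
    every column with a dark median is inverted to dark-on-light."""
    height = len(img)
    width = len(img[0]) if height else 0
    out = [row[:] for row in img]
    for x in range(width):
        column = [img[y][x] for y in range(height)]
        if median(column) < threshold:  # dark background: flip the column
            for y in range(height):
                out[y][x] = 255 - img[y][x]
    return out


# Toy line: the left two columns are normal, the right two are inverted.
line = [
    [255, 255, 0, 0],
    [255, 0, 0, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
fixed = fix_partial_inversion(line)
print(fixed[1])  # [255, 0, 255, 0]
```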

abhikatoldtrafford commented 3 years ago

@andbue I have changed the first value of the "extra_parameters" argument in CenterNormalizer from 4 to 2 and restarted training with this value. Now, during prediction, do I need to change anything? Does CenterNormalizer get called at prediction time as well?

Also, @ChWick, I used the --data_preprocessing DataRangeNormalizer ScaleToHeightProcessor FinalPreparation parameter for training, and that model does not give good performance. Do I have to give the same data preprocessing option to calamari-predict at prediction time?

andbue commented 3 years ago

If your prediction dataset contains inverted lines, you should also use the modified CenterNormalizer for that. Otherwise your special lines would be skipped here as well.

The option --data_preprocessing currently does not exist in the predict script. The model automatically loads the same preprocessors that have been used for training.

Calamari-OCR / calamari

Warning: Skipping line with longer outputs than inputs #206