abhikatoldtrafford opened 3 years ago
The problem is not the size of the image, but the size of the text. The network can at most handle as many characters per line as there are horizontal pixels in the image (+1 pixel for every repeated letter). Could it be the case that some of your line images are rotated by 90 degrees? This would explain why your images are too short for your texts. Another possibility could be that you have set a --text-normalization that breaks up diacritics or other signs that are written in combination with each other, resulting in a larger codec and too many chars per line for the network.
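The length check described above can be sketched as follows. This is an illustration, not Calamari's actual code: the subsampling factor of 4 and the one-extra-frame-per-repeated-letter rule are taken from this thread, and the exact bookkeeping inside Calamari may differ.

```python
import math

def fits_ctc(text: str, img_width_px: int, subsampling: int = 4) -> bool:
    """Rough check of whether a CTC network can emit `text` for a line image
    that is `img_width_px` pixels wide.

    The network produces roughly ceil(width / subsampling) output frames and
    needs one frame per character, plus one extra (blank) frame between every
    pair of identical adjacent characters.
    """
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    needed_frames = len(text) + repeats
    available_frames = math.ceil(img_width_px / subsampling)
    return needed_frames <= available_frames

# Example close to the one discussed below: a 637 px wide line at
# subsampling factor 4 gives ~160 frames, enough for an 86-character
# transcript (without long runs of repeated letters).
```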
@abhikatoldtrafford I would also assume (but please verify this) that the lines raising this warning are corrupt or not usable for training. If you are uncertain, please share some of your lines.
Hi @ChWick @andbue, my images are not rotated by 90 degrees. Just check one example: `joined_9_th_dec_5870_tilda_23_th_nov_701_23_th_nov_97.gt.txt`
This line seems fine. Is this one of the lines that are skipped? The ID in the error message indicates the file path. (Rescaling to a height of 48px yields a width of 637px. Subsampling (factor 4) yields 160 maximum characters, but the gt only comprises 86.)
@ChWick yes this is one of the images skipped. I copied the path from the warning and downloaded this image. I have lots of similar images which are getting skipped.
I think the inverted part is a problem here. The center normalizer fails to find a center line and ends up padding large amounts of white at the top and at the bottom. After scaling to 48px, it ends up at less than 350px width. Commenting out the dewarping part of the CenterNormalizer would avoid this. The cleanest way to handle this kind of line would probably be to apply some preprocessing that transforms partially inverted text correctly.
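A minimal sketch of such a polarity-normalizing preprocessing step, assuming grayscale line images with values in [0, 1] and that inverted regions span whole pixel columns. This is a crude assumption for illustration only; handling real lines would need proper segmentation:

```python
import numpy as np

def normalize_polarity(line: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Make a grayscale line image (0 = black, 1 = white) uniformly
    dark-on-light by inverting columns whose background appears dark.

    The column-wise mean brightness is a very crude inversion detector:
    text covers far fewer pixels than background, so a dark column mean
    suggests an inverted (light-on-dark) region.
    """
    out = line.copy()
    col_mean = out.mean(axis=0)        # mean brightness per pixel column
    inverted_cols = col_mean < thresh  # dark columns => likely inverted text
    out[:, inverted_cols] = 1.0 - out[:, inverted_cols]
    return out
```

After such a step, the CenterNormalizer would again see black-on-white text everywhere and should find the center line as usual.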
@andbue the inverted image is part of the augmentation I use. I have lots of images which I combine to create these kinds of images. You can tell that from the filename.
@andbue where should I comment the dewarping part out?
This was my current guess, too. Thanks for verifying this @andbue. The preprocessing fails. For now you could simply skip the dewarper (since your lines are flat anyway):
add the `--data_preprocessing DataRangeNormalizer ScaleToHeightProcessor FinalPreparation` parameter (by default it's `--data_preprocessing DataRangeNormalizer CenterNormalizer FinalPreparation`), or disable inverting as augmentation.
Should I do `calamari-train ... --data_preprocessing DataRangeNormalizer ScaleToHeightProcessor FinalPreparation ...`?
Correct
@abhikatoldtrafford Is it working?
@ChWick I was getting this error: `'DataParams' object has no attribute 'line_height'`, which was previously not present. For now I edited `scale_to_height_processor.py`, changing `self.height = self.params.line_height` to `self.height = 48`.
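For reference, what a scale-to-height step with a hard-coded target height amounts to can be sketched as below. This is a standalone nearest-neighbour stand-in, not Calamari's actual `ScaleToHeightProcessor`, which uses proper interpolation:

```python
import numpy as np

def scale_to_height(img: np.ndarray, height: int = 48) -> np.ndarray:
    """Nearest-neighbour rescale of a 2-D grayscale line image to a fixed
    height, preserving the aspect ratio (so the width scales by the same
    factor as the height).
    """
    h, w = img.shape
    scale = height / h
    new_w = max(1, int(round(w * scale)))
    # Map each output row/column back to its nearest source row/column.
    rows = np.clip((np.arange(height) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / scale).astype(int), 0, w - 1)
    return img[np.ix_(rows, cols)]
```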
I'm not getting the "Skipping line" warning anymore after disabling the CenterNormalizer, thanks!
Thanks for pointing out this bug!
@ChWick @andbue I tried center normalization using the code from Calamari-OCR/calamari/blob/master/calamari_ocr/ocr/dataset/imageprocessors/center_normalizer.py for the image. Here is what I found:
a) dewarped, b) normalized (images attached). As @andbue said, it ends up with large padding. So, is there any fix for the large amount of padding? I need help! Thanks
@abhikatoldtrafford The CenterNormalizer relies on a white background and black foreground. It is hard to redesign the processing to also deal with mixed lines. Therefore, you have to stick with the ScaleToHeight processor for your case, or drop these kinds of augmentations. But based on your lines, the scale-to-height processor should work just fine, since dewarping should not be required.
You could also try to set the first value of the "extra_parameters" argument in CenterNormalizer to some smaller value than the default 4. When there are heavily warped lines, this could lead to some parts being cut off, but in your case it should be fine.
I'm not sure, however, that the partly inverted images don't impair the overall performance of the model. If you have any results on that, it would be interesting if you could share your evaluation results with and without inverting here!
@andbue thanks, a value of 2 does fix the issue somewhat. However, I don't know why it should degrade the model performance. Is there any particular reason behind your intuition?
I guess the model would have to either learn two versions of every character (normal and inverted), learn to invert characters, or come up with some efficient kind of edge detection. When your data includes a lot of lines that look like this, the question is whether it produces more errors to force the OCR model to cope with these problems, or to split and invert the lines before feeding them into calamari. But I'm just guessing here; I'm really interested in your empirical results!
@andbue I have changed "extra_parameters" argument in CenterNormalizer from 4 to 2. I have restarted training with this value. Now, during prediction, do I need to change anything? Does CenterNormalizer get called in prediction time as well?
Also, @ChWick, I used the `--data_preprocessing DataRangeNormalizer ScaleToHeightProcessor FinalPreparation` parameter for training, and that model does not give good performance. Do I have to give the same data preprocessing command to calamari-predict at prediction time?
If your prediction dataset contains inverted lines, you should also use the modified CenterNormalizer for that. Otherwise your special lines would be skipped here as well.
The option `--data_preprocessing` currently does not exist in the predict script. The model automatically loads the same preprocessors that were used for training.
Hi @ChWick @andbue, while training calamari-ocr I am consistently getting the warning "Skipping line with longer outputs than inputs" for larger images with more width. Is there any way to fix this problem? I want them to be included in the training dataset, not just suppress the warning. Thanks