baudm / parseq

Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
https://huggingface.co/spaces/baudm/PARSeq-OCR
Apache License 2.0

Advice for two line recognition? #14

Open siddagra opened 2 years ago

siddagra commented 2 years ago

Just a request, if you perhaps have any advice on how to improve performance on two-line recognition. I am working on a dataset of only 2000 images; it has some single-line and some double-line license plates.

It is doing relatively fine on single-line plates but makes many errors on the double-line ones.

A few examples of errors on clear double-line plates:

It also messes up on a few three-line cases:

I also pretrained the model on 22K synthetic images with 12 fonts (40 total fonts if variations such as bold, italic, and light are included), both double- and single-line, with distortions added to the text, but this actually reduced overall performance once I transferred to the real dataset of 2K images.

Most of the errors seem to be the model repeating characters when they should not be repeated, usually at the end of the first line or the start of the second line. Very rarely does it actually get the recognition itself wrong. This makes me think it is some sort of alignment or positional encoding issue, but I would greatly appreciate your expert opinion on how to alleviate such issues.

As the self-attention of the transformer is 2D, it should have no issue with double line recognition.

I am also perplexed because the license plate format can only allow a maximum of 4 digits at the end (from 1 to 9999), yet it often messes up by predicting 5 digits instead of a maximum of 4, even though I thought that the permuted language model should be more effective at modeling such a format.

I am thinking: could the reduced resolution in the height dimension, or the lower dimension of the patch embeddings, be the culprit?

I was wondering if there is any simple/elegant solution to this without having to pretrain the whole model on SynthText, MJSynth, OpenVIVO text, etc., but if that is not possible, I guess I am ready to do that too. I am thinking of resizing the height dimension and just stacking two different images on top of each other while concatenating their labels, but I assume such changes would make the provided pretrained backbone unusable, and therefore I would need to train the whole model from scratch again.

Please help me with any relevant advice if possible.

bmusq commented 2 years ago

I have encountered the same kind of problems with fine-tuning.

I have not tried multiple lines, but it also seems the model has trouble dealing with very short text, like one or two characters. This could maybe explain why it fails on the middle line of your plate, @siddagra, since it is only two characters long.

baudm commented 2 years ago

@siddagra

It is doing relatively fine on single-line plates but makes many errors on the double-line ones.

STR as a task is generally limited to single-line text recognition. If you want to recognize multiple lines of text, you may use a text detector upstream of the STR model to localize each line first. Or you can look at recent text-spotting methods such as SwinTextSpotter or TESTR.

I also pretrained the model on 22K synthetic images with 12 fonts (40 total fonts if variations such as bold, italic, and light are included), both double- and single-line, with distortions added to the text, but this actually reduced overall performance once I transferred to the real dataset of 2K images.

As shown in our paper and prior work, training on real data is more sample-efficient than training on synthetic data. Intuitively this makes sense: the closer the train and test data distributions are, the lower the expected generalization error. Moreover, your 22k samples are far too few if you're training a model from scratch.

As the self-attention of the transformer is 2D, it should have no issue with double line recognition.

If you inspect the STR training data, you'll see that most (if not all) examples are single lines of text. Thus, a model trained on such data should not be expected to handle multi-line cases. STR is the tail-end of a full recognition pipeline and is expected to read localized and cropped single lines of text only.

I am also perplexed because the license plate format can only allow a maximum of 4 digits at the end (from 1 to 9999), yet it often messes up by predicting 5 digits instead of a maximum of 4, even though I thought that the permuted language model should be more effective at modeling such a format.

If you already have a hard prior on the output, you should enforce it on the model instead of expecting it to magically "just work." Use 4 position tokens instead of using all 26 and relying on [EOS] to mark the end of the sequence.

I was wondering if there is any simple/elegant solution to this without having to pretrain the whole model on SynthText, MJSynth, OpenVIVO text, etc., but if that is not possible, I guess I am ready to do that too. I am thinking of resizing the height dimension and just stacking two different images on top of each other while concatenating their labels, but I assume such changes would make the provided pretrained backbone unusable, and therefore I would need to train the whole model from scratch again.

Since your test images are practically horizontal and consist of two lines, just split the image into two halves, upper and lower, and perform two separate recognition passes on the two crops.
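
For example, a rough sketch of that idea (this just combines the torch.hub usage from the README with a naive split at mid-height; the even split is an assumption you may need to tune per plate):

import torch
from PIL import Image
from strhub.data.module import SceneTextDataModule

# Pretrained model + preprocessing transform, loaded the same way as in the README.
parseq = torch.hub.load('baudm/parseq', 'parseq', pretrained=True).eval()
img_transform = SceneTextDataModule.get_transform(parseq.hparams.img_size)

def read_two_line_plate(path):
    img = Image.open(path).convert('RGB')
    w, h = img.size
    # Assumption: both lines occupy roughly half the plate height.
    halves = [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    texts = []
    for crop in halves:
        with torch.no_grad():
            logits = parseq(img_transform(crop).unsqueeze(0))
        labels, _ = parseq.tokenizer.decode(logits.softmax(-1))
        texts.append(labels[0])
    return ''.join(texts)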

siddagra commented 2 years ago

First and foremost, I want to deeply thank both of you for all the help and for the model itself. I am very grateful; you have been of great help!

STR as a task is generally limited to single-line text recognition. If you want to recognize multiple lines of text, you may use a text detector upstream of the STR model to localize each line first. Or you can look at recent text-spotting methods such as SwinTextSpotter or TESTR.

I did not want to do this as it adds inference time, most text spotters are somewhat inaccurate on my dataset, and we do not have line-level annotations to train text detection models. Though perhaps I can use SwinTextSpotter, since it seems to also perform some form of recognition.

As shown in our paper and prior work, training on real data is more sample-efficient than training on synthetic data. Intuitively this makes sense: the closer the train and test data distributions are, the lower the expected generalization error. Moreover, your 22k samples are far too few if you're training a model from scratch.

I used the pretrained model, then further trained it on synthetic data in my format, and then on real data in my format.

If you inspect the STR training data, you'll see that most (if not all) examples are single lines of text. Thus, a model trained on such data should not be expected to handle multi-line cases. STR is the tail-end of a full recognition pipeline and is expected to read localized and cropped single lines of text only.

Yeah, makes sense. I was just hoping that fine-tuning on a two-line dataset would help, but I guess not. Thanks.

If you already have a hard prior on the output, you should enforce it on the model instead of expecting it to magically "just work." Use 4 position tokens instead of using all 26 and relying on [EOS] to mark the end of the sequence.

I don't know if this would change anything, as the model ends up with 5 digits because it often repeats the first digit. What I was saying is that it seems confusing that the model is not able to model the fact that there should only be 4 digits and implicitly correct for such repetition errors.

e.g., ground truth: GJ32AU9807, predicted: GJ32AU99807, which would just become GJ32AU9980, and that is still wrong.

Since your test images are practically horizontal and consist of two lines, just split the image into two halves, upper and lower, and perform two separate recognition passes on the two crops.

I will try that, but often, either due to the two lines having different heights, rotation (one of the examples has a rotated plate, which will likely cause crops with a few letters missing), or license plate detection inaccuracy, this may not work well and may introduce a different kind of issue: crop misalignment.

I guess my only options would be to train this from scratch on stitched data (take two real words, stack them on top of one another, and concatenate the labels), to run SwinTextSpotter, or to revert to some other models that seemed to deal with this two-line data slightly better for whatever reason.

Thanks a lot for all the help and advice. If you have any further advice, please do tell me. It is indeed very helpful and I am very grateful to you.

Also sorry if I was bothersome 😅

baudm commented 2 years ago

@siddagra

I did not want to do this as it adds inference time, most text spotters are somewhat inaccurate on my dataset, and we do not have line-level annotations to train text detection models. Though perhaps I can use SwinTextSpotter, since it seems to also perform some form of recognition.

Actually, if you take a second look at your inputs and the corresponding outputs, the model consistently makes an error at the 5th character. You could use this to your advantage. The model does not expect two-line inputs, and it has no character to represent line breaks, so it tries to make sense of the line break by repeating the last character of the first line or the first character of the second line. If the majority of your license plates contain 8 characters, just remove the 5th character if the output length is greater than 8.
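
A minimal sketch of that heuristic (the expected length of 8 and the break position are assumptions you'd adjust to your plate format):

def fix_repeated_break_char(pred: str, expected_len: int = 8, break_pos: int = 4) -> str:
    # Assumption: when the output is one character too long, the extra character is a
    # duplicate around the line break (end of line 1 / start of line 2).
    if len(pred) == expected_len + 1 and pred[break_pos] == pred[break_pos - 1]:
        return pred[:break_pos] + pred[break_pos + 1:]
    return pred

# e.g. fix_repeated_break_char('GJ32AU99807', expected_len=10, break_pos=7) -> 'GJ32AU9807'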

It gets trickier for three-line inputs mainly because of the vertical compression (input image height is only 32px).

Yeah, makes sense. I was just hoping that fine-tuning on a two-line dataset would help, but I guess not. Thanks.

If you want to go this route, try adding a character to represent a line break, e.g. [N], then modify the Tokenizer to replace \n with [N]. That way, you can incorporate this inductive bias into the model.
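
A rough sketch of what that could look like at the label level (this maps the line break to a single reserved character so it fits the character-level charset, rather than patching the Tokenizer itself to handle a multi-character [N] token; '|' is an arbitrary placeholder):

LINE_BREAK = '|'  # any symbol that never appears on your plates

def encode_label(label: str) -> str:
    # Replace real newlines with the reserved character before writing the gt labels.
    return label.replace('\n', LINE_BREAK)

def decode_label(pred: str) -> str:
    # Map it back after inference.
    return pred.replace(LINE_BREAK, '\n')

# The training/validation charset then needs to include LINE_BREAK, e.g.
# charset = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' + LINE_BREAK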

I guess my only options would be to train this from scratch on stitched data (take two real words, stack them on top of one another, and concatenate the labels), to run SwinTextSpotter, or to revert to some other models that seemed to deal with this two-line data slightly better for whatever reason.

The main issue with multi-line inputs is that with the 32px image height, the vertical resolution won't be enough. I uploaded the weights for PARSeq-S which uses 224x224 px images and 16x16 px patch size. You can use this to initialize the encoder if you want to finetune on 224x224 px images.

Thanks a lot for all the help and advice. If you have any further advice, please do tell me. It is indeed very helpful and I am very grateful to you.

You're welcome.

siddagra commented 2 years ago

The main issue with multi-line inputs is that with the 32px image height, the vertical resolution won't be enough. I uploaded the weights for PARSeq-S which uses 224x224 px images and 16x16 px patch size. You can use this to initialize the encoder if you want to finetune on 224x224 px images.

OMG, thanks a LOT! I also thought the vertical resolution would be the issue, as per the first post in this issue. I did not know there was a 224x224 px pretrained model as well! I am hoping it will give me good performance!

If you want to go this route, try adding a character to represent a line break, e.g. [N], then modify the Tokenizer to replace \n with [N]. That way, you can incorporate this inductive bias into the model.

This sounds like a brilliant idea. Will surely try this as well.

Though I think I will need to relabel a lot of data for this. I will see; maybe I can use SwinTextSpotter to get good enough labels and then just refine those.

Actually, if you take a second look at your inputs and the corresponding outputs, the model consistently makes an error at the 5th character. You could use this to your advantage. The model does not expect two-line inputs, and it has no character to represent line breaks, so it tries to make sense of the line break by repeating the last character of the first line or the first character of the second line. If the majority of your license plates contain 8 characters, just remove the 5th character if the output length is greater than 8.

The number of characters and the line-break position may change from plate to plate, at least slightly; it is usually around that region, though, so perhaps I can add some rule similar to this, especially for the 5-digit cases. It was also having issues where it would sometimes repeat the last character. Your newline idea seems to be the best IMO, but I will tinker around and see what works. Thanks.

siddagra commented 2 years ago

Hey, I am unable to load https://github.com/baudm/parseq/releases/download/v1.0.0/parseq_small_patch16_224-fcf06f5a.pt (the weights for 224x224) as the config file and/or state dict are not available in the checkpoint. Any way you could share the config file or the .ckpt?

I am getting errors about mismatched configs or a missing state_dict when I try to use the ckpt_path= argument. I changed the image resolution to [224, 224] in main.yaml, but I guess more hyperparameters are also different, so please help me with this if possible.

baudm commented 2 years ago

@siddagra

Hey, I am unable to load https://github.com/baudm/parseq/releases/download/v1.0.0/parseq_small_patch16_224-fcf06f5a.pt (the weights for 224x224) as the config file and/or state dict are not available in the checkpoint. Any way you could share the config file or the .ckpt?

Check the referenced commit. You can load the weights easily like so:

import torch
from strhub.models.utils import create_model

parseq = create_model('parseq-patch16-224')
parseq.load_state_dict(torch.load('parseq_small_patch16_224-fcf06f5a.pt', map_location='cpu'))

To finetune, just use ./train.py +experiment=parseq-patch16-224 and manually load the weights inside train.py.
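
For reference, the manual load inside train.py could look roughly like this (a sketch, not the exact repo code; model stands for whatever object train.py instantiates, and the shape filter is only needed if your charset, and hence the head size, differs from the checkpoint):

import torch

ckpt = torch.load('parseq_small_patch16_224-fcf06f5a.pt', map_location='cpu')
# Keep only tensors whose shapes still match the current model (a different charset
# changes the head/embedding sizes), then load non-strictly so the ViT encoder
# weights are still reused.
current = model.state_dict()
filtered = {k: v for k, v in ckpt.items() if k in current and v.shape == current[k].shape}
model.load_state_dict(filtered, strict=False)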

siddagra commented 2 years ago

Actually, fine-tuning the pretrained model on synthetic data and then fine-tuning further on real data did improve accuracy by 6% under the same training setup (compared to just pretrained + real). Earlier, it was not actually loading the synthetic weights.

Able to get a slightly better accuracy with synthetic + real data, though overall accuracy does not seem to change too much from the base model (32 x 128).

I am confused as to what exactly StochasticWeightAveraging does and what an optimal swa_lrs value would be. It seems to increase validation performance quite a bit, but the training loss increases sharply. Should I increase the number of epochs and reduce swa_lrs to 0.00001, while keeping num_iters the same, so that it can reach a low loss value with SWA?

baudm commented 2 years ago

This is a good primer on SWA: https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/

SWA improves generalization. I decided to keep it in the training pipeline since it significantly improved CRNN performance (the difference is less apparent in the other models). Another side effect is that model selection becomes simpler (you always choose last.ckpt) since weights from various points during training are averaged.
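
If it helps, this is roughly how the callback is wired up in a PyTorch Lightning setup (a sketch; the swa_lrs and swa_epoch_start values below are illustrative, not the defaults used by this repo):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import StochasticWeightAveraging

# Averaging kicks in at 75% of training; swa_lrs is the (usually small) constant LR
# used while the averaged weights are collected. The numbers are placeholders.
swa = StochasticWeightAveraging(swa_lrs=1e-5, swa_epoch_start=0.75)
trainer = Trainer(max_epochs=10, callbacks=[swa])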

siddagra commented 2 years ago

Thanks a lot! I was able to increase my accuracy by about 10% (from 80% to 90%) on the validation set by stacking two random images on top of each other and concatenating their labels, using the large real text datasets available (TextOCR and OpenVIVO) as well as about 130k synthetic training examples from various sources, and fine-tuning the pretrained weights for 2 epochs.

Then I further fine-tuned these weights onto my original real license plate dataset (~2k images) for 10 more epochs and was able to get a 90% accuracy!

I was wondering if it is possible to have a weighted sampling strategy based on LMDB database size for pretraining, since OpenVIVO and TextOCR completely overpower my other datasets in size and therefore take priority. The smaller synthetic/real license plate datasets end up with much higher loss, so convergence on the validation set ends up being very slow and also seems to affect final accuracy.

Also, is there any way I could access labels from two separate LMDBs and concatenate those? According to another research paper (on contrastive captions), this can also increase accuracy significantly.

baudm commented 2 years ago

In my early experiments, I created and used a Sampler subclass which tried to cleanly implement the "batch-balanced" sampling of Baek et al. (clovaai/deep-text-recognition-benchmark): https://gist.github.com/baudm/fa08974319150c65caa96d6062b76aa9

This is how I used it before:

from torch.utils.data import ConcatDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# hierarchical_dataset comes from the deep-text-recognition-benchmark code;
# BatchBalancedSampler is from the gist linked above.
samplers = []
datasets = []
for d in ['MJ', 'ST']:
    dataset = hierarchical_dataset(root, self.hparams, select_data=d, transform=transform)[0]
    datasets.append(dataset)
    samplers.append(DistributedSampler(dataset))
r = [1., 1.]  # sampling weights: 50-50 split between MJ and ST
dataset = ConcatDataset(datasets)
sampler = BatchBalancedSampler(samplers, r, self.hparams.batch_size, False)
dataloader = DataLoader(dataset, num_workers=4, pin_memory=True, batch_sampler=sampler)

r contains the weights used for sampling: [1, 1] means a 50-50 split, i.e. 50% of each batch comes from MJ while the other 50% comes from ST.
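
In your case you could weight the small real license plate set more heavily, e.g. (the three-way split below is just an illustration):

# e.g. TextOCR, synthetic plates, real plates -> 25% / 25% / 50% of each batch
r = [1., 1., 2.]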

siddagra commented 2 years ago

This is a good primer on SWA: https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/

SWA improves generalization. I decided to keep it in the training pipeline since it significantly improved CRNN performance (the difference is less apparent in the other models). Another side effect is that model selection becomes simpler (you always choose last.ckpt) since weights from various points during training are averaged.

I am still unsure from this answer what an optimal learning rate for SWA would be. Would it not matter much? Or should I use Ray Tune for it? I pretrained on synthetic datasets with concatenated labels. I have about 2500 images in the final fine-tuning training set and 300 in the validation set, if that matters.

For the OneCycleLR learning rate value, I used this article: https://towardsdatascience.com/finding-good-learning-rate-and-the-one-cycle-policy-7159fe1db5d6 which suggests the following:

The idea is to start with a small learning rate (like 1e-4 or 1e-3) and increase the learning rate after each mini-batch until the loss starts exploding. Once the loss starts exploding, stop the range-test run. Plot learning rate vs. loss. Choose the learning rate one order of magnitude lower than the learning rate where the loss is at its minimum (if the loss is lowest at 0.1, a good value to start with is 0.01). This is the value where the loss is still decreasing. The paper suggests this is a good learning rate value for the model.
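
To make the procedure concrete, here is a bare-bones version of that range test as I understand it (plain PyTorch; model, train_loader, and criterion are placeholders for whatever the training setup provides):

import math
import torch

def lr_range_test(model, train_loader, criterion, min_lr=1e-7, max_lr=1.0, num_steps=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=min_lr)
    gamma = (max_lr / min_lr) ** (1.0 / num_steps)  # multiplicative LR increase per batch
    lrs, losses, best = [], [], math.inf
    for _, (images, targets) in zip(range(num_steps), train_loader):
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]['lr'])
        losses.append(loss.item())
        best = min(best, loss.item())
        if loss.item() > 4 * best:  # stop once the loss clearly explodes
            break
        for group in optimizer.param_groups:
            group['lr'] *= gamma
    return lrs, losses  # plot loss vs. lr and pick an LR about one order below the minimum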

Does this approach seem good for this task?

siddagra commented 2 years ago

In my early experiments, I created and used a Sampler subclass which tried to cleanly implement the "batch-balanced" sampling of Baek et al. (clovaai/deep-text-recognition-benchmark): https://gist.github.com/baudm/fa08974319150c65caa96d6062b76aa9

This is how I used it before:

from torch.utils.data import ConcatDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# hierarchical_dataset comes from the deep-text-recognition-benchmark code;
# BatchBalancedSampler is from the gist linked above.
samplers = []
datasets = []
for d in ['MJ', 'ST']:
    dataset = hierarchical_dataset(root, self.hparams, select_data=d, transform=transform)[0]
    datasets.append(dataset)
    samplers.append(DistributedSampler(dataset))
r = [1., 1.]  # sampling weights: 50-50 split between MJ and ST
dataset = ConcatDataset(datasets)
sampler = BatchBalancedSampler(samplers, r, self.hparams.batch_size, False)
dataloader = DataLoader(dataset, num_workers=4, pin_memory=True, batch_sampler=sampler)

r contains the weights used for sampling: [1, 1] means a 50-50 split, i.e. 50% of each batch comes from MJ while the other 50% comes from ST.

I will try to add this. Thanks!

siddagra commented 2 years ago

Is there any way to completely disable language modelling, and what would be the most effective way to do so? I want to check its impact for research purposes.

baudm commented 2 years ago

Yes. You could simply use the NAR (non-autoregressive) branch of the inference code:

https://github.com/baudm/parseq/blob/8fa51009088da67a23b44c9c203fde52ffc549e5/strhub/models/parseq/system.py#L135-L137

and do the same for training in training_step().

zixuwang1996 commented 2 years ago

Hello thanks for the great work!

I was testing the model on single-line images with multiple words separated by whitespace. However, it seems PARSeq does not output the whitespace in single-line sequences. For example, for the shared image, PARSeq returns WHATIT'SACTUALLYLIKE while the expected output should be WHAT IT'S ACTUALLY LIKE

Can you please share some ideas on this?

baudm commented 2 years ago

Hello thanks for the great work!

I was testing the model on single-line images with multiple words separated by whitespace. However, it seems PARSeq does not output the whitespace in single-line sequences. For example, for the shared image, PARSeq returns WHATIT'SACTUALLYLIKE while the expected output should be WHAT IT'S ACTUALLY LIKE

Can you please share some ideas on this?


@zixuwang1996 that's because the whitespace character is explicitly not supported by this implementation. You can try adding support by including the whitespace character in the charset and disabling whitespace stripping (data.remove_whitespace=false).
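
If I recall the config layout correctly, that roughly amounts to overrides like these on the training command (treat the key names and charset string as an illustration rather than verified syntax; note the trailing space inside the quotes):

./train.py model.charset_train="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ " \
  model.charset_test="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ " \
  data.remove_whitespace=false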

zixuwang1996 commented 2 years ago

Thanks for your prompt reply. Just to clarify: this solution requires retraining, right (as I did not see whitespace in the current charset_train and charset_test), with the whitespace character added to the charset and whitespace stripping disabled?

I was running read.py to quickly get OCR-ed text from some custom images. Would the solution also be useful for inference directly?

WongVi commented 1 year ago

@bmusq I trained the model after adding whitespace to the character set, but at validation time the network was still not able to recognize whitespace. Could you please let me know what the problem might be? I also set data.remove_whitespace=false during training.

Thank you

kuldeeps1208 commented 1 year ago

Dear Authors/Members,

I also have similar queries on training the PARSeq model on multi-line text images, or more specifically number plates. After reading the above thread, and as per my requirement, I have collected around 4k images with text in multiple lines. Now my queries are:

  1. How should I specify the ground truth in the gt.txt file? (In the single-line case we can just write the letters/numbers on one line.)

  2. Once trained, what changes should I make in the code so as to get the necessary output?

Thanks in advance.

huyhoang17 commented 1 year ago

@kuldeeps1208 I also trained a model to recognize multi-line license plates and the model can handle it. The gt label is just a single line in the txt file.

kuldeeps1208 commented 1 year ago

@huyhoang17 Thank you for your response.

Can you provide an opinion on my 2nd point?

huyhoang17 commented 1 year ago

@kuldeeps1208 You should fine-tune the model from the pretrained weights; even with the default settings, you can still get a good result.

ZhanchongDeng commented 1 year ago

@siddagra

Hey, I am unable to load https://github.com/baudm/parseq/releases/download/v1.0.0/parseq_small_patch16_224-fcf06f5a.pt (the weights for 224x224) as the config file and/or state dict are not available in the checkpoint. Any way you could share the config file or the .ckpt?

Check the referenced commit. You can load the weights easily like so:

import torch
from strhub.models.utils import create_model

parseq = create_model('parseq-patch16-224')
parseq.load_state_dict(torch.load('parseq_small_patch16_224-fcf06f5a.pt', map_location='cpu'))

To finetune, just use ./train.py +experiment=parseq-patch16-224 and manually load the weights inside train.py.

I believe the 224x224 model is a great addition to the PARSeq family. However, I spent a lot of time finding this snippet to load the model. PARSeq small and PARSeq tiny are both loadable from torch.hub, whereas PARSeq 224 requires a function from the repo. Please consider adding this snippet to the documentation or (even better) supporting PARSeq 224 in torch.hub as well.

baudm commented 1 year ago

@ZhanchongDeng

I believe the 224x224 model is a great addition to the PARSeq family. However, I spent a lot of time finding this snippet to load the model. PARSeq small and PARSeq tiny are both loadable from torch.hub, whereas PARSeq 224 requires a function from the repo. Please consider adding this snippet to the documentation or (even better) supporting PARSeq 224 in torch.hub as well.

Please check commit ec46a86e4faf703ce158aa6c53fbcbdea84b36a5