baudm / parseq

Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
https://huggingface.co/spaces/baudm/PARSeq-OCR
Apache License 2.0

how do you do pretrained vitstr? #28

Closed: daeing closed this issue 2 years ago

daeing commented 2 years ago

Thanks for your excellent work! I notice your ViTSTR-Small experiment gets a higher result than ViTSTR-Base; the reason may be that you use the MJ-ST dataset (about 5.5M samples). Is the input (32, 128) and the patch size (4, 8)? What is your max sequence length? With a (32, 128) input and a (4, 8) patch size, I get a tensor of shape (b, 32/4 * 128/8 + 1, hidden_dim). If max_seq is set to 25, something may go wrong. Does the dataset you offer contain labels longer than 25 + 2?
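For reference, the token count implied by these numbers can be checked with a quick sketch (plain arithmetic, not code from either repository):

```python
# A (32, 128) image split into (4, 8) patches, plus one prepended
# [CLS]-style token, gives the encoder sequence length in the question.
img_h, img_w = 32, 128
patch_h, patch_w = 4, 8

num_patches = (img_h // patch_h) * (img_w // patch_w)  # 8 * 16 = 128
seq_len = num_patches + 1  # +1 for the prepended token -> 129
print(num_patches, seq_len)  # 128 129
```

Note that this encoder sequence length (129) is independent of the decoding length max_seq (25), which only bounds the text label.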

baudm commented 2 years ago

The training data used does indeed affect the results significantly. The original ViTSTR-B used SynthText from clovaai/deep-text-recognition-benchmark, which has fewer samples than the one used in this work and in the ABINet paper (https://github.com/FangShancheng/ABINet/issues/30#issuecomment-895710499).

I'm sorry but I don't understand your question. All the models use the same data configuration:

https://github.com/baudm/parseq/blob/8fa51009088da67a23b44c9c203fde52ffc549e5/configs/main.yaml#L9-L10

The ViTSTR-S in the paper uses the following config:

https://github.com/baudm/parseq/blob/8fa51009088da67a23b44c9c203fde52ffc549e5/configs/experiment/vitstr.yaml#L5-L7

daeing commented 2 years ago

> The training data used does indeed affect the results significantly. The original ViTSTR-B used SynthText from clovaai/deep-text-recognition-benchmark which has fewer samples than the one used in this work and in the ABINet paper (FangShancheng/ABINet#30 (comment)).
>
> I'm sorry but I don't understand your question. All the models use the same data configuration:
>
> https://github.com/baudm/parseq/blob/8fa51009088da67a23b44c9c203fde52ffc549e5/configs/main.yaml#L9-L10
>
> The ViTSTR-S in the paper uses the following config: https://github.com/baudm/parseq/blob/8fa51009088da67a23b44c9c203fde52ffc549e5/configs/experiment/vitstr.yaml#L5-L7

The original ViTSTR may take the top 25 tokens and add 2 tokens for decoding. But when I used the MJ-ST dataset you provide (the ABINet version), I got this error:

```
RuntimeError: The expanded size of the tensor (27) must match the existing size (28) at non-singleton dimension 0. Target sizes: [27]. Tensor sizes: [28]
```

It may be that a label is longer than the max label length. So I am asking: if you use a (32, 128) input and a (4, 8) patch size, what do you do after this step?

https://github.com/roatienza/deep-text-recognition-benchmark/blob/ea0d07737e334a97aa0a7df9af3118f85a2b49c2/modules/vitstr.py#L78
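The shape mismatch can be reproduced in isolation (an illustrative sketch, not code from either repository; requires PyTorch): copying a 28-token encoded label into a buffer sized for 27 positions raises a similar RuntimeError, which points to a ground-truth label longer than 25 characters.

```python
import torch

# Buffer sized for batch_max_length + 2 = 27 positions per sample.
buffer = torch.zeros(27, dtype=torch.long)
# An encoded label of 26 characters + [GO] + [s] -> 28 tokens.
encoded = torch.arange(28)

try:
    buffer.copy_(encoded)  # cannot broadcast 28 tokens into 27 slots
except RuntimeError as e:
    print(e)
```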

baudm commented 2 years ago

Sorry but I don't understand what you are trying to do.

The original ViTSTR defined two special tokens, [GO] and [s], and wrapped the text between them:

https://github.com/roatienza/deep-text-recognition-benchmark/blob/ea0d07737e334a97aa0a7df9af3118f85a2b49c2/utils.py#L175

resulting in a batch_max_length of 25 + 2 = 27.

The seqlen parameter is actually overridden by the training/testing code: https://github.com/roatienza/deep-text-recognition-benchmark/blob/ea0d07737e334a97aa0a7df9af3118f85a2b49c2/train.py#L183 In this case, it will be set to seqlen=27.

In the actual decoding, the logits for the first position are discarded:

https://github.com/roatienza/deep-text-recognition-benchmark/blob/ea0d07737e334a97aa0a7df9af3118f85a2b49c2/test.py#L153

resulting in a sequence length of 26. The additional character is for [s], which eventually gets truncated:

https://github.com/roatienza/deep-text-recognition-benchmark/blob/ea0d07737e334a97aa0a7df9af3118f85a2b49c2/test.py#L176-L178

Hence the implementation here: https://github.com/baudm/parseq/blob/8fa51009088da67a23b44c9c203fde52ffc549e5/strhub/models/vitstr/system.py#L48-L52

daeing commented 2 years ago

Yes, I tried the original ViTSTR code and changed the dataset to the MJ-ST data you provide (the ABINet version), but I got this error:

```
RuntimeError: The expanded size of the tensor (27) must match the existing size (28) at non-singleton dimension 0. Target sizes: [27]. Tensor sizes: [28]
```

This might be because a label's length is 28, longer than max_label_length + 2 ([GO] + [s]), so I wondered how you handle this situation.

baudm commented 2 years ago

I still don't understand what you're trying to do. If you're using the original ViTSTR code, then that is out of the scope of this project.

If you're using the training code here, then there shouldn't be any problem and you shouldn't get that error.

daeing commented 2 years ago

OK, I'm sorry about that. Maybe I misunderstood something; I'll read your paper more carefully.