FangShancheng / ABINet

Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Can not reproduce the pretrained vision model #30

Open yxgnahz opened 2 years ago

yxgnahz commented 2 years ago

Hello, I ran the code directly using the pretrain_vision_model.yaml setting; here are the results of the trained model:

Benchmark | Accuracy
--- | ---
IC13 | 92.6
SVT | 87.2
IIIT | 88.1
IC15 | 78.7
SVTP | 81.4
CUTE80 | 79.5
Average | 85.0

It seems that the released pretrained vision model has an average accuracy of about 90%, so could you please tell me whether you used pretrain_vision_model.yaml to pretrain the vision model, and whether you used any additional tricks or data to train it?

FangShancheng commented 2 years ago

Hi, @yxgnahz

  1. Could you please describe your training environment and check which step differs (or may differ) from the description in the README? I cannot make a judgment without detailed information.
  2. The released models were trained directly with the given configurations, which also follow our paper.
  3. The data is also the same as that available at the provided links. Note that the data is cropped directly from the original datasets using the GT boxes, and we have provided conversion tools.
  4. It would be helpful if you provided your detailed training settings.
ccx1997 commented 2 years ago

Is the training data the same as that released by clovaai (https://github.com/clovaai/deep-text-recognition-benchmark)? I used their release, but can only get 85.2% by pretraining the vision model.

yxgnahz commented 2 years ago

I have probably found the reason. I used the data from clovaai (https://github.com/clovaai/deep-text-recognition-benchmark), and found that its ST contains about 5M images, while the ST in this repo has more than 6M. Using the data from clovaai yields only about 85% average accuracy when pretraining the vision model.

FangShancheng commented 2 years ago

Hi, @ccx1997 @yxgnahz

  1. To reproduce the reported accuracy, use the data processing method of previous SOTA works (SRN, ASTER, SEED, TextScanner, etc.), which directly crops images from the SynthText dataset using the GT boxes; a minimal cropping sketch follows this list. You can use our conversion tool tools/crop_by_word_bb_syn90k.py, our released data, or the data from this repo (ASTER: https://github.com/ayumiymk/aster.pytorch).
  2. The data from clovaai (https://github.com/clovaai/deep-text-recognition-benchmark) was cropped with a perspective transform, which flattens curved and distorted text (hence less diversity) and also loses some valid examples (hence fewer images).
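For illustration, here is a minimal sketch of the GT-box cropping idea, assuming the standard SynthText gt.mat layout ('imnames', 'wordBB'); it is not the repo's actual tools/crop_by_word_bb_syn90k.py. Each word quadrilateral is cropped with its axis-aligned bounding rectangle, so curved or rotated text is kept as-is rather than rectified by a perspective transform:

  # Minimal sketch, not the repo's crop_by_word_bb_syn90k.py: crop word
  # images from SynthText using the GT boxes in the standard gt.mat layout.
  import os
  import scipy.io as sio
  from PIL import Image

  def crop_words(synthtext_root, out_dir):
      gt = sio.loadmat(os.path.join(synthtext_root, 'gt.mat'))
      os.makedirs(out_dir, exist_ok=True)
      for i, imname in enumerate(gt['imnames'][0]):
          img = Image.open(os.path.join(synthtext_root, imname[0]))
          boxes = gt['wordBB'][0][i]      # 2 x 4 x N word quadrilaterals
          if boxes.ndim == 2:             # a single word comes as 2 x 4
              boxes = boxes[:, :, None]
          for j in range(boxes.shape[-1]):
              xs, ys = boxes[0, :, j], boxes[1, :, j]
              # Axis-aligned bounding rectangle of the quadrilateral, so
              # curved/rotated text is kept as-is instead of rectified.
              left, top = max(int(xs.min()), 0), max(int(ys.min()), 0)
              right = min(int(xs.max()) + 1, img.width)
              bottom = min(int(ys.max()) + 1, img.height)
              if right <= left or bottom <= top:
                  continue                # degenerate box, skip it
              img.crop((left, top, right, bottom)).save(
                  os.path.join(out_dir, '%d_%d.jpg' % (i, j)))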
baudm commented 2 years ago

@FangShancheng using your conversion tool (crop_by_word_bb_syn90k.py), I only get 5,295,444 valid samples from SynthText. 192,708 samples generated errors and were rejected by your script. Based on @yxgnahz's comment, your ST archive contains more than 6M images. Did you use the same script?

Meanwhile, clovaai's ST archive contains 5,522,807 images.

FangShancheng commented 2 years ago

Hi @baudm, we re-checked the script (crop_by_word_bb_syn90k.py) and found the discrepancy: the script filters out text that originally contains special tokens. Change the code at line 59 from:

  if len_now - len(txt_temp) != 0:
      # print('txt_temp-2-', txt_temp)
      continue

to

  if len_now - len(txt_temp) != 0:
      print('txt_temp-2-', txt_temp)
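      # the mismatch is now only logged; the sample is kept, not skipped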

We will update the script later. Thanks for the reminder, and we look forward to your feedback too.

baudm commented 2 years ago

Thanks, @FangShancheng. The script now generates more samples.

For completeness, and for everyone else generating the data from scratch, here's a comparison between the ClovaAI data and my generated data (using the scripts here) in terms of number of samples:

Dataset | ClovaAI | Generated
--- | --- | ---
MJ_train | 7,224,586 | 7,224,600
MJ_test | 891,924 | 891,924
MJ_val | 802,731 | 802,733
SynthText | 5,522,807 | 7,003,173

I don't know why there's a discrepancy in the MJSynth samples since no processing is being done there and both projects use the exact same script.

FangShancheng commented 2 years ago

@baudm Good job.

  1. After checking the released data that we used to train our models, there are 6,976,115 images for the SynthText dataset in the LMDB, which is fewer than 7,003,173, and we indeed used the same crop script.
  2. The MJSynth dataset already provides cropped images, so how did you use the crop script to get 7,224,600 images?
  3. One possible reason for the discrepancy in MJSynth between ClovaAI's archive and your generated images, and between our released LMDB dataset and your generated images, is create_lmdb_dataset.py, which also filters out some invalid images; a paraphrased sketch of this kind of check is shown below.
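For context, the invalid-image filter in a clovaai-style create_lmdb_dataset.py is roughly of the following form (a paraphrased sketch, not the verbatim source), which is why some samples can be dropped at LMDB-creation time:

  # Paraphrased sketch of a clovaai-style validity check at LMDB-creation
  # time; not the verbatim create_lmdb_dataset.py source.
  import cv2
  import numpy as np

  def check_image_is_valid(image_bin):
      if image_bin is None:
          return False
      buf = np.frombuffer(image_bin, dtype=np.uint8)
      img = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE)
      if img is None:
          return False                    # undecodable bytes
      h, w = img.shape[0], img.shape[1]
      return h * w > 0                    # reject zero-sized decodes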
FangShancheng commented 2 years ago

Hi @baudm, we now provide a mirror of the dataset, which does not require an account to download.

MJ: https://rec.ustc.edu.cn/share/578cfbf0-fc5b-11eb-b3eb-d38a253722d6
ST: https://rec.ustc.edu.cn/share/69402a20-fc5b-11eb-8d52-7d4a03b38119

baudm commented 2 years ago

Thanks. Upon further checking, it seems the ClovaAI MJSynth archive is correct. I modified create_lmdb_dataset.py to use PIL.Image to check image validity. cv2.imdecode() seems to read the image headers but doesn't actually decode the image contents, so a few corrupted images were missed. After the modification, I got exactly the same number of samples as in the ClovaAI archives.
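As a hedged illustration of that modification (the function name is mine, and the exact edit to create_lmdb_dataset.py may differ), forcing a full decode with PIL catches truncated pixel data that a header-only check misses:

  # Sketch of a stricter validity check using PIL; is_valid_image is a
  # hypothetical name, and the exact modification may differ.
  import io
  from PIL import Image

  def is_valid_image(image_bin):
      try:
          img = Image.open(io.BytesIO(image_bin))
          img.verify()                    # structural/header check only
          # verify() leaves the parser unusable, so reopen and force a
          # full decode; this is what catches corrupted image contents.
          img = Image.open(io.BytesIO(image_bin))
          img.load()
      except Exception:
          return False
      w, h = img.size
      return w > 0 and h > 0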

Thanks for the archive mirrors!

Update: I also reproduced your ST dataset by filtering out samples whose labels don't contain any alphanumeric characters. The final count is also 6,976,115.
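That filtering rule might look like this as code (my own sketch of the described rule, not the actual script):

  # Sketch: keep a sample only if its label contains at least one
  # alphanumeric character.
  import re

  def keep_sample(label):
      return re.search(r'[0-9a-zA-Z]', label) is not None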

HHeracles commented 2 years ago

@FangShancheng Hi, thanks for your work! I used your provided dataset, pretrained models, and config files to reproduce the experimental results in your paper. I got the following results:

Model | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | AVG
--- | --- | --- | --- | --- | --- | --- | ---
ABINet-SV | 97.1 | 92.7 | 95.2 | 84.0 | 86.7 | 88.5 | 91.4
ABINet-LV | 97.0 | 93.2 | 96.4 | 85.9 | 89.0 | 89.2 | 92.6

The results of your provided ABINet-LV pretrained model are almost the same as in the paper, but the results of your provided ABINet-SV pretrained model are substantially lower than those given in the paper. What is the reason for this? What further steps should I take to reproduce the results given in the paper?

My environment is as follows: Python 3.7.2, torch 1.4.0.

FangShancheng commented 2 years ago

The results you give are almost the same as those of our provided models, based on the statistics above. Did you miss something important? @HHeracles

HHeracles commented 2 years ago

Everything I used was provided by you, including the virtual environment, datasets, default configurations, pretrained models, etc.

FangShancheng commented 2 years ago

So what accuracy do you get for ABINet-SV now? The reported accuracy of ABINet-SV is about 91.4. @HHeracles

HHeracles commented 2 years ago

The reported accuracy of ABINet-SV is about 90.2, and the accuracy of ABINet-LV is about 92.6. The ABINet-SV average in the table above is a clerical error. Sorry.

FangShancheng commented 2 years ago

Do you mean that you obtained only 90.2 accuracy for ABINet-SV, while the reported accuracy of the released models is about 91.4? What was your training time, and could you provide your training log for further checking?

HHeracles commented 2 years ago

Yes. I did not do any training; I just used the pretrained model you provided, that is, best-pretrain-vision-model.pth from https://pan.baidu.com/share/init?surl=b3vyvPwvh_75FkPlp87czQ.

HHeracles commented 2 years ago

Thank you for your reply. I found the reason: I loaded pretrain_vision_model.yaml instead of pretrain_vision_model_sv.yaml.