FangShancheng / ABINet

Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Questions for reimplementing ABINet #6

Open HumanZhong opened 3 years ago

HumanZhong commented 3 years ago

Thanks for your great work. I'm having some trouble re-implementing the vision part of your model, and I'd like to ask for more experimental details, if possible.

  1. I noticed that SRN uses ResNet50 as its backbone, while ABINet uses a much lighter backbone with only 5 residual blocks (it looks like a ResNet18 or even lighter) for feature extraction (according to the footnote in your arXiv paper), yet still achieves comparable results. Could you provide the detailed structure of your ResNet backbone as well as of the mini U-Net? Also, could you provide the configurations of your SV (small vision model), MV (medium vision model), and LV (large vision model)?

  2. Are the positional encoding and the order embedding (used as Q in the attention) hard-coded or learned? Do different encoding methods affect performance much?

  3. Could you provide the detailed parameters of your augmentation methods? How much does performance differ with and without data augmentation?

  4. Approximately how long does it take for the model to converge on 4x 1080 Ti GPUs?

Thanks again for your work. Looking forward to your reply.

HumanZhong commented 3 years ago

I've sent an email to the authors and received their reply. To help researchers who want to reimplement the model, here are their responses:

  1. The backbone used in the paper is ResNet45, exactly the same as the one used in ASTER and DAN. "5 residual blocks" in the paper actually means 5 residual stages (layers). (A rough backbone sketch is below.)
  2. Q is hard-coded and is projected by an fc layer. (A rough sketch of the query construction is also below.)
  3. The augmentation strategy will be released along with the code; it gives a good performance gain, especially on irregular text.
  4. About 4-5 days.
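
For answer 1, here is a minimal sketch of how I read "ResNet45 with 5 residual stages". The block counts and channels below follow the ASTER ResNet45 ([3, 4, 6, 6, 3] blocks, 32 to 512 channels), and the down-sampling strides are my own guess, not confirmed by the authors:

```python
# Sketch of a ResNet45-style backbone with 5 residual stages.
# Block counts/channels assumed from ASTER; strides are my guess so that a
# 32x128 input yields an 8x32 feature map.
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


class ResNet45(nn.Module):
    """5 residual stages; a stage = several BasicBlocks at one width."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, 1, 1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
        )
        # (out_channels, num_blocks, first-block stride) per stage -- assumed values
        cfg = [(32, 3, 2), (64, 4, 1), (128, 6, 2), (256, 6, 1), (512, 3, 1)]
        stages, in_ch = [], 32
        for out_ch, num_blocks, stride in cfg:
            blocks = [BasicBlock(in_ch, out_ch, stride)]
            blocks += [BasicBlock(out_ch, out_ch) for _ in range(num_blocks - 1)]
            stages.append(nn.Sequential(*blocks))
            in_ch = out_ch
        self.stages = nn.Sequential(*stages)

    def forward(self, x):                      # x: (B, 3, 32, 128)
        return self.stages(self.stem(x))       # -> (B, 512, 8, 32) with these strides
```

For answer 2, this is how I interpret "Q is hard-coded and projected by an fc layer": a fixed sinusoidal position table for the character positions, projected by a linear layer and used as the query of the position attention. Module names and sizes here are my own:

```python
# Sketch of the hard-coded positional query: fixed sinusoidal table -> fc layer -> Q.
# Names and sizes (max_len=26, dim=512) are my own assumptions.
import math
import torch
import torch.nn as nn


def sinusoidal_encoding(max_len, dim):
    """Standard fixed sinusoidal table of shape (max_len, dim); dim assumed even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class PositionQuery(nn.Module):
    def __init__(self, max_len=26, dim=512):
        super().__init__()
        self.register_buffer("pe", sinusoidal_encoding(max_len, dim))
        self.proj = nn.Linear(dim, dim)  # the "fc layer" from the authors' reply

    def forward(self, batch_size):
        q = self.proj(self.pe)                             # (max_len, dim)
        return q.unsqueeze(0).expand(batch_size, -1, -1)   # (B, max_len, dim)
```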
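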

Besides, I've attempted to reimplement the vision part of this model, but got a much worse result: my vision model reaches an average accuracy of about 85%, while the paper reports over 88%.

If anyone else has tried to reimplement it or has achieved higher performance, please leave a message here describing how you did it. That would be a great help; thanks in advance.

tambourine666 commented 3 years ago

I use parallel attention for the vision model, but still cannot hit the 88.8% mentioned in the paper. 😅

FangShancheng commented 3 years ago

How is your accuracy now? Do you use a Transformer as the sequence modeling layer, and do you use data augmentation? We'll release our code next week or so.