baudm / parseq

Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
https://huggingface.co/spaces/baudm/PARSeq-OCR
Apache License 2.0

Fixed Decoding Schemes during inference and Random permutations during training #43

Closed · ChinmayMittal closed this issue 2 years ago

ChinmayMittal commented 2 years ago

Hi!!! I appreciate how using different permutation sequences during training can help STR models. I have a few questions that I could not find answers to in the paper.

1) Although the model is trained on several permutations, the decoding schemes at inference time are fixed, i.e. left to right for the AR model. How does the permutation training help this inference (e.g. how does training on, say, the right-to-left permutation help left-to-right decoding at inference)? Why not try several permutations during inference and pick the best one?

2) Are there any ablations for the values of K? For example, does choosing K=6 give any benefit over K=1 (a fixed left-to-right permutation during training)?

baudm commented 2 years ago

Please see #42.

  1. As stated in the paper, we only use two contrasting schemes (AR and NAR), although countless variations of decoding schemes can be used. This keeps the decoding process easier to understand and more comparable to prior work. Training on random permutations allows PARSeq to act as an ensemble of AR models: it effectively learns the conditional character probabilities required to implement any kind of decoding, from non-autoregressive to autoregressive, and everything in between, such as semi-autoregressive and iterative-refinement models. With the released weights, you can implement whatever decoding strategy you want without retraining the model (a toy sketch of the permutation setup follows after the reference below).

  2. Yes. This is discussed in Section 4.4 of the paper. In hindsight, the results we got for K resemble Occam's Hill. It is also somewhat related to model compression due to parameter sharing [1].

[1] Hoefler, Torsten, et al. "Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks." J. Mach. Learn. Res. 22.241 (2021): 1-124.
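
For concreteness, here is a minimal, self-contained sketch of the permutation-training idea: each sampled factorization order is turned into an attention mask that controls which ground-truth characters each position may condition on during training. The function and variable names are illustrative only; the actual implementation in this repository differs in details (e.g. special tokens and how the K permutations are chosen).

```python
import torch

def gen_perm_attention_mask(perm: torch.Tensor) -> torch.Tensor:
    """Build a decoder attention mask for one permutation (illustrative sketch).

    perm: 1-D tensor giving the order in which the T character positions are
    predicted, e.g. torch.arange(T) for plain left-to-right AR decoding.
    Returns a (T, T) boolean mask where mask[i, j] is True when position i may
    attend to position j's ground-truth character during training.
    """
    T = perm.numel()
    mask = torch.zeros(T, T, dtype=torch.bool)
    for k in range(T):
        query = perm[k]
        # Position `query` may look at every character that comes before it
        # in this particular factorization order.
        mask[query, perm[:k]] = True
    return mask

def sample_permutations(T: int, K: int) -> torch.Tensor:
    """Sample K factorization orders, always keeping the canonical one so that
    standard left-to-right AR decoding is covered."""
    perms = [torch.arange(T)]                 # left-to-right
    while len(perms) < K:
        perms.append(torch.randperm(T))       # random order
    return torch.stack(perms)

perms = sample_permutations(T=5, K=6)
masks = torch.stack([gen_perm_attention_mask(p) for p in perms])
print(masks.shape)  # torch.Size([6, 5, 5]) -- one attention mask per permutation
```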

ChinmayMittal commented 2 years ago

Thanks for your quick response!!! If I use no permutations, i.e. K=1, your model still seems to give state-of-the-art results on most datasets. What is the differentiating factor of your architecture in this case that makes it state of the art?

baudm commented 2 years ago

In terms of architecture, it really isn't any different from other Transformer-based encoder-decoder models like SATRN [1]. One major difference is that we used a deep-encoder, shallow-decoder configuration, which we found to be much more efficient than configurations with deep decoders. Kasai et al. [2] arrived at the same conclusion for NLP tasks. It also allows the use of a bidirectional (cloze) mask without complications.
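
As a rough illustration of the deep-encoder/shallow-decoder layout (the layer counts, dimensions, and vanilla PyTorch modules below are illustrative assumptions, not the actual PARSeq classes):

```python
import torch
import torch.nn as nn

d_model, nhead = 384, 8

# Deep encoder: most of the capacity lives here, and it runs only once per image.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=4 * d_model,
                               batch_first=True),
    num_layers=12,
)
# Shallow decoder: cheap per decoding step or refinement iteration.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=4 * d_model,
                               batch_first=True),
    num_layers=1,
)

img_tokens = torch.randn(2, 128, d_model)   # e.g. flattened patch embeddings
char_queries = torch.randn(2, 26, d_model)  # one query per character position

memory = encoder(img_tokens)
char_features = decoder(char_queries, memory)  # cross-attends to visual features
print(char_features.shape)                     # torch.Size([2, 26, 384])
```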

The training dataset actually plays a big part in the final results [3, 4]. However, in our paper we showed that the standard STR benchmarks are insufficient for evaluating differences between STR models. Once trained on real data, all models start to get really high accuracies on the standard benchmarks. So if you look at the results on the standard benchmarks, you'd see very little difference between K = 1 and PARSeq.

One thing that deserves more experimentation is the effect of K on word accuracy when evaluated on harder datasets like Uber-Text, COCO-Text, etc. I have visualizations of the attention masks for K = 1 and for K > 2 (these did not make it into the paper), and the differences are very clear. For standard AR modeling (K = 1), the visual attention mask for each position tends to focus only on a single character. When K > 2, the visual attention also tends to consider nearby characters. To me, this signifies a more robust model overall: it means that the "vision part" of the model also learns to take language context into consideration.
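
If you want to reproduce this kind of comparison yourself, one generic way is to capture attention weights with forward hooks. This is only a sketch under the assumption that the attention modules of interest are plain nn.MultiheadAttention instances; adapt the filter to the real module names of the checkpoint you load.

```python
import torch.nn as nn

captured = {}

def make_hook(name):
    def hook(module, args, output):
        # nn.MultiheadAttention returns (attn_output, attn_weights); the
        # weights are None if the call was made with need_weights=False.
        if isinstance(output, tuple) and output[1] is not None:
            captured[name] = output[1].detach().cpu()
    return hook

def attach_attention_hooks(model):
    """Register a hook on every nn.MultiheadAttention submodule."""
    return [
        m.register_forward_hook(make_hook(n))
        for n, m in model.named_modules()
        if isinstance(m, nn.MultiheadAttention)
    ]

# After a forward pass on a word image, `captured` holds one attention tensor
# per hooked module (batch x target-length x source-length by default), which
# can then be plotted per character position, e.g. with matplotlib.
```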

[1] Lee, Junyeop, et al. "On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.
[2] Kasai, Jungo, et al. "Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation." ICLR. 2021.
[3] Baek, Jeonghun, et al. "What Is Wrong with Scene Text Recognition Model Comparisons? Dataset and Model Analysis." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[4] https://github.com/FangShancheng/ABINet/issues/30#issuecomment-895710499

ChinmayMittal commented 2 years ago

Thanks for your response.