As mentioned in the paper, you use 20% of the training data (around 16M * 0.2 = 3.2M images) to train the model. I have some questions about this.
The baseline model ABINet is trained in three stages: vision pre-training, language pre-training, and final training.
In your setting, you use 20% of the data for final training, and the BCN language model is clearly pre-trained on WikiText.
But how much data is used for vision pre-training: 100% of the training data or only 20%?
Hope to receive your answer. Thanks!