FangShancheng / ABINet

Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Request for an explanation of a detail in the paper #97

Open menglin0320 opened 1 year ago

menglin0320 commented 1 year ago

"Another observation is that Transformer-based methods generally suffer from unaligned-length problem [49], which denotes that the Transformer is hard to correct the vision prediction if character number is unaligned with ground truth. The unaligned-length problem is caused by the inevitable implementation of padding mask which is fixed for filtering context outside text length. Our iterative LM can alleviate this problem as the visual feature and linguistic feature are fused several times, and thus the predicted text length is also refined gradually."

What problem is this passage from the paper referring to? As far as I can tell, this framework isn't suited to complex layouts or very long text anyway. Could someone explain what exactly is being solved here?
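For anyone else reading the quoted passage, here is a minimal sketch of how I understand the "padding mask fixed by text length" part. It is not taken from the ABINet code: `padding_mask`, `MAX_LENGTH`, and the example strings are hypothetical names chosen for illustration, and it assumes the mask is built from the length of the vision prediction.

```python
# Illustrative sketch only; names and values are assumptions, not ABINet's API.

def padding_mask(pred_length, max_length):
    """Return a list where True marks a position the language model may not attend to."""
    return [pos >= pred_length for pos in range(max_length)]

MAX_LENGTH = 8            # hypothetical fixed sequence length

ground_truth = "hello"    # 5 characters
vision_pred = "helo"      # vision model dropped one character, so length is 4

# The mask is derived from the (wrong) predicted length.
mask = padding_mask(len(vision_pred), MAX_LENGTH)

# Positions the language model can still use as context.
visible = [pos for pos, masked in enumerate(mask) if not masked]

# Index of the slot that the fifth ground-truth character would need.
slot_needed = len(ground_truth) - 1

print(visible)
print(mask[slot_needed])
```

Under these assumptions the slot for the fifth character is masked, so a single language-model pass has no context there to restore the missing character. If the predicted length is re-estimated after each fusion step, as the passage says, the mask would change between iterations.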