Another observation is that Transformer-based methods
generally suffer from the unaligned-length problem [49]: the
Transformer struggles to correct the vision prediction when the
number of characters differs from the ground truth. The
unaligned-length problem is caused by the unavoidable use of a
padding mask, which is fixed and filters out context beyond the
predicted text length. Our iterative LM can alleviate this
problem because the visual and linguistic features are fused
several times, so the predicted text length is also refined
gradually.
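The padding-mask mechanism described above can be sketched as follows. This is a minimal illustration with hypothetical shapes and function names (not the paper's actual implementation): a mask built from a fixed predicted length zeroes out attention to positions beyond that length, so a mispredicted length cannot be corrected in a single pass, while iterative fusion rebuilds the mask from a refreshed length estimate each round.

```python
import numpy as np

def padding_mask(pred_len: int, max_len: int) -> np.ndarray:
    """1 for positions inside the predicted text length, 0 outside."""
    return (np.arange(max_len) < pred_len).astype(float)

def masked_attention(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply the padding mask before softmax: masked keys get -inf scores."""
    scores = np.where(mask[None, :] > 0, scores, -np.inf)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

max_len = 8
scores = np.zeros((max_len, max_len))  # uniform scores, for illustration only

# If the vision branch predicts length 5 but the ground truth has 6
# characters, the 6th position is filtered out by the fixed mask and
# receives zero attention -- it cannot be recovered in one pass.
attn = masked_attention(scores, padding_mask(5, max_len))
assert attn[0, 5] == 0.0

# Iterative refinement: each fusion round re-estimates the length and
# rebuilds the mask, so a previously filtered position can re-enter.
for pred_len in (5, 6):  # length estimate corrected across iterations
    attn = masked_attention(scores, padding_mask(pred_len, max_len))
assert attn[0, 5] > 0.0  # the 6th character is attended after refinement
```

The key point of the sketch is that the mask itself is not learnable; only by re-running the fusion with an updated length estimate can positions outside the original mask be reconsidered.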
What problem is this passage referring to? This framework presumably doesn't handle complex layouts or very long text, right? Can someone explain what problem is actually being solved here?