We are studying TIIM and ran into several points of confusion; the most critical are:
The code uses model/transformer/Transformer rather than model/transformer/TransformerMonotonic, even though monotonic attention (MoCha) appears to be the main idea emphasized in the paper.
In TransformerMonotonic, the image features are organized as H x NW x C, which means the features are scanned row by row, not column by column. The paper, however, puts its emphasis on columns and explains why attending along the vertical (column) axis benefits the translation.
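To make the distinction concrete, here is a minimal toy sketch (not the actual TIIM code) of how the flattening order changes the sequence a monotonic decoder would scan. The variable names are mine, purely for illustration:

```python
# Toy sketch of scan order over an H x W feature map.
# Row-by-row (H x W -> length H*W) vs column-by-column (W x H -> length W*H).
H, W = 2, 3  # hypothetical feature-map height and width

# Row-major: iterate rows first, as an H x (N*W) x C layout implies.
row_major = [(h, w) for h in range(H) for w in range(W)]

# Column-major: iterate columns first, as the paper's vertical emphasis suggests.
col_major = [(h, w) for w in range(W) for h in range(H)]

print(row_major)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(col_major)  # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```

The point of the question: if the code flattens row-major, the monotonic attention moves horizontally across the image, which seems to contradict the column-wise reasoning in the paper.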
Did I misunderstand the paper or the code? Please correct me if I'm wrong.
Thanks!