JDAI-CV / image-captioning

Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]

transformer results #2

Closed homelifes closed 4 years ago

homelifes commented 4 years ago

Hello. Thanks for your work and for sharing the code. Could you please share the details of the pure Transformer model you implemented, which achieves 128.3 CIDEr? To the best of my knowledge, all published implementations reach a maximum of around 126.6, according to the papers that use the Transformer model. Your paper does not provide details on the Transformer baseline, and there is no supplementary material. So may I kindly ask for the details of your re-implementation of the pure Transformer that achieves 128.3?

Panda-Peter commented 4 years ago

The Transformer implementation follows https://github.com/ruotianluo/self-critical.pytorch, which can achieve a CIDEr score of ~128.

homelifes commented 4 years ago

Hi @Panda-Peter. Thanks for your reply. I am actually following his code, but according to the results reported there, it achieves 1.266 with standard self-critical training. It reaches 1.295 only with the new self-critical variant he proposed, which you do not use. Since his reported score for the Transformer is 1.266 (Transformer+self_critical | 1.266), may I ask how you achieved 1.283 and what changes you made to his code? Thanks a lot for your kind help.

Panda-Peter commented 4 years ago

We also implemented the Transformer baseline based on this code. However, we found that the default hyper-parameters are not optimal. You can tune them and obtain ~1.283 CIDEr.
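Since the maintainers do not list which hyper-parameters they tuned, the sketch below only illustrates the kind of knobs typically adjusted for a Transformer captioning baseline (model width, warmup, label smoothing, beam size), along with the Noam learning-rate schedule from "Attention Is All You Need" that such codebases commonly use. All names and values here are illustrative assumptions, not the repository's actual configuration.

```python
# Hypothetical sketch: hyper-parameters often tuned for a Transformer
# captioning baseline. These keys and values are illustrative only and
# do NOT reflect the actual settings used to reach 1.283 CIDEr.
config = {
    "d_model": 512,          # hidden size of the Transformer
    "num_heads": 8,          # attention heads per layer
    "num_layers": 6,         # encoder/decoder depth
    "dropout": 0.1,
    "label_smoothing": 0.1,  # cross-entropy smoothing factor
    "warmup_steps": 20000,   # Noam schedule warmup
    "beam_size": 3,          # decoding beam width
}

def noam_lr(step: int, d_model: int = 512, warmup: int = 20000,
            factor: float = 1.0) -> float:
    """Noam schedule: linear warmup, then inverse-square-root decay.

    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The learning rate peaks exactly at `warmup` steps and decays afterwards; shifting the warmup length and peak scale (`factor`) is one of the most common tweaks when re-tuning such a baseline.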