Hi! I have read the mPLUG-2 paper; it's really a great vision-language foundation model with a fantastic design.
However, I have some doubts about the fairness of the SOTA comparison:
The visual encoder is initialized from the CLIP visual encoder. For a fair comparison and a better conclusion, I think the SOTA comparison should include CLIP-based pretraining methods.
For video-text retrieval, why do you report Singularity's 5M pretraining results? Its 17M pretraining results are actually better than mPLUG-2 Base.
For video action recognition, some CLIP-based methods are missing, e.g., X-CLIP and UniFormerV2. Moreover, the models are actually fine-tuned on Kinetics-710; this should be noted in the main text to avoid misleading readers. Besides, Kinetics-710 was first proposed in UniFormerV2, so a citation should be added.
We will include it in the updated version. In fact, mPLUG-2 Base surpasses Singularity (17M) with a 7% improvement on R1 (41.5 vs. 48.3) for MSRVTT, and it also achieves better performance than Singularity on DiDeMo in terms of R5 and R10.
We aim to demonstrate the generalization ability of our proposed method, with the proposed universal layer module, on uni-modal datasets. X-CLIP and UniFormerV2 are designed for general video action recognition and thus need to follow the standard protocol. However, for pre-training approaches that utilize much more extra video data (e.g., CoCa, InternVideo, Merlot-Reserve), we cannot ensure a fair comparison with these methods. We will cite UniFormerV2 and add a clarification about Kinetics-710 in the updated version.