Questions about comparison experiment data and dataset access

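Hello, I am very interested in your work! I noticed that your paper cites the CVPR 2023 paper "Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network" (reference [65]), but the comparison experiments do not report its results. How does your method compare with theirs? Second, the accuracy reported in the VAANet paper (reference [66]) on the VE-8 and Ek-6 datasets is more than 5 points higher than the results listed in your table. Is this gap because your experimental setup differs from the one used in the VAANet paper? If so, what are the differences? Finally, I am very interested in your eMotions dataset. What permissions are required to access it? I am happy to provide any materials you require. Thank you for your patience in reading this!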

Thank you for your interest in our work.

1. We cite [65] in the paper to indicate that the visual backbone of AV-CPNet differs from the one they deploy. In addition, AV-CPNet is intended to provide benchmark results for eMotions; it is not designed for absolute performance comparisons with other SOTA methods. We will provide comparison results with [65] in the future.
2. The implementation details are given in the appendix. Following Video Swin-T [1], we use AdamW as the optimizer with weight decay=0.2, which differs from the optimization strategy described in VAANet [2]. In the appendix, we also apply the optimization strategy from VAANet [2] to compare the performance of the proposed AV-CPNet with that of VAANet [2].
3. eMotions will be released once the final review is complete and the relevant acquisition rules have been formulated.

References:

[1] Liu Z, Ning J, Cao Y, et al. Video Swin Transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 3202-3211.

[2] Zhao S, Ma Y, Gu Y, et al. An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(01): 303-311.

[65] Zhang Z, Wang L, Yang J. Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 18888-18897.

XuecWu / eMotions
