Some questions about your work

Thank you for your amazing work! Nevertheless, I still have some questions about your motivation.

As you mentioned in the abstract of your paper: actions generated by existing methods may depend heavily on the co-occurrence of objects, e.g. ‘driving’ is predicted with high confidence whenever both man and car are detected.

I was wondering how you noticed this phenomenon. Did you reach this conclusion by computing statistics on the MSRVTT or MSVD dataset? If so, how did you compute them? Looking forward to your reply!

SydCaption / SAAT

Some questions about your work #37