Thank you for your amazing work! Nevertheless, I still have some questions about your motivation.
As you mention in the abstract of your paper, actions generated by existing methods may depend heavily on the co-occurrence of objects, e.g. ‘driving’ is predicted with high confidence whenever both a man and a car are detected.
I was wondering how you noticed this phenomenon. Did you reach this conclusion by computing statistics on the MSRVTT or MSVD dataset? If so, how did you compute them? Looking forward to your reply!