Thanks for your great work, @HuangOwen!
I have a few quick questions.
Why do you use sp-pose features (the sp-pose stream) in network P, but only sp features (the sp stream) in network C? Doesn't the pose map benefit C?
You probably know some object detection methods, e.g., Faster R-CNN (two-stage, R-CNN based) and SSD (one-stage, without a region proposal stage). Can I say that, in principle, TIN is to iCAN roughly what Faster R-CNN is to SSD, even though they address different tasks?
The small difference is that in Faster R-CNN, the binary score from the RPN's binary classification does not enter the multi-class classification score of the final detection, while in TIN the final HOI score of a pair is obtained as Sc * Sp.
Hi @BestSongEver, thanks for your interest in our work.
As shown in Fig. 4 of our paper, it is intuitive that the pose map helps a lot when deciding whether a <human, object> pair is interactive or not. So using sp-pose features explicitly improves the training of P. We believe pose information is also beneficial for directly classifying HOIs, but that benefit is implicit. In fact, we ran some experiments showing that encoding poses in the training of C has almost no influence on the mAP.
It's an interesting analogy, but I think we also propose several new ideas for HOI detection, such as NIS, LIS, and pose map encoding, which are not used by iCAN. What's more, both iCAN and TIN generate <human, object> pairs by exhaustive pairing; what we do is train a transferable "filter" across datasets to filter out non-interactive candidate pairs. This is completely different from region proposal in one-stage/two-stage object detection.
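To make the difference from region proposals concrete, here is a minimal sketch of the flow discussed in this thread: exhaustive pairing, filtering by the interactiveness score, then the final score Sc * Sp. It is only an illustration, not the repository's code. The function name `score_pairs`, the dictionary inputs, the single scalar Sc per pair, and the `nis_threshold` value are all assumptions made for the example.

```python
from itertools import product


def score_pairs(humans, objects, interactiveness, hoi_scores, nis_threshold=0.1):
    """Hypothetical helper: score every <human, object> pair.

    interactiveness maps (human, object) -> Sp, the binary interactiveness score.
    hoi_scores maps (human, object) -> Sc, simplified here to one scalar per pair.
    nis_threshold is an assumed cut-off, not a value taken from the paper.
    """
    results = {}
    # Exhaustive pairing: every human is paired with every object.
    for h, o in product(humans, objects):
        s_p = interactiveness[(h, o)]
        # Filter step: drop pairs judged non-interactive.
        if s_p < nis_threshold:
            continue
        # Unlike the RPN score in Faster R-CNN, Sp also enters the final score.
        results[(h, o)] = hoi_scores[(h, o)] * s_p
    return results


humans = ["h1"]
objects = ["o1", "o2"]
interactiveness = {("h1", "o1"): 0.9, ("h1", "o2"): 0.05}
hoi_scores = {("h1", "o1"): 0.5, ("h1", "o2"): 0.8}
final = score_pairs(humans, objects, interactiveness, hoi_scores)
```

In this toy input, the pair ("h1", "o2") has a high Sc but is removed because its Sp falls below the assumed threshold, so only ("h1", "o1") keeps a final score.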