BigRedT / no_frills_hoi_det

A strong HOI Detection model without Frills!
http://tanmaygupta.info/no_frills

Question about the "masking" strategy in the paper #8

Closed yeliudev closed 4 years ago

yeliudev commented 4 years ago

Hi @BigRedT ! I'm studying your paper and still have a small question about your "masking" strategy. In your approach, you mask out all the easy negative candidates during both training and testing, and this does reduce the candidate boxes significantly. But have you tried not masking out these candidates and instead predicting scores for 600 HOI classes (instead of 117 predicates)? Is the mAP performance better or worse than the results in your paper? I think it's an interesting trade-off between whether we should fully "trust" the detector or not : )

BigRedT commented 4 years ago

Hi @goolhanrry

I think your question talks about two different aspects of our model. So let me address them separately:

  1. Masking is a mechanism for rejecting easy negatives. For example, if after NMS and score thresholding our detector does not believe an object is a car, then we trivially predict 0 probability for the HOI human-driving-car. This is ensured by our indicator terms, and the model does not have to waste capacity learning this trivial behavior for such easy negatives.
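The indicator-based masking above can be sketched in a few lines. This is an illustrative toy, not the repo's actual code; the names `masked_hoi_score` and `detected_labels` are hypothetical:

```python
# Hypothetical sketch of indicator-based masking: the HOI score is forced
# to 0 whenever the detector (after NMS and score thresholding) did not
# assign the required object label to the candidate box.

def masked_hoi_score(raw_score, object_label, detected_labels):
    """Return the HOI score gated by the detector's indicator term."""
    indicator = 1.0 if object_label in detected_labels else 0.0
    return indicator * raw_score

# The detector kept only {"person", "bicycle"} for this pair of boxes,
# so a "car"-based HOI like human-driving-car is trivially scored 0:
print(masked_hoi_score(0.8, "car", {"person", "bicycle"}))      # -> 0.0
print(masked_hoi_score(0.8, "bicycle", {"person", "bicycle"}))  # -> 0.8
```

Because the masked score is exactly 0, no model capacity is spent learning to suppress these easy negatives.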

  2. Whether we directly predict scores for 600 HOI categories, or make factorization assumptions such as modeling HOI scores as products of interaction and detector terms, is a decision independent of the decision to mask as described in 1. Factorization lets us make more efficient use of the available data. For example, assuming a uniform distribution over categories, if we have D images in the dataset, we have D/117 samples per interaction category but only D/600 samples per HOI category (i.e. per combination of object and interaction). Factorization thus exploits the compositional nature of the HOI label space when data is limited, as in the case of HICO-Det. The approach without factorization (directly predicting scores for 600 HOI categories) was used by https://arxiv.org/pdf/1702.05448.pdf
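The factorization in point 2 can be sketched as a product of detector and interaction terms. The function name and all probabilities below are illustrative assumptions, not values from the repo:

```python
# A minimal sketch of the factorization idea: the score for an HOI triplet
# (human box, object box, interaction) is a product of detector terms and
# an interaction term, rather than a direct 600-way HOI prediction.

def factored_hoi_score(p_human, p_object, p_interaction):
    """HOI triplet score under the factorization assumption."""
    return p_human * p_object * p_interaction

# Detector terms come from an off-the-shelf object detector; only the
# interaction term (117-way) must be learned from HOI annotations.
score = factored_hoi_score(p_human=0.9, p_object=0.8, p_interaction=0.5)
print(score)  # -> approximately 0.36
```

The data-efficiency argument follows from the comment above: the learned interaction head sees roughly D/117 samples per class instead of the D/600 a monolithic 600-way HOI classifier would get.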

Finally, you raised an important point about whether we should fully "trust" the detector or not. My take is that if we cannot trust a detector designed and trained solely for the purpose of detecting objects, what hope do we have of learning that behavior from a more complex task like HOI detection with even less data :)

Hope this helps!

yeliudev commented 4 years ago

Thank you! Your answer helps me a lot :)

yeliudev commented 4 years ago

Hi @BigRedT ! May I ask just one more question about the training procedure? (Sorry to bother you again; I have no CUDA devices to run the code, so I can't debug it myself.)

According to the paper, for each positive sample (with respect to HOI class h) in an image, sampling negatives at a high ratio (1000) improves performance. However, this causes a serious imbalance between the numbers of positive and negative samples (which may degrade the model). I've read the code in exp/hoi_classifier/train.py, and it seems that no extra weights are applied before computing the loss. So I'm confused about how you dealt with the imbalance problem?

Thanks!

BigRedT commented 4 years ago

Hi @goolhanrry, we did not explicitly deal with the imbalance. However, our model has indicator functions that trivially produce 0 probability for easy negatives, and hence 0 loss for these negatives (what I described as easy-negative rejection in my earlier answer). The model can therefore focus on minimizing the loss for positives and hard negatives, which are more balanced.
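The mechanism above can be sketched with a scalar binary cross-entropy. This is an illustrative toy, not the actual loss code in exp/hoi_classifier/train.py; `sample_loss` and its arguments are hypothetical names:

```python
import math

# Sketch: when the indicator term is 0, the predicted probability is
# exactly 0, so an easy negative (target 0) contributes (essentially)
# zero loss. Training is then dominated by positives and hard negatives.

def bce(prob, target, eps=1e-12):
    """Scalar binary cross-entropy with a small eps for numerical safety."""
    return -(target * math.log(prob + eps)
             + (1 - target) * math.log(1 - prob + eps))

def sample_loss(interaction_prob, indicator, target):
    """Loss for one candidate pair; the indicator masks the probability."""
    prob = indicator * interaction_prob
    return bce(prob, target)

print(sample_loss(0.7, indicator=0.0, target=0.0))  # easy negative: ~0 loss
print(sample_loss(0.7, indicator=1.0, target=0.0))  # hard negative: nonzero loss
```

No explicit reweighting is needed in this sketch: the easy negatives, which make up most of the imbalance, simply produce no gradient.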

In any case, we verified the need for a higher sampling ratio empirically in Tab. 2 of the paper.

yeliudev commented 4 years ago

Thank you!