DirtyHarryLYL / Transferable-Interactiveness-Network

Code for Transferable Interactiveness Knowledge for Human-Object Interaction Detection. (CVPR'19, TPAMI'21)
MIT License

What does "HO_weight" and "binary_weight" mean? #36

Closed yeliudev closed 4 years ago

yeliudev commented 4 years ago

Hi @DirtyHarryLYL ! Thanks a lot for your great work!

I noticed that in lib/networks/TIN_HICO.py, you've added two extra weights, self.HO_weight and self.binary_weight, to the classification scores from both the HOI and binary classifiers, which differs from the iCAN code. May I ask why you multiply these weights with the raw classification scores, and how are the weights generated?

Thanks!

DirtyHarryLYL commented 4 years ago

The two sets of weights handle classification under a long-tail data distribution. The more training samples a class has, the smaller its loss weight should be (it has more chances to improve, so each update can be small). Meanwhile, a rare HOI class with fewer samples needs a larger weight. We use k/(n^i/N) to decide the weights, where N is the total sample number, n^i is the sample number of class i (i.e., the frequency of occurrence of this HOI), and k is chosen empirically. You could also try k*f(1/(n^i/N)), where f() is a non-linear function; we chose lg() because it makes the weight curve smoother. If you want a more off-the-shelf option, focal loss (Lin et al.) is a good one to try, but it still needs hyper-parameter tuning.
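The log-smoothed variant above can be sketched in a few lines; this is a minimal illustration, not the repo's exact code, and the sample counts and k are hypothetical:

```python
import numpy as np

def long_tail_weights(sample_counts, k=1.0):
    """Weight each class inversely to its log-smoothed frequency:
    w_i = k * lg(1 / (n_i / N)) = k * log10(N / n_i)."""
    n = np.asarray(sample_counts, dtype=np.float64)
    N = n.sum()
    return k * np.log10(N / n)  # rarer classes get larger weights

counts = [5000, 500, 50, 5]          # hypothetical per-class sample numbers
weights = long_tail_weights(counts)  # increases as the class gets rarer
```

Because the smoothing is logarithmic, a class that is 10x rarer gets a weight larger by k, rather than 10x larger as the raw k/(n^i/N) would give.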

yeliudev commented 4 years ago

Thank you for your reply.

It seems that these weights handle the class-wise imbalance, but focal loss is designed for heavy hard/easy imbalance or pos/neg imbalance; how can it be used for this problem?

I've tried applying focal loss and GHM loss to train the binary classifier, but the results are almost the same as training with binary cross-entropy loss with balanced sampling.

DirtyHarryLYL commented 4 years ago

Yep, the performances of the various loss tricks are comparable on HICO-DET; in our experiments the log loss weight performs best for HOI classification. For the (many) extremely rare classes, all these tricks contribute very little.

DirtyHarryLYL commented 4 years ago

BTW, each HOI uses a sigmoid for binary classification because of the multi-label setting (i.e., one person can perform multiple actions simultaneously). Thus we compute the sum of the 600 sigmoid cross entropies as the HOI loss.
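That loss can be sketched as follows; this is an illustrative numpy version (the repo uses TensorFlow), using the numerically stable formulation of sigmoid cross entropy:

```python
import numpy as np

def hoi_loss(logits, labels):
    """logits, labels: (batch, 600); labels are multi-hot {0, 1}.
    Per-class sigmoid cross entropy, summed over the 600 HOI classes,
    averaged over the batch. Uses the stable form
    max(x, 0) - x*z + log(1 + exp(-|x|))."""
    per_class = (np.maximum(logits, 0) - logits * labels
                 + np.log1p(np.exp(-np.abs(logits))))
    return per_class.sum(axis=1).mean()
```

Summing (rather than averaging) over classes keeps each class's gradient independent of how many classes exist, which matches treating every HOI as its own binary problem.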

yeliudev commented 4 years ago

Thanks, your answer helps me a lot.

yeliudev commented 4 years ago

Hi @DirtyHarryLYL , I'm still confused about the imbalance of pos and neg samples for each class.

Since you use an image-centric sampling strategy, all the candidate box pairs in each training batch come from the same image, and you update the whole model using a sigmoid cross-entropy loss. However, it may happen that, e.g., for the first 10000 images in one epoch, there is not a single positive sample for HOI class n (0<n<599), so the model would only learn from negative samples of class n and always predict 0 for this class, which may cause the classifier to die. How did you deal with this imbalance problem?

Besides, I noticed that the model predicts a 1x2 vector to represent each binary label; why not just use a single 0 or 1, since one number is enough to represent the probability of "interactiveness"?

DirtyHarryLYL commented 4 years ago

In each mini-batch (i.e., samples from one image), we input a fixed number of pos and neg human-object pairs into the model (e.g., 15 pos pairs and 60 neg pairs). Pos and neg are defined per class, i.e., a pair can be positive for HOI i but negative for HOI j. Thus, by inputting curated samples, we keep the pos:neg ratio fixed during training.
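The curated sampling above can be sketched like this; a minimal illustration with hypothetical pair lists, using the 15/60 split from the example:

```python
import random

def sample_pairs(pos_pairs, neg_pairs, n_pos=15, n_neg=60):
    """Draw a fixed number of pos and neg human-object pairs from one
    image, so every mini-batch keeps the same pos:neg ratio."""
    pos = random.sample(pos_pairs, min(n_pos, len(pos_pairs)))
    neg = random.sample(neg_pairs, min(n_neg, len(neg_pairs)))
    return pos + neg
```

When an image has fewer candidates than the quota, the min() simply takes everything available; the real pipeline may instead pad or re-sample, which this sketch omits.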

We have tried both a single scalar in [0, 1] and a 1x2 vector (two probabilities, for pos and neg). The results are comparable. In the initial version we chose the 1x2 vector for convenience of analysis and did not change it in later versions. You could also try other output formats in your experiments.
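The comparable results are unsurprising, since the two formats are mathematically interchangeable: a 2-way softmax over logits (z0, z1) equals a sigmoid on the logit difference. A small numpy sketch (function names are mine):

```python
import numpy as np

def softmax2(z0, z1):
    """P(interactive) from a 1x2 logit vector (z0=neg, z1=pos)."""
    m = max(z0, z1)                  # subtract max for numerical stability
    e0, e1 = np.exp(z0 - m), np.exp(z1 - m)
    return e1 / (e0 + e1)

def sigmoid(x):
    """P(interactive) from a single logit."""
    return 1.0 / (1.0 + np.exp(-x))
```

So only the parameterization differs; the representable probabilities are identical.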

yeliudev commented 4 years ago

Thank you.

DirtyHarryLYL commented 4 years ago

No problem~

xxxzhi commented 4 years ago

Could you provide the detailed formula for the weights?

I found that k*f(1/(n^i/N)) cannot reproduce the weights in the file for HICO. For example, labels 600 and 597 have the same frequency (2) in the HICO training set.

But the weights in your code are different: 9.609821 and 13.670264

DirtyHarryLYL commented 4 years ago

The formula is just the simple frequency as a probability, i.e., k*lg(1/frequency). I think the discrepancy comes from the sample number: pos number = gt pair number (following the iCAN policy)

The neg number has two parts:

  1. the number of gt human boxes (iou>0.5) * the number of gt obj boxes (iou>0.5), minus the number of gt pairs, i.e., the wrong pairings;
  2. the number of pairs composed of inaccurate human and obj boxes (iou < 0.5).

To my knowledge, HOIs 600 and 597 have the same gt pair numbers but different neg sample numbers, so the resulting weights are different.
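Putting the bookkeeping above together, the per-class sample count behind the frequency might look like this; a sketch only, with hypothetical function and argument names:

```python
def class_sample_count(n_gt_pairs, n_gt_humans, n_gt_objs, n_bad_pairs):
    """Total sample number for one HOI class:
    pos = gt pairs (iCAN policy);
    neg part 1 = all human x obj combinations of well-localized gt boxes
                 (iou > 0.5) minus the true pairings, i.e. wrong pairings;
    neg part 2 = pairs built from poorly localized boxes (iou < 0.5)."""
    pos = n_gt_pairs
    neg_wrong_pairing = n_gt_humans * n_gt_objs - n_gt_pairs
    neg_bad_boxes = n_bad_pairs
    return pos + neg_wrong_pairing + neg_bad_boxes
```

This is why two classes with identical gt pair counts can still end up with different weights: their neg counts, which depend on the images they appear in, generally differ.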

xxxzhi commented 4 years ago

Thanks for your reply. Yeah, I made a mistake. These weights affect the performance significantly.

DirtyHarryLYL commented 4 years ago

No problem~ Yeah, long-tail data distribution learning is still an open question; the studies on losses, data sampling, and latent space learning are very interesting.