MCG-NJU / JoMoLD

[ECCV 2022] Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

Questions about eval function and calculate_noise_ratio function #1

Closed · Huntersxsx closed this 1 year ago

Huntersxsx commented 1 year ago

Thanks for your great work! Your code is easy to understand and follow; however, I am confused about a few implementation details:

Huntersxsx commented 1 year ago

Another difference is the contrastive loss. In the original CVPR 2021 MA code, the aggregated feature `x1` is encouraged to be close to the low-level visual feature `x_visual` taken before HAN: `x_visual = x2`, then `x1, x2 = self.hat_encoder(x1, x2)`, `xx1 = F.normalize(x_visual, p=2, dim=-1)`, `xx2 = F.normalize(x1, p=2, dim=-1)`. In your code, however, the features after HAN are contrasted: `x1, x2 = self.hat_encoder(x1, x2, with_ca=with_ca)`, `xx2_after = F.normalize(x2, p=2, dim=-1)`, `xx1_after = F.normalize(x1, p=2, dim=-1)`. Is this a trick or a small error?
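
For clarity, here is a minimal side-by-side sketch of the two pairings; the HAN module is not reproduced, so `hat_encoder` is passed in as a stand-in argument, and the variable names simply follow the snippets above:

```python
import torch
import torch.nn.functional as F


def pairs_before_han(x1, x2, hat_encoder):
    # MA (CVPR 2021) pairing: keep the low-level visual feature taken
    # *before* HAN and contrast it with the aggregated audio feature x1
    # produced *after* HAN.
    x_visual = x2
    x1, x2 = hat_encoder(x1, x2)
    xx1 = F.normalize(x_visual, p=2, dim=-1)
    xx2 = F.normalize(x1, p=2, dim=-1)
    return xx1, xx2


def pairs_after_han(x1, x2, hat_encoder, with_ca=True):
    # Pairing as quoted above from this repository: contrast the audio and
    # visual features *after* HAN.
    x1, x2 = hat_encoder(x1, x2, with_ca=with_ca)
    xx1_after = F.normalize(x1, p=2, dim=-1)
    xx2_after = F.normalize(x2, p=2, dim=-1)
    return xx1_after, xx2_after


# Toy check with an identity "encoder", just to show the call pattern.
if __name__ == "__main__":
    identity = lambda a, v, **kw: (a, v)
    a, v = torch.randn(2, 10, 512), torch.randn(2, 10, 512)
    print(pairs_before_han(a, v, identity)[0].shape)   # torch.Size([2, 10, 512])
    print(pairs_after_han(a, v, identity)[0].shape)    # torch.Size([2, 10, 512])
```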

CarolineCheng233 commented 1 year ago

Thanks for your attention.

1) We utilize modality-level predictions to filter out the respective frame-level false-positive events, as we think video-level outputs might depend heavily on one modality while ignoring the other (see the sketch after this list).

2) Using `a = a * Pa` yields results similar to directly using `a`. For negative samples it does not change the results, and for positive samples it only slightly moderates the predictions, so it has little effect on the final results.

3) Sure, you can calculate the number of positive samples for each category after line 148.

4) In fact, we tried to close the distance between the features `x1` and `x_visual`, and, if we remember correctly, it did not yield results comparable to those claimed in MA. However, contrasting the features after HAN achieves results similar to the original paper, so we modified the corresponding code. In any case, our method can be combined with CL to achieve further improvement.
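
To make points 1) and 2) concrete, here is a minimal illustrative sketch; the tensor names, shapes, and threshold are assumptions for illustration, not the repository's actual eval code:

```python
import torch


def moderate_and_filter(frame_prob, video_prob, thr=0.5):
    """Illustrative sketch only, not the repository's eval code.

    frame_prob: (T, C) frame-level event probabilities for one modality
    video_prob: (C,)   video-level probabilities for the same modality
    """
    # Point 2): "a = a * Pa" moderates frame-level scores by the
    # video-level probability; for events rejected at the video level this
    # does not change the outcome, and for accepted events it only softens
    # the scores slightly.
    moderated = frame_prob * video_prob
    # Point 1): filter with the modality-level prediction rather than the
    # fused video-level output, so an event absent in this modality is
    # removed here even if the other modality pushes the fused score up.
    keep = (video_prob >= thr).float()
    return moderated * keep


# Toy usage: 4 frames, 3 event categories.
frame_prob = torch.rand(4, 3)
video_prob = torch.tensor([0.9, 0.2, 0.7])
print(moderate_and_filter(frame_prob, video_prob))
```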

Huntersxsx commented 1 year ago


Thanks for your reply, I have just conducted experiments on the above points:

1) Utilizing modality-level predictions indeed gives slightly better performance.
2) I got a small performance improvement when discarding `a = a * Pa`.
4) I got similar results using `x_visual` or `x2` with your code.

I only ran each experiment once, so there may be some randomness in the results. Thanks again for your explanation, and I look forward to your future work!