Another difference is the contrastive loss. In the original CVPR2021-MA, they encourage the aggregated feature `x1` to be close to the low-level visual feature `x_visual`, which is taken before HAN:
```python
x_visual = x2                               # keep the visual feature before HAN
x1, x2 = self.hat_encoder(x1, x2)
xx1 = F.normalize(x_visual, p=2, dim=-1)    # pre-HAN visual feature
xx2 = F.normalize(x1, p=2, dim=-1)          # aggregated feature after HAN
```
while you contrast the features after HAN:
```python
x1, x2 = self.hat_encoder(x1, x2, with_ca=with_ca)
xx2_after = F.normalize(x2, p=2, dim=-1)    # both features taken after HAN
xx1_after = F.normalize(x1, p=2, dim=-1)
```
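For reference, here is a minimal sketch of the kind of contrastive objective such normalized features would feed; the InfoNCE-style form is an assumption on my part, since only the normalization lines are shown above and the exact loss in MA or in this repo may differ:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.2):
    # z_a, z_b: [N, D] L2-normalized features (e.g., xx1/xx2 above,
    # flattened over batch and time). Illustrative assumption only,
    # not necessarily the exact loss used in MA or in this repo.
    logits = torch.matmul(z_a, z_b.t()) / temperature   # cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)             # aligned pairs are positives
```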
Is this a trick or a small error?
Thanks for your attention.
1) We utilize modality-level predictions to filter out the respective frame-level false-positive events because we think video-level outputs may depend heavily on one modality while ignoring the other.
2) Using `a = a * Pa` yields results similar to directly using `a`. For negative samples it does not change the results, and for positive samples it slightly moderates the predictions and has little effect on the final results (see the numeric sketch after this list).
3) Sure, you can calculate the number of positive samples for each category after line 148.
4) In fact, we tried closing the distance between the features `x1` and `x_visual`, and, if we remember correctly, it did not yield results comparable to those claimed in MA. However, contrasting the features after HAN does achieve results similar to the original paper, so we modified the corresponding code. In any case, our method can be combined with CL to achieve further improvement.
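A tiny numeric illustration of point 2, with all values made up: multiplying the frame-level probabilities by the weak predictions barely changes negatives and only mildly shrinks positives.

```python
import numpy as np

# Made-up values: `a` = frame-level audio probabilities,
# `Pa` = the corresponding weak predictions used as a soft gate.
a  = np.array([0.92, 0.85, 0.06, 0.03])
Pa = np.array([0.90, 0.80, 0.10, 0.05])
print(a * Pa)  # [0.828 0.68 0.006 0.0015]: negatives stay near zero,
               # positives shrink slightly, so thresholds barely move
```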
Thanks for your attention.
Thanks for your reply, I have just run experiments on the doubts above:
1) Utilizing modality-level predictions indeed performs slightly better.
2) I got a small performance improvement when discarding `a = a * Pa`.
4) I got similar results using `x_visual` or `x2` with your code.
I ran each experiment only once, so there may be some randomness. Thanks again for your explanation, and I look forward to your future work!
Thanks for your great work! Your code is easy to understand and follow; however, I am confused about some detailed implementations:
1. I notice that your eval function is slightly different from that in ECCV2020-AVVP; they use `output` as the predicted weak labels to filter out false-positive events:
```python
o = (output.cpu().detach().numpy() >= 0.5).astype(np.int_)
Pa = (Pa >= 0.5).astype(np.int_) * np.repeat(o, repeats=10, axis=0)
Pv = (Pv >= 0.5).astype(np.int_) * np.repeat(o, repeats=10, axis=0)
```
while you use `a_prob` and `v_prob` instead:
```python
oa = (a_prob.cpu().detach().numpy() >= 0.5).astype(np.int_)
ov = (v_prob.cpu().detach().numpy() >= 0.5).astype(np.int_)
Pa = (Pa >= 0.5).astype(np.int_) * np.repeat(oa, repeats=10, axis=0)
Pv = (Pv >= 0.5).astype(np.int_) * np.repeat(ov, repeats=10, axis=0)
```
I wonder whether this change matters and why you modified the code in this way.
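To make the difference concrete, here is a toy example with made-up numbers, where the video-level output is driven by audio alone (shapes assumed: video-level probabilities `[1, C]`, snippet-level probabilities `[10, C]`):

```python
import numpy as np

# One video, one class, 10 snippets; all values are made up.
output = np.array([[0.9]])        # video-level prob: event present
v_prob = np.array([[0.1]])        # visual modality alone: event absent
Pv     = np.full((10, 1), 0.7)    # visual snippet probs (false positives)

o  = (output >= 0.5).astype(np.int_)    # video-level gate (ECCV2020-AVVP)
ov = (v_prob >= 0.5).astype(np.int_)    # modality-level gate (this repo)

print(((Pv >= 0.5).astype(np.int_) * np.repeat(o,  10, axis=0)).ravel())
# -> all ones: the video-level gate keeps the visual false positives
print(((Pv >= 0.5).astype(np.int_) * np.repeat(ov, 10, axis=0)).ravel())
# -> all zeros: the modality-level gate removes them
```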
2. In lines 144 and 145 of your main.py, you use `a_prob`, `Pa`, `v_prob`, and `Pv` to calculate the noise ratio. I wonder why you use
```python
a = a * Pa
v = v * Pv
```
instead of directly using `a` and `v`?
3. Regarding the calculation of `event_nums`, I wonder whether it is the same if I use
```python
event_nums[c] += 1
```
after line 148, i.e., `if label[b][c] != 0:`?
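For completeness, a self-contained sketch of the counting in question 3; `label`, its values, and the class count are made up, and line 148 refers to main.py:

```python
import numpy as np

# Toy weak multi-hot labels [B videos, C classes]; values are made up.
label = np.array([[1, 0, 1],
                  [0, 0, 1]])
B, C = label.shape
event_nums = np.zeros(C, dtype=np.int64)
for b in range(B):
    for c in range(C):
        if label[b][c] != 0:      # the check at line 148
            event_nums[c] += 1    # count positive samples per category
print(event_nums)  # -> [1 0 2]
```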