JialianW / TraDeS

Track to Detect and Segment: An Online Multi-Object Tracker (CVPR 2021)

Feature aggregation #29

Closed · Leo63963 closed this 3 years ago

Leo63963 commented 3 years ago

Hi

Just one question about the features used here. `self.inference_prehm` stores the previous heatmaps that are used for feature aggregation in the proposed MFW. The relevant code is as follows:

https://github.com/JialianW/TraDeS/blob/3eafd249ca0f18af8000d5798d4c552a0bd627ec/src/lib/detector.py#L351

https://github.com/JialianW/TraDeS/blob/3eafd249ca0f18af8000d5798d4c552a0bd627ec/src/lib/detector.py#L354

It looks like the same heatmaps from the previous frame (t-1) are used twice, and I have confirmed this by debugging.

I would have expected the TraDeS pipeline to aggregate the features and corresponding heatmaps from (t-1) and (t-2), rather than a repeated (t-1). However, that does not seem to be the case.

Just kindly asking: is there some hidden trick here? Many thanks.

JialianW commented 3 years ago

This only happens at the beginning of a video, where there are not enough previous frames to be used for propagation. For example, when you are at the second frame and want to aggregate three frames, you have to repeat the first frame. `self.inference_prehm` is a cache: once it is filled with `clip_len` frames' worth of heatmaps, it no longer does the repeat.
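
For illustration, here is a minimal sketch of that caching behavior (hypothetical names, not the exact repo code), assuming `clip_len = 3` so that two previous heatmaps are kept:

```python
clip_len = 3              # aggregate the current frame plus two previous ones
inference_prehm = []      # cache of previous-frame heatmaps

def update_prehm_cache(hm):
    if len(inference_prehm) < clip_len - 1:
        # Start of a video: repeat the same heatmap until the cache offers
        # clip_len - 1 "previous" frames, so at the second frame the first
        # frame's heatmap appears twice.
        while len(inference_prehm) < clip_len - 1:
            inference_prehm.append(hm)
    else:
        # Cache is full: drop the oldest entry and append the newest once,
        # so (t-1) and (t-2) are genuinely different frames from then on.
        inference_prehm.pop(0)
        inference_prehm.append(hm)
```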

Leo63963 commented 3 years ago

Thanks for the reply. Yes, I am aware of that, and it is indeed the case at the start of a video. However, by debugging I found that the features in `self.inference_prehm` are always the same, i.e. `self.inference_prehm[0] == self.inference_prehm[1]` holds not just for the first few frames but all the time, for reasons I don't understand. Could you re-check that? Thanks
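
For reference, a small hypothetical helper along these lines, called with `self.inference_prehm` right after the cache update, is how I checked:

```python
import torch

def check_cache(inference_prehm):
    # Debug check: once the video is past its first clip_len - 1 frames,
    # the cached heatmaps for (t-1) and (t-2) should differ.
    if len(inference_prehm) > 1:
        print('identical:', torch.equal(inference_prehm[0], inference_prehm[1]))
```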

JialianW commented 3 years ago

If `len(self.inference_prehm) == (self.opt.clip_len - 1)`, it shouldn't get into the loop and should only append the `hm` once.

Leo63963 commented 3 years ago

Thanks for the reply. Just one follow-up, please, on the loss function below:

https://github.com/JialianW/TraDeS/blob/3eafd249ca0f18af8000d5798d4c552a0bd627ec/src/lib/model/losses.py#L115

I could not work out why (1 - target) is used here. I have been working with your code recently and really admire it. Thanks.

JialianW commented 3 years ago

This mainly lets the pixels near the target contribute less to the softmax computation, so that they are penalized less. The reason is that it is too harsh to ask the network to categorize two adjacent pixels into two different groups: the pixels near the target in the previous frame could also be part of the target, so we may not want to regard them as other objects or background. I haven't tested the code without this; the implementation is intuitive, and I am not sure whether it would be worse without it.
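
To make the effect concrete, here is a minimal, hypothetical sketch (not the exact loss in losses.py) of a softmax matching loss in which multiplying by (1 - target) shrinks the denominator contribution of pixels near the Gaussian peak:

```python
import torch

def weighted_softmax_match_loss(logits, target, eps=1e-6):
    # logits: (N, HW) matching scores for N objects over HW pixels.
    # target: (N, HW) Gaussian heatmaps, 1.0 at each object's matched pixel
    #         and decaying smoothly on its neighbors.
    exp = logits.exp()
    pos = (target == 1.0).float()            # the exact matched pixel
    # (1 - target) is ~0 right next to the peak and ~1 far away, so
    # neighboring pixels barely enter the softmax denominator (and are
    # penalized less), while distant pixels compete at full weight.
    denom = (exp * pos).sum(-1) + (exp * (1.0 - target)).sum(-1) + eps
    prob = (exp * pos).sum(-1) / denom
    return -(prob + eps).log().mean()
```

Without the (1 - target) factor, the pixels right next to the peak would enter the denominator at full weight, and minimizing the loss would push their scores down even though they may belong to the same object.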

Leo63963 commented 3 years ago

Actually, I still don't quite get it, but thanks anyway; the work is really great.