Zhongdao / Towards-Realtime-MOT

Joint Detection and Embedding for fast multi-object tracking
MIT License

Why use different functions to build the target #86

Closed kuanzi closed 4 years ago

kuanzi commented 4 years ago

Hello, thanks for your work and for open-sourcing it. I have a question:

Why do you use different functions to build the targets? Regular training uses build_target_thres, but the embedding test uses build_target_max. Reading the code, I found that:

  1. build_target_thres will choose the max_iou gt for every anchor
  2. build_target_max will choose the max_iou anchor for every gt

I then checked other YOLOv3 implementations and found that they keep the max_iou anchor for every gt in both training and inference. Could you please explain why you coded it this way?
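To make the difference between the two strategies concrete, here is a minimal sketch on a hand-written IoU matrix. The function names `assign_per_anchor` and `assign_per_gt` are hypothetical and are not the repository's code, and the 0.5 threshold is an assumption; the sketch only mirrors the two behaviours described in the list above.

```python
def assign_per_anchor(iou, thresh=0.5):
    """For every anchor, pick the gt with the highest IoU.

    iou[a][g] is the IoU between anchor a and gt g.
    Returns one gt index per anchor, or -1 if the best IoU is not above thresh.
    """
    assignment = []
    for row in iou:
        best_gt = max(range(len(row)), key=lambda g: row[g])
        assignment.append(best_gt if row[best_gt] > thresh else -1)
    return assignment


def assign_per_gt(iou):
    """For every gt, pick the single anchor with the highest IoU."""
    n_gt = len(iou[0])
    return [max(range(len(iou)), key=lambda a: iou[a][g]) for g in range(n_gt)]


# Toy IoU matrix: 4 anchors (rows) x 2 gts (columns). Values are made up.
iou = [
    [0.80, 0.10],
    [0.60, 0.20],
    [0.05, 0.40],
    [0.00, 0.00],
]

per_anchor = assign_per_anchor(iou)  # anchors 0 and 1 both map to gt 0
per_gt = assign_per_gt(iou)          # each gt gets exactly one anchor
```

In this toy case the per-anchor rule gives gt 0 two anchors and gt 1 none, while the per-gt rule gives every gt exactly one anchor regardless of how low its best IoU is.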

Zhongdao commented 4 years ago

Hi @kuanzi, to test how good the embedding is, we want exactly ONE embedding vector, the one closest to the ground-truth box center, to represent a person. If we used build_target_thres, embedding vectors at multiple locations would be assigned to a single gt box. That would produce many redundant embedding vectors, which is unnecessary and makes the retrieval test less challenging.
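As a rough illustration of "the location closest to a box center", the sketch below maps a box to the feature-map cell that contains its center. The helper name `nearest_cell`, the stride of 32, and the 10x10 grid are all assumptions chosen for the example, not values taken from the repository.

```python
def nearest_cell(box, stride, grid_w, grid_h):
    """Return (col, row) of the feature-map cell containing the box center.

    box is (x1, y1, x2, y2) in input-image pixels; stride is the number of
    pixels covered by one feature-map cell along each axis.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Clamp so a center on the right or bottom border stays inside the grid.
    col = min(int(cx // stride), grid_w - 1)
    row = min(int(cy // stride), grid_h - 1)
    return col, row


# A pedestrian-shaped box whose center is at (120, 100).
cell = nearest_cell((100, 40, 140, 160), stride=32, grid_w=10, grid_h=10)
```

With one cell per gt box, each person contributes a single embedding vector to the retrieval test.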

kuanzi commented 4 years ago

Hi @Zhongdao, thanks for the explanation; it helped me understand why you use this method for the test. But I have further questions. During training, we would hope that each gt is responsible for one anchor, and that this anchor is trained to be close to the gt (that is, something like build_target_max should be used). If build_target_thres is used during training (each anchor looks for one gt), two situations can arise:

  1. In occluded or dense scenes, an occluded gt A (occluded by other gts) may be discarded because its IoUs with the anchors are not large enough. As a result, some gts have no corresponding anchor.
  2. On the other hand, build_target_thres may let one gt correspond to multiple anchors, which seems unreasonable.

So I wonder if you should also use build_target_max during training?
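Both situations in the list above can be reproduced on a toy example. The sketch below is only an illustration under assumed values: the boxes, the 0.5 threshold, and the helper names `box_iou` and `count_anchors_per_gt` are hypothetical and do not come from the repository.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def count_anchors_per_gt(anchors, gts, thresh=0.5):
    """Per-anchor assignment with a threshold, then count anchors per gt."""
    counts = [0] * len(gts)
    for anchor in anchors:
        ious = [box_iou(anchor, gt) for gt in gts]
        best = max(range(len(gts)), key=lambda g: ious[g])
        if ious[best] > thresh:
            counts[best] += 1
    return counts


# gt 0 is a full-size box; gt 1 is a narrow box partly covered by gt 0.
gts = [(0, 0, 10, 10), (8, 0, 12, 10)]
anchors = [(0, 0, 10, 10), (1, 0, 11, 10), (6, 0, 16, 10)]

counts = count_anchors_per_gt(anchors, gts)
```

Here gt 0 receives two anchors (situation 2), while the narrow gt 1 receives none because its best IoU with any anchor is 0.4 (situation 1).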

Zhongdao commented 4 years ago

Regarding the first issue: since we use build_target_thres during training, multiple anchors may be assigned to the same gt, so it rarely happens that a gt has no corresponding anchor. The actual problem is rather that some assignments are ambiguous: an anchor may have equal overlap with two gts (IOU>0.5) but is assigned to only one of them. This issue has not been solved yet.

Regarding the second question: we find that assigning multiple anchors to one gt largely improves the recall of the detection branch. This effect may not be obvious in generic object detection, but we find it very important in the pedestrian scenario. This is the main reason why we use build_target_thres in the detection branch.

As for the embedding branch, we find that build_target_thres and build_target_max lead to similar performance. We guess that build_target_thres introduces more training vectors for the embedding branch, and that the resulting gain offsets the negative effect of inaccurate anchor assignments.
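The ambiguous case mentioned above (one anchor overlapping two gts equally, both above the threshold) can be detected with a short check. This is a hedged sketch, not the repository's code: the function name `ambiguous_anchors`, the tolerance `tol`, and the IoU values are assumptions made for illustration.

```python
def ambiguous_anchors(iou, thresh=0.5, tol=1e-6):
    """Find anchors whose best IoU is above thresh and tied between gts.

    iou[a][g] is the IoU between anchor a and gt g.
    Returns a list of (anchor_index, [tied gt indices]).
    """
    flagged = []
    for a, row in enumerate(iou):
        best = max(row)
        if best <= thresh:
            continue  # anchor would not be assigned to any gt
        ties = [g for g, v in enumerate(row) if best - v <= tol]
        if len(ties) > 1:
            flagged.append((a, ties))
    return flagged


# Toy IoU matrix: 3 anchors (rows) x 2 gts (columns). Values are made up.
iou = [
    [0.70, 0.20],  # clear match to gt 0
    [0.55, 0.55],  # tie above the threshold: ambiguous
    [0.30, 0.30],  # tie below the threshold: ignored
]

flagged = ambiguous_anchors(iou)
```

A plain argmax would silently give anchor 1 to whichever gt comes first, which is the kind of arbitrary assignment described above.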

Zhongdao commented 4 years ago

These are very good questions, thank you for pointing them out so that I have a chance to explain!

kuanzi commented 4 years ago

Thanks very much for your patience. This problem had troubled me for a long time, and it only became clear with your explanation. Thank you again!