NirAharon / BoT-SORT

BoT-SORT: Robust Associations Multi-Pedestrian Tracking
MIT License

Understanding re-identification in BoT-SORT in multi-object tracking #113

Open Mahesha999 opened 3 months ago

Mahesha999 commented 3 months ago

I was trying to use BoT-SORT with ReID on a simple video in which a single person is walking along a road; he first gets occluded by a small tree and then by a billboard. This is drone footage, though not from a very high altitude.

I am using YOLOX as the detection model, with weights from bytetrack_x_mot17.pth.tar, and the ReID model mot17_sbs_S50.pth. These are the defaults used by the paper and the codebase.

BoT-SORT was able to correctly recognise the same person when he emerged from behind the tree. However, when he emerges from behind the billboard, he gets a new ID assigned. I tried increasing track_buffer, proximity_threshold, appearance_threshold, as well as match_threshold, but no luck.

So, I tried to debug the code. Here are my observations: for long occlusions (like the billboard), the IoU similarity inside matching.iou_distance() is [0] (a single zero for a single-person detection). This makes ious_dists = [1] (line 6 in the code excerpt below, from the official BoT-SORT repo). For long occlusions, the appearance similarity also turns out to be [0], making emb_dists = [1] (line 13). This makes the overall dists = [1]. This dists is then passed to the matching function on line 30. Since I had set match_thresh to 0.6, which is less than 1, it did not match/associate any existing tracklet with the detection bounding box for the person emerging from behind the billboard, thus assigning him a new ID.

1    class BoTSORT():
2        def update(self, output_results, img):
3            # ...
4            
5            # Associate with high score detection boxes
6            ious_dists = matching.iou_distance(strack_pool, detections)  # this is all 1s for long occlusions
7            ious_dists_mask = (ious_dists > self.proximity_thresh)
8    
9            if not self.args.mot20:
10                ious_dists = matching.fuse_score(ious_dists, detections)
11    
12            if self.args.with_reid:
13                emb_dists = matching.embedding_distance(strack_pool, detections) / 2.0 # this is all 1s for long occlusions
14                raw_emb_dists = emb_dists.copy()
15                emb_dists[emb_dists > self.appearance_thresh] = 1.0 
16                emb_dists[ious_dists_mask] = 1.0 
17                dists = np.minimum(ious_dists, emb_dists) 
18    
19                # Popular ReID method (JDE / FairMOT)
20                # raw_emb_dists = matching.embedding_distance(strack_pool, detections)
21                # dists = matching.fuse_motion(self.kalman_filter, raw_emb_dists, strack_pool, detections)
22                # emb_dists = dists
23    
24                # IoU making ReID
25                # dists = matching.embedding_distance(strack_pool, detections)
26                # dists[ious_dists_mask] = 1.0
27            else:
28                dists = ious_dists
29    
30            matches, u_track, u_detection = matching.linear_assignment(dists, thresh=self.args.match_thresh)

So I increased match_thresh to 1.1 and it started working. However, this is just a hack, since the threshold is meant to range between 0 and 1, and setting it to anything greater than 1 effectively means: if all dists are 1s, match existing tracks with anything that appears in the scene. If a new person appears in the scene before the occluded person comes out from behind the billboard, that new person gets assigned the older person's ID!
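To confirm what a threshold above 1 implies, I mimicked the assignment step with a small scipy stand-in (the repo's matching.linear_assignment uses lap.lapjv with a cost_limit, if I read it correctly; this helper is only an illustrative sketch, not the repo's code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign(dists, thresh):
    """Toy stand-in for matching.linear_assignment:
    solve the assignment, then reject pairs whose cost reaches thresh."""
    rows, cols = linear_sum_assignment(dists)
    return [(r, c) for r, c in zip(rows, cols) if dists[r, c] < thresh]

dists = np.array([[1.0]])          # lost track vs. re-appearing person

print(assign(dists, thresh=0.6))   # [] -> no match, a new ID gets assigned
print(assign(dists, thresh=1.1))   # [(0, 0)] -> matches, but a cost of 1
                                   # would match ANY detection, including a
                                   # brand-new person entering the scene
```

So the threshold hack does not distinguish the returning person from anyone else with cost 1, which is exactly the ID switch I describe below.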

I observed the same thing when there are multiple people in the scene occluded by some object. If a new person enters the scene before any of the occluded people re-emerge, that person gets assigned an occluded person's ID.

I have the following questions:

Q1. Why is ious_dists = [1]? Is it because there is no overlap between the bounding boxes from before and after the occlusion?
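To sanity-check this, I computed the IoU by hand for a track box and a detection box on opposite sides of the billboard (toy coordinates; the helper is mine, not the repo's):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format; toy helper."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Track's last (Kalman-predicted) box vs. detection past the billboard:
last_box = (100, 100, 140, 200)
new_box  = (400, 100, 440, 200)      # no overlap after the long occlusion

print(1.0 - iou(last_box, new_box))  # 1.0 -> iou_distance is exactly 1
```

Any pair of disjoint boxes gives IoU 0, so the distance saturates at 1 no matter how far apart they are.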

Q2. Why is emb_dists = [1]? Is it because the ReID model is unable to generate similar features for the same person before and after the occlusion?
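For context, matching.embedding_distance is (as far as I can tell) a cosine distance over the ReID features, and the excerpt halves it on line 13 to map the raw [0, 2] range into [0, 1]. A toy sketch of what a distance near 1 means (my own helper and illustrative feature vectors, not the repo's):

```python
import numpy as np

def embedding_distance(track_feats, det_feats):
    """Cosine distance between row-wise feature vectors:
    0 = same direction, 1 = orthogonal, 2 = opposite."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return 1.0 - t @ d.T

track = np.array([[1.0, 0.0]])   # smoothed feature of the lost track
same  = np.array([[0.9, 0.1]])   # similar appearance -> small distance
other = np.array([[0.0, 1.0]])   # unrelated appearance -> distance 1

print(embedding_distance(track, same))   # ~0.006
print(embedding_distance(track, other))  # 1.0
```

So if emb_dists really is [1] before any masking, the extracted features before and after the occlusion are effectively unrelated.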

Q3. If the answer to Q2 is yes, do I need to use a ReID model fine-tuned on my dataset? For reference, the BoT-SORT paper says this:

For the feature extractor, we trained FastReID’s [19] SBS-50 model for MOT17 and MOT20 with their default training strategy, for 60 epochs.

while the FastReID paper says this:

We propose a cross-domain method FastReIDMLT that adopts mixture label transport to learn pseudo label by multi-granularity strategy. We first train a model with a source-domain dataset and then finetune on the pre-trained model with pseudo labels of the target-domain dataset.

Q4. If the answer to Q3 is yes, is there any approach/model that allows us to do re-identification without fine-tuning the ReID model?

Q5. I am also doubtful about the above part of the code. If ious_dists is all 1s (line 6), ious_dists_mask becomes all True (line 7), which makes emb_dists all 1s on line 16, making dists all 1s on line 17. My understanding was that we should be relying on appearance similarity during long occlusions, but here the zero IoU similarity is nullifying the appearance similarity exactly when the occlusion is long. Isn't that wrong? Or am I missing something?
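Here is a minimal numpy reproduction of that interplay (illustrative values for one lost track and one detection): even if the appearance distance were low, line 16 overwrites it wherever the IoU gate fires.

```python
import numpy as np

proximity_thresh, appearance_thresh = 0.5, 0.25

# Mirroring lines 6-17 of the excerpt with made-up values:
ious_dists = np.array([[1.0]])                    # line 6: IoU overlap is 0
ious_dists_mask = ious_dists > proximity_thresh   # line 7: [[True]]

emb_dists = np.array([[0.1]])                     # suppose appearance DID match
emb_dists[emb_dists > appearance_thresh] = 1.0    # line 15: unchanged here
emb_dists[ious_dists_mask] = 1.0                  # line 16: IoU gate wipes it out
dists = np.minimum(ious_dists, emb_dists)         # line 17

print(dists)  # [[1.]] -> no association possible at any match_thresh <= 1
```

So a perfectly good appearance match (0.1) never survives the IoU gate once the boxes stop overlapping.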

areej-alshrttan commented 1 week ago

Hi! I really appreciate your detailed observations. I’m facing the exact same issue with BoT-SORT using YOLOX and the re-id model (mot17_sbs_S50.pth). I’ve also tried adjusting parameters like track_buffer, proximity_threshold, appearance_threshold, and match_threshold without success. Have you found any effective solution to the re-id problem beyond the threshold workaround?