This repo is the official implementation of *Uncertainty-aware Unsupervised Multi-Object Tracking*.
Without manually annotated identities, unsupervised multi-object trackers struggle to learn reliable feature embeddings. This makes the similarity-based inter-frame association stage error-prone as well, which gives rise to an uncertainty problem. The uncertainty accumulated frame by frame prevents trackers from learning feature embeddings that stay consistent over time. To avoid this uncertainty problem, recent methods adopt self-supervised techniques, but these fail to capture temporal relations, so the inter-frame uncertainty remains. This paper argues that although the uncertainty problem is inevitable, it is possible to leverage the uncertainty itself to improve the learned consistency in turn. Specifically, an uncertainty-based metric is developed to verify and rectify risky associations. The resulting accurate pseudo-tracklets boost the learning of feature consistency, and accurate tracklets make it possible to incorporate temporal information into spatial transformations. This paper therefore proposes a tracklet-guided augmentation strategy that simulates each tracklet's motion and adopts a hierarchical uncertainty-based sampling mechanism for hard-sample mining. The resulting unsupervised MOT framework, named U2MOT, is proven effective on the MOT-Challenge and VisDrone-MOT benchmarks, achieving state-of-the-art performance among published supervised and unsupervised trackers.
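As a rough illustration of how an uncertainty-aware association step can work, the sketch below gates the usual similarity matching and flags ambiguous pairs for verification. The function name, margin-based uncertainty proxy, and threshold are hypothetical, not the exact U2MOT implementation:

```python
# Minimal sketch of uncertainty-gated association (illustrative only).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_with_uncertainty(sim, margin_thresh=0.1):
    """sim: (num_tracks, num_dets) similarity matrix, higher = more alike."""
    rows, cols = linear_sum_assignment(-sim)  # Hungarian matching, maximize similarity
    confident, risky = [], []
    for r, c in zip(rows, cols):
        best = sim[r, c]
        others = np.delete(sim[r], c)
        second = others.max() if others.size else -np.inf
        # Small margin between best and second-best match = high association uncertainty.
        if best - second < margin_thresh:
            risky.append((r, c))       # verify/rectify before using as a pseudo-label
        else:
            confident.append((r, c))   # keep for pseudo-tracklet construction
    return confident, risky
```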
Step1. Install u2mot (verified with PyTorch=1.8.1).
git clone https://github.com/alibaba/u2mot.git
cd u2mot
python -m pip install -r requirements.txt
python setup.py develop
Step2. Install pycocotools.
python -m pip install cython
python -m pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
Step3. Other dependencies
python -m pip install cython_bbox
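As a quick sanity check of the installation (assuming the ByteTrack-style yolox package installed by setup.py develop and a CUDA build of PyTorch):

```python
# Quick environment check after Step1-Step3.
import torch
import yolox  # installed in develop mode by `python setup.py develop`

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("yolox package at:", yolox.__file__)
```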
Download MOT17, MOT20, CrowdHuman, Cityperson, ETHZ, VisDrone-MOT, and BDD100K-MOT (optional), and put them under datasets in the following structure:
datasets
   |——————MOT17/images
   |         └——————train
   |         └——————test
   └——————MOT20/images
   |         └——————train
   |         └——————test
   └——————CrowdHuman
   |         └——————images
   |         └——————annotation_train.odgt
   |         └——————annotation_val.odgt
   └——————Cityscapes
   |         └——————images
   |         └——————labels_with_ids
   └——————ETHZ
   |         └——————eth01
   |         └——————...
   |         └——————eth07
   └——————VisDrone-MOT
   |         └——————VisDrone2019-MOT-train
   |         └——————VisDrone2019-MOT-val
   |         └——————VisDrone2019-MOT-test-dev
   └——————BDD100K-MOT (optional)
             └——————images
             └——————labels
Then, you need to convert the datasets to COCO format; the results will be saved in datasets/<dataset>/annotations:
python tools/data/convert_mot17_to_coco.py
python tools/data/convert_mot20_to_coco.py
python tools/data/convert_crowdhuman_to_coco.py
python tools/data/convert_cityperson_to_coco.py
python tools/data/convert_ethz_to_coco.py
python tools/data/convert_visdrone_to_coco.py
python tools/data/convert_bdd100k_to_coco.py
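For reference, the converted files follow the standard COCO detection layout, roughly as sketched below; the video/frame/track fields shown are typical for MOT conversions, but the exact keys depend on the convert_*_to_coco.py scripts:

```python
# Rough shape of a converted annotation file under datasets/<dataset>/annotations.
import json

coco = {
    "images": [{"id": 1, "file_name": "MOT17-02-FRCNN/img1/000001.jpg",
                "height": 1080, "width": 1920, "frame_id": 1, "video_id": 1}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [100.0, 200.0, 50.0, 120.0],  # [x, y, w, h]
                     "area": 50.0 * 120.0, "iscrowd": 0, "track_id": 7}],
    "categories": [{"id": 1, "name": "pedestrian"}],
}
print(json.dumps(coco, indent=2))
```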
Before mixing different datasets, you need to follow the operations in tools/data/mix_data_xxx.py
to create data folders and soft-links.
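The soft-link preparation amounts to something like the sketch below; the mixed-folder name and linked paths are only examples, so follow the actual mix_data_xxx.py script for the real layout:

```python
# Illustrative version of the folder/soft-link preparation done by tools/data/mix_data_xxx.py.
import os

mix_root = "datasets/mix_mot17_ch"  # example name only
os.makedirs(os.path.join(mix_root, "annotations"), exist_ok=True)

for src, dst in [("datasets/MOT17/images/train", os.path.join(mix_root, "mot_train")),
                 ("datasets/CrowdHuman/images", os.path.join(mix_root, "crowdhuman"))]:
    if not os.path.exists(dst):
        os.symlink(os.path.abspath(src), dst)  # link source images into the mixed folder
```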
Finally, you can mix the training data for the MOT-Challenge benchmarks (no extra training data is needed for the VisDrone-MOT and BDD100K-MOT benchmarks):
python tools/data/mix_data_test_mot17.py
python tools/data/mix_data_test_mot20.py
The pre-trained weights are provided below:
Benchmark | Split | Weights | HOTA | MOTA | IDF1 | IDs |
---|---|---|---|---|---|---|
MOT17 | test | link | 64.5 | 80.9 | 78.8 | 1590 |
MOT20 | test | link | 62.7 | 76.7 | 76.5 | 1552 |
VisDrone | test | link | - | 55.9 | 70.0 | 1282 |
BDD100K | val | link | 58.7 | 62.9 | 68.9 | 16191 |
BDD100K | test | link | 58.3 | 63.0 | 69.2 | 29985 |
Since we adopt several tracking tricks from BoT-SORT, some of the results are slightly different from the performance reported in the paper.
For better results, you can carefully tune the per-sequence tracking parameters, such as the detection score threshold and the matching threshold.
The COCO-pretrained YOLOX models can be downloaded from the YOLOX model zoo. After downloading the pretrained models, put them under <u2mot_HOME>/pretrained.
python tools/train.py -f exps/example/u2mot/yolox_x_ablation_u2mot17.py -d 2 -b 12 --fp16 -o -c pretrained/yolox_x.pth.tar
python tools/train.py -f exps/example/u2mot/yolox_x_mix_u2mot17.py -d 2 -b 12 --fp16 -o -c pretrained/yolox_x.pth.tar
For MOT20, VisDrone-MOT, and BDD100K-MOT, you need to clip the bounding boxes so that they stay inside the image. Specifically, set _clip=True in yolox/data/data_augment.py (line 177), yolox/data/datasets/mosaicdetection.py (lines 86 and 358), and yolox/utils/boxes.py (line 149).
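Conceptually, the _clip=True switch clamps box coordinates to the image boundary, along the lines of the sketch below (illustrative only; see the referenced lines for the actual code):

```python
# Illustrative effect of _clip=True: clamp [x1, y1, x2, y2] boxes to the image boundary so
# that augmented/mosaic boxes never extend outside the image.
import numpy as np

def clip_boxes(boxes, img_h, img_w):
    boxes = boxes.copy()
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, img_w - 1)  # x1, x2
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, img_h - 1)  # y1, y2
    return boxes
```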
python tools/train.py -f exps/example/u2mot/yolox_x_mix_u2mot20.py -d 2 -b 12 --fp16 -o -c pretrained/yolox_x.pth.tar
python tools/train.py -f exps/example/u2mot/yolox_x_u2mot_visdrone.py -d 2 -b 12 --fp16 -o -c pretrained/yolox_x.pth.tar
python tools/train.py -f exps/example/u2mot/yolox_x_u2mot_bdd100k.py -d 2 -b 12 --fp16 -o -c pretrained/yolox_x.pth.tar
To alleviate the joint optimization problem between the detection head and the ReID head, you can first train the detector (following ByteTrack) and then train only the ReID head:
python tools/train.py -f exps/example/u2mot/your_exp_file.py -d 2 -b 12 --fp16 -o -c pretrained/your_detector.pth.tar --freeze
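Conceptually, the --freeze option keeps the detector weights fixed so that only the ReID head is optimized; a minimal sketch of that idea is shown below, with a hypothetical "reid" parameter-name filter:

```python
# Minimal sketch of what --freeze amounts to: freeze the detector, train only the ReID head.
# The "reid" name filter is hypothetical; the real logic lives in the training code.
def get_trainable_params(model):
    for name, param in model.named_parameters():
        if "reid" not in name:
            param.requires_grad = False   # freeze detector backbone/head
    return [p for p in model.parameters() if p.requires_grad]  # pass these to the optimizer
```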
First, you need to prepare your dataset in COCO format. You can refer to MOT-to-COCO or VisDrone-to-COCO.
Second, you need to create an Exp file for your dataset. You can refer to the MOT17 training Exp file. Don't forget to modify get_data_loader() and get_eval_loader() in your Exp file.
Third, modify the img_path2seq(), check_period(), and get_frame_cnt() functions in yolox/data/datasets/mot.py so that they parse your image paths and video information.
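As an example of what those functions need to return, a MOT-style directory layout could be parsed as follows; these implementations and the sampling policy in check_period() are hypothetical, so adapt them to your own path convention:

```python
# Hypothetical implementations for a MOT-style layout
# (.../<sequence>/img1/<frame_id>.jpg); adapt the parsing to your own paths.
import glob
import os

def img_path2seq(img_path):
    # ".../MOT17-02-FRCNN/img1/000001.jpg" -> "MOT17-02-FRCNN"
    return os.path.basename(os.path.dirname(os.path.dirname(img_path)))

def check_period(img_path, interval=3):
    # Example policy: treat every `interval`-th frame as a sampling boundary.
    frame_id = int(os.path.splitext(os.path.basename(img_path))[0])
    return frame_id % interval == 0

def get_frame_cnt(img_path):
    # Total number of frames in the sequence that contains this image.
    return len(glob.glob(os.path.join(os.path.dirname(img_path), "*.jpg")))
```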
Finally, you can train u2mot on your dataset by running:
python tools/train.py -f exps/example/u2mot/your_exp_file.py -d 2 -b 12 --fp16 -o -c pretrained/yolox_x.pth.tar
Performance on the MOT17 half validation set is evaluated with the official TrackEval (the configured code is provided at <u2mot_HOME>/TrackEval).
First, run u2mot to get the tracking results, which will be saved in pretrained/u2mot/track_results:
python tools/track.py datasets/MOT17/images/train --benchmark MOT17-val -f exps/example/u2mot/yolox_x_ablation_u2mot17.py -c pretrained/u2mot/ablation.pth.tar --device 0 --fp16 --fuse
To leverage UTL in the inference stage, just add the --use-uncertainty flag:
python tools/track.py datasets/MOT17/images/train --benchmark MOT17-val -f exps/example/u2mot/yolox_x_ablation_u2mot17.py -c pretrained/u2mot/ablation.pth.tar --device 0 --fp16 --fuse --use-uncertainty
Then, run TrackEval to evaluate the tracking performance:
cd ./TrackEval
python scripts/run_mot_challenge.py --BENCHMARK MOT17 --SPLIT_TO_EVAL train --TRACKERS_TO_EVAL ../../../../../YOLOX_outputs/yolox_x_ablation_u2mot17 --TRACKER_SUB_FOLDER track_res --OUTPUT_SUB_FOLDER track_eval --METRICS HOTA CLEAR Identity --USE_PARALLEL False --NUM_PARALLEL_CORES 1 --SEQMAP_FILE data/gt/mot_challenge/seqmaps/MOT17-train.txt --GT_LOC_FORMAT '{gt_folder}/{seq}/gt/gt.half_val.txt' --PRINT_CONFIG False --PRINT_ONLY_COMBINED True --DISPLAY_LESS_PROGRESS True
Run u2mot, and the results will be saved in YOLOX_outputs/yolox_x_mix_u2mot17/track_res:
python tools/track.py datasets/MOT17/images/test --benchmark MOT17 -f exps/example/u2mot/yolox_x_mix_u2mot17.py -c pretrained/u2mot/mot17.pth.tar --device 0 --fp16 --fuse --cmc-method file --cmc-file-dir MOTChallenge
python tools/interpolation.py --txt_path YOLOX_outputs/yolox_x_mix_u2mot17/track_res --save_path YOLOX_outputs/yolox_x_mix_u2mot17/track_res_dti
Submit the txt files under track_res_dti to the MOTChallenge website for evaluation.
We use the input size 1600 x 896 for MOT20-04, MOT20-07 and 1920 x 736 for MOT20-06, MOT20-08.
Run u2mot:
python tools/track.py datasets/MOT20/images/test --benchmark MOT20 -f exps/example/u2mot/yolox_x_mix_u2mot20.py -c pretrained/u2mot/mot20.pth.tar --device 0 --fp16 --fuse --cmc-method file --cmc-file-dir MOTChallenge
python tools/interpolation.py --txt_path YOLOX_outputs/yolox_x_mix_u2mot20/track_res --save_path YOLOX_outputs/yolox_x_mix_u2mot20/track_res_dti
Submit the txt files under track_res_dti to the MOTChallenge website for evaluation.
We use the input size 1600 x 896 for the VisDrone-MOT benchmark. Following TrackFormer, the performance is evaluated with the default motmetrics.
Run u2mot, and the results will be saved at YOLOX_outputs/yolox_x_u2mot_visdrone/track_res:
python tools/track.py datasets/VisDrone-MOT/VisDrone2019-MOT-test-dev/sequences --benchmark VisDrone -f exps/example/u2mot/yolox_x_u2mot_visdrone.py -c pretrained/u2mot/visdrone.pth.tar --device 0 --fp16 --fuse --cmc-method file --cmc-file-dir VisDrone/test-dev
Evaluate the results:
python tools/utils/eval_visdrone.py
You will get the performance in terms of MOTA, IDF1, and ID switches.
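For reference, a minimal py-motmetrics example of how MOTA, IDF1, and ID switches are computed looks like this (a generic sketch, not the exact logic of tools/utils/eval_visdrone.py):

```python
# Generic py-motmetrics usage for MOTA / IDF1 / ID switches.
import motmetrics as mm
import numpy as np

acc = mm.MOTAccumulator(auto_id=True)

# One frame: ground-truth ids, hypothesis ids, and an IoU-based distance matrix.
gt_ids, hyp_ids = [1, 2], [1, 2]
gt_boxes = np.array([[10, 10, 20, 40], [100, 50, 20, 40]], dtype=float)   # [x, y, w, h]
hyp_boxes = np.array([[12, 11, 20, 40], [101, 51, 20, 40]], dtype=float)
dist = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update(gt_ids, hyp_ids, dist)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["mota", "idf1", "num_switches"], name="demo")
print(mm.io.render_summary(summary, formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```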
Run u2mot, and the results will be saved at YOLOX_outputs/yolox_x_u2mot_bdd100k/track_res:
python tools/track.py datasets/BDD100K-MOT/images/test --benchmark BDD100K -f exps/example/u2mot/yolox_x_u2mot_bdd100k.py -c pretrained/u2mot/bdd100k.pth.tar --device 0 --fp16 --fuse --cmc-method file --cmc-file-dir BDD100K/{split}
Then, convert the result files into json format, zip the json files, and submit them to the official evaluation server EvalAI to get the tracking performance:
python tools/utils/convert_bdd.py
cd YOLOX_outputs/yolox_x_u2mot_bdd100k
zip -r -q bdd100k_pred.zip ./track_res_json
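For orientation, one frame entry in the converted json roughly follows the public BDD100K MOT submission format sketched below; verify the exact field names against the official documentation:

```python
# Rough shape of one frame entry in the BDD100K MOT submission json
# (based on the public BDD100K format; field names should be checked against the docs).
frame_result = {
    "videoName": "b1c66a42-6f7d68ca",
    "name": "b1c66a42-6f7d68ca-0000001.jpg",
    "frameIndex": 0,
    "labels": [
        {
            "id": "1",                      # track id, kept consistent across frames
            "category": "car",
            "score": 0.92,
            "box2d": {"x1": 100.0, "y1": 200.0, "x2": 180.0, "y2": 260.0},
        }
    ],
}
```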
Be careful to evaluate the val and test splits separately.
@inproceedings{liu2023u2mot,
title={Uncertainty-aware Unsupervised Multi-Object Tracking},
author={Liu, Kai and Jin, Sheng and Fu, Zhihang and Chen, Ze and Jiang, Rongxin and Ye, Jieping},
booktitle={International Conference on Computer Vision},
year={2023}
}
A large part of the code is borrowed from YOLOX, FairMOT, ByteTrack, ByteTrack_ReID, and BoT-SORT. Many thanks for their wonderful work.