chengche6230 / ReST

[ICCV 2023] ReST: A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking
MIT License

Inference Guideline #10

Closed durbin-164 closed 5 months ago

durbin-164 commented 6 months ago

Hi @chengche6230

I see from the paper that you use MVDeTr for inference. Could you please give some examples of how to use ReST for inference on videos that have no ground truth?

Thanks for helping.

chengche6230 commented 6 months ago

Hi @durbin-164

Yes, we used off-the-shelf object detectors such as MVDeTr and YOLO in our experiments. ReST is designed as a tracker, so it needs detections as input. You can use either ground-truth detections or detections from any other high-quality detector.
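As a rough illustration of feeding detections to a tracker, the sketch below collects per-frame, per-camera boxes from an arbitrary detector and serializes them to JSON. The `Detection` class, the field names (`cam_id`, `bbox`, `score`), and the frame-keyed layout are hypothetical and are not taken from the ReST repository; check the repository's dataset files for the format it actually expects.

```python
import json
from dataclasses import dataclass


@dataclass
class Detection:
    """One detector output (hypothetical structure, for illustration only)."""
    frame_id: int
    cam_id: int
    bbox: tuple   # (x, y, w, h) in pixels; an assumed convention
    score: float  # detector confidence in [0, 1]


def detections_to_json(detections, min_score=0.5):
    """Drop low-confidence boxes, group the rest by frame, return a JSON string."""
    frames = {}
    for det in detections:
        if det.score < min_score:
            continue  # skip detections below the confidence threshold
        frames.setdefault(det.frame_id, []).append(
            {"cam_id": det.cam_id, "bbox": list(det.bbox), "score": det.score}
        )
    # JSON object keys must be strings, so frame ids are stringified.
    return json.dumps({str(k): v for k, v in sorted(frames.items())}, indent=2)


if __name__ == "__main__":
    dets = [
        Detection(0, 0, (10, 20, 30, 60), 0.91),
        Detection(0, 1, (15, 22, 28, 58), 0.30),  # filtered out by min_score
        Detection(1, 0, (12, 21, 30, 60), 0.88),
    ]
    print(detections_to_json(dets))
```

The confidence threshold is an arbitrary choice here; a tracker that focuses on data association generally benefits from filtering obviously spurious boxes before graph construction.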

durbin-164 commented 6 months ago

Hi @chengche6230

Thanks for your quick reply. So, I need to first make a JSON file with any object detection model and then run the ReST tracker, right?

Is it possible to track in real time, i.e. to detect and track frame by frame rather than processing the whole video at once?

Also, I tried to train the Wildtrack model. After training SG and TG separately, I got two models and used them for testing, but I could not achieve good results, whereas your pre-trained model works perfectly on the same test. What might I have missed? Could you help me?

For my trained model, I get evaluation results like the following, which show that tracking is not working at all:

        IDF1  IDP  IDR   Rcll   Prcn GT MT PT ML FP FN  IDs FM MOTA  MOTP IDt  IDa IDm
0       1.4% 1.4% 1.4% 100.0% 100.0%  9  9  0  0  0  0  612  0 1.4% 0.000   0  612   0
1       1.4% 1.4% 1.4% 100.0% 100.0%  8  8  0  0  0  0  544  0 1.4% 0.000   0  544   0
OVERALL 1.4% 1.4% 1.4% 100.0% 100.0% 17 17  0  0  0  0 1156  0 1.4% 0.000   0 1156   0
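One possible reading of these numbers, sketched below: recall and precision are 100% with zero FP and FN, so every box is matched, yet almost every match counts as an ID switch. That pattern is what you would see if the tracker gave every detection its own new ID, i.e. no association at all. The check assumes the standard CLEAR MOT definition MOTA = 1 - (FP + FN + IDs) / (number of ground-truth boxes) and the usual IDF1 definition. The ground-truth box count is not in the table, so the code infers it as IDs + GT under that hypothesis.

```python
def mota(num_gt_boxes, fp, fn, id_switches):
    """CLEAR MOT accuracy: 1 - (FP + FN + IDSW) / number of ground-truth boxes."""
    return 1.0 - (fp + fn + id_switches) / num_gt_boxes


def idf1(idtp, idfp, idfn):
    """Identity F1: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def check_row(gt_tracks, id_switches):
    """Recompute MOTA and IDF1 assuming every detection received a unique ID.

    Hypothesis (an assumption, not stated in the thread): each GT track
    contributes one non-switch box and all remaining boxes are ID switches,
    so num_boxes = id_switches + gt_tracks. Each GT track can then be
    identity-matched to a single one-box hypothesis track, giving
    IDTP = gt_tracks and IDFP = IDFN = id_switches.
    """
    num_boxes = id_switches + gt_tracks
    return (
        num_boxes,
        mota(num_boxes, fp=0, fn=0, id_switches=id_switches),
        idf1(idtp=gt_tracks, idfp=id_switches, idfn=id_switches),
    )


if __name__ == "__main__":
    # (GT tracks, IDs) copied from the table above.
    rows = {"0": (9, 612), "1": (8, 544), "OVERALL": (17, 1156)}
    for name, (gt, ids) in rows.items():
        boxes, m, f = check_row(gt, ids)
        print(f"{name}: boxes={boxes} MOTA={m:.1%} IDF1={f:.1%}")
```

Under this hypothesis both metrics come out at 1.4% for all three rows, matching the reported values, which points at the association stage (for example, edge predictions that are all negative) rather than at the detections.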

chengche6230 commented 6 months ago

Hi,

Yes, use the detections as input to the tracker. In our work, we focus on the data association part rather than on designing an end-to-end model. It would be great if you could follow up on our research and create a graph-based end-to-end tracker.

As for the training issue, you could try data augmentation, i.e. generating more diverse graphs for both SG and TG. Taking TG as an example, combine more frames and use different combinations of cameras during training.
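A minimal sketch of what such augmentation could look like for TG: each training sample draws a random subset of cameras and a random window of frames, and a graph would then be built from the detections in that subset. The function name, its parameters, and the use of consecutive frame windows are assumptions for illustration; they are not the sampling code from the ReST repository.

```python
import itertools
import random


def sample_graph_inputs(num_cams, num_frames, min_cams=2,
                        window_sizes=(2, 3, 4), num_samples=8, seed=0):
    """Return (camera_subset, frame_window) pairs for building diverse graphs.

    All names and defaults here are hypothetical. Requires at least one
    entry of window_sizes to be <= num_frames.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    # Every camera combination containing at least `min_cams` cameras.
    cam_subsets = [
        combo
        for k in range(min_cams, num_cams + 1)
        for combo in itertools.combinations(range(num_cams), k)
    ]
    valid_windows = [w for w in window_sizes if w <= num_frames]
    samples = []
    for _ in range(num_samples):
        cams = rng.choice(cam_subsets)
        window = rng.choice(valid_windows)
        start = rng.randrange(num_frames - window + 1)
        frames = tuple(range(start, start + window))  # consecutive frames
        samples.append((cams, frames))
    return samples


if __name__ == "__main__":
    # Wildtrack has 7 cameras; 40 frames is an arbitrary example length.
    for cams, frames in sample_graph_inputs(num_cams=7, num_frames=40):
        print("cameras:", cams, "frames:", frames)
```

Varying both the camera subset and the window length exposes the model to graphs of different sizes and connectivity, which is the kind of diversity the suggestion above is aiming for.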

durbin-164 commented 5 months ago

Thanks a lot for helping me.