facebookresearch / co-tracker

CoTracker is a model for tracking any point (pixel) on a video.
https://co-tracker.github.io/

the problem of evaluation #4

Closed qianduoduolr closed 11 months ago

qianduoduolr commented 11 months ago

Hi, thanks for your excellent work! In your paper, I found that the results of PIPs in Table 2 are much higher (64.8% on DAVIS First) than my reproduced results (55.8%). I also noticed another paper [1] reports re-implemented PIPs results (around 55.1% in their Table 1) that are quite close to mine. Besides, the "strided" result for PIPs (59.4%) in TAPIR [2] is even lower than your "first" result. I am wondering what makes the difference: did you re-train PIPs with more GPUs, or make other improvements (e.g., improving the chaining algorithm in evaluation, or using a higher resolution for inference)?

[1] Context-TAP: Tracking Any Point Demands Spatial Context Features. arXiv.
[2] TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement. arXiv.

nikitakaraevv commented 11 months ago

Hi @qianduoduolr, thank you for the question! TAP-Vid benchmarks require evaluation on 256x256 videos. We take the 256x256 videos as input for PIPs and resize them to 384x512, because that is the resolution PIPs was trained on. After running PIPs, we convert the trajectories back to 256x256 coordinates and compute the final numbers at this resolution. This gives much better performance than evaluating PIPs directly on 256x256. When evaluated directly on 256x256, we observed results similar to those reported in TAPIR and TAP-Vid.
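
For concreteness, here is a minimal sketch of the protocol described above (not the actual evaluation code from this repo); `run_pips` and `compute_tapvid_metrics` are hypothetical placeholders for the model call and the TAP-Vid metric computation:

```python
# Sketch: upsample the 256x256 TAP-Vid clip to 384x512 (the PIPs training
# resolution), run the tracker there, then map the predicted trajectories
# back to 256x256 before computing metrics.
import torch
import torch.nn.functional as F

def evaluate_pips_at_train_res(video_256, queries_256, gt_tracks_256,
                               run_pips, compute_tapvid_metrics):
    # video_256: (T, 3, 256, 256); queries_256 / gt_tracks_256 in (x, y) pixel coords at 256x256.
    video_384x512 = F.interpolate(video_256, size=(384, 512),
                                  mode="bilinear", align_corners=False)

    # Scale query coordinates from 256x256 to 384x512: x by 512/256, y by 384/256.
    scale_up = torch.tensor([512 / 256, 384 / 256])
    queries_384x512 = queries_256 * scale_up

    # Run the tracker at its training resolution (placeholder call).
    pred_tracks_384x512 = run_pips(video_384x512, queries_384x512)  # (T, N, 2)

    # Map predictions back to 256x256 so metrics follow the TAP-Vid protocol.
    pred_tracks_256 = pred_tracks_384x512 / scale_up
    return compute_tapvid_metrics(pred_tracks_256, gt_tracks_256)
```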

qianduoduolr commented 11 months ago

Thanks for your reply, you're doing an excellent job. However, I think this point needs to be clarified in Table 2, since some of the compared methods are still evaluated at 256x256 resolution, as the TAP-Vid benchmark prescribes; this is mentioned explicitly in https://github.com/deepmind/tapnet/blob/main/README.md#:~:text=Our%20readers%20also,lower%2Dresolution%20videos.

nikitakaraevv commented 11 months ago

Thank you! We mention this in the implementation details of the "evaluation" section, but perhaps we should also include it in the table itself. The setting discussed in the linked source differs from how we evaluate PIPs: we do not evaluate on full-resolution videos. Our input is the same 256x256 videos provided by the TAP-Vid dataloaders. Upscaling them to 384x512 does not add any information; it only aligns the benchmark resolution with the model's training resolution.

qianduoduolr commented 11 months ago

Thanks for your reply. I think resolution plays an important role in motion estimation. In my experiments, when a model (e.g., TAPNet, RAFT) is trained at 256x256 resolution, running inference at a higher resolution (e.g., 384x512, directly interpolated from the 256x256 videos) gives better results. So it may be a fairer comparison to also run inference for TAPNet/TAPIR (not just PIPs) with your data processing pipeline; see the sketch below.
Anyway, your work is interesting and impresses me a lot.
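
As a rough illustration of the experiment I have in mind (not code from the paper or this repo), one could score the same tracker on 256x256 inputs and on the same clips upsampled to 384x512, always computing metrics in 256x256 coordinates; `run_tracker` and `compute_tapvid_metrics` are again hypothetical placeholders:

```python
# Sketch: compare TAP-Vid metrics for a generic point tracker at two
# inference resolutions, keeping ground truth and scoring at 256x256.
import torch
import torch.nn.functional as F

def compare_inference_resolutions(video_256, queries_256, gt_tracks_256,
                                  run_tracker, compute_tapvid_metrics):
    results = {}
    for h, w in [(256, 256), (384, 512)]:
        scale = torch.tensor([w / 256, h / 256])          # (x, y) scaling factors
        video = F.interpolate(video_256, size=(h, w),
                              mode="bilinear", align_corners=False)
        pred = run_tracker(video, queries_256 * scale)    # predictions in (h, w) coords
        results[(h, w)] = compute_tapvid_metrics(pred / scale, gt_tracks_256)
    return results
```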

nikitakaraevv commented 11 months ago

Thank you @qianduoduolr, we will think about this!