AssafSinger94 / dino-tracker

Official PyTorch Implementation for “DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video”

Benchmark results different from the original papers #15

Closed by serycjon 1 month ago

serycjon commented 2 months ago

Hi, I was looking at Table 1 in the paper (https://arxiv.org/pdf/2403.14548), and it seems to me that the results reported there do not agree with the numbers published in the respective papers. Did you re-run all the methods yourself and re-evaluate them? Do you have an explanation for the differences? And if you did re-run everything, would you please share the raw results (point tracks and occlusion masks)?

Some examples on DAVIS (delta_avg, OA, AJ):

| method | original paper | yours |
| --- | --- | --- |
| TAP-Net | 53.1, 82.3, 38.4 | 53.4, 81.4, 38.4 |
| CoTracker (v1*) | 79.1, 88.7, 64.8 | 79.2, 89.3, 65.1 |
| TAPIR | 73.6, 88.8, 61.3 | 74.7, 89.4, 62.8 |

(* CoTrackerv2 is also different)

The differences are not huge, but they are not exactly tiny in all cases either, so I think it is still important to understand where they come from.
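For reference, a rough sketch of how the three scalars are defined in the TAP-Vid protocol is below. It is simplified and illustrative only: the official evaluation also excludes query frames and averages the per-video scalars over the dataset, and the function name here is hypothetical.

```python
import numpy as np

def tapvid_scalars_single_video(gt_tracks, gt_visible, pred_tracks, pred_visible):
    """Simplified TAP-Vid scalars for one video (illustrative, not the official code).

    gt_tracks, pred_tracks: [N, T, 2] point coordinates at 256x256 resolution.
    gt_visible, pred_visible: [N, T] boolean visibility (True = not occluded).
    """
    dist = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)  # [N, T] pixel errors

    # OA: how often the predicted visibility/occlusion flag matches ground truth.
    oa = np.mean(pred_visible == gt_visible)

    deltas, jaccards = [], []
    for thr in (1, 2, 4, 8, 16):
        within = dist < thr
        # delta^thr: fraction of visible GT points predicted within thr pixels.
        deltas.append(np.sum(within & gt_visible) / np.sum(gt_visible))
        # Jaccard at thr: TP / (TP + FP + FN), where a TP must be visible in GT,
        # predicted visible, and within the threshold.
        tp = np.sum(within & gt_visible & pred_visible)
        fp = np.sum(pred_visible & (~gt_visible | ~within))
        fn = np.sum(gt_visible & (~pred_visible | ~within))
        jaccards.append(tp / (tp + fp + fn))

    return {
        "delta_avg": float(np.mean(deltas)),  # average_pts_within_thresh
        "AJ": float(np.mean(jaccards)),       # average_jaccard
        "OA": float(oa),
    }
```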

AssafSinger94 commented 2 months ago

Hi Jonas, thank you for your question. We evaluated all methods ourselves using their official repos.

Evaluating TAP-Net on TAP-Vid DAVIS yielded:
Eval scalars: {'average_jaccard': 0.3835836, 'average_pts_within_thresh': 0.5339951, 'occlusion_accuracy': 0.81390005}

Evaluating TAPIR on TAP-Vid DAVIS yielded:
Eval scalars: {'average_jaccard': 0.6281295, 'average_pts_within_thresh': 0.7466243, 'occlusion_accuracy': 0.89445394}
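For context, per-video dicts like these come from the TAP-Vid evaluation utilities. Below is a minimal sketch of producing one, assuming the `compute_tapvid_metrics` helper from the `tapnet` repo (the exact import path has moved between versions) and dummy data standing in for real predictions.

```python
import numpy as np
# Assumed import: in the official tapnet repo this helper lives in
# evaluation_datasets.py (the exact module path may differ between versions).
from tapnet.evaluation_datasets import compute_tapvid_metrics

N, T = 5, 24  # dummy video: 5 query points, 24 frames
rng = np.random.default_rng(0)

# TAP-Vid conventions, batch of one video:
#   query_points: [1, N, 3] as (t, y, x) in 256x256 raster coordinates
#   *_occluded:   [1, N, T] boolean
#   *_tracks:     [1, N, T, 2] as (x, y) in 256x256 coordinates
query_points = np.stack(
    [np.zeros(N), rng.uniform(0, 256, N), rng.uniform(0, 256, N)], axis=-1
)[None]
gt_tracks = rng.uniform(0, 256, (1, N, T, 2))
pred_tracks = gt_tracks + rng.normal(0.0, 2.0, gt_tracks.shape)
gt_occluded = np.zeros((1, N, T), dtype=bool)
pred_occluded = np.zeros((1, N, T), dtype=bool)

scalars = compute_tapvid_metrics(
    query_points, gt_occluded, gt_tracks,
    pred_occluded, pred_tracks, query_mode="first",  # DAVIS "first" protocol
)
# Averaging such per-video dicts over all DAVIS videos gives the reported
# average_jaccard (AJ), average_pts_within_thresh (delta_avg) and OA numbers.
print({k: float(np.mean(v)) for k, v in scalars.items()})
```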

We evaluated CoTracker (v2) using their repo. However, their code rescales the target and query point coordinates to the [0, 255] range instead of [0, 256]; this issue still exists in their repo. Fixing it produced the following results: "average_jaccard": 0.64614, "average_pts_within_thresh": 0.79159, "occlusion_accuracy": 0.88447.
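To illustrate the kind of off-by-one described here, below is a hypothetical sketch (not taken from either repo) of rescaling point coordinates to the 256x256 TAP-Vid evaluation resolution. Mapping onto a [0, 255] range instead of scaling by 256 shifts every coordinate by up to one pixel, which is enough to nudge the threshold-based metrics.

```python
import numpy as np

def to_eval_resolution(points_xy: np.ndarray, width: int, height: int,
                       target: float = 256.0) -> np.ndarray:
    """Scale (x, y) pixel coordinates from a width x height video to target x target.

    Hypothetical helper for illustration: the TAP-Vid protocol evaluates at
    256x256, so the factor is target / original_size per axis.
    """
    return points_xy * np.array([target / width, target / height])

def to_eval_resolution_255(points_xy: np.ndarray, width: int, height: int) -> np.ndarray:
    """The [0, 255]-style variant described above (one plausible reading of it)."""
    return points_xy * np.array([255.0 / width, 255.0 / height])

# A point at the bottom-right corner of an 854x480 DAVIS frame:
pt = np.array([[854.0, 480.0]])
print(to_eval_resolution(pt, 854, 480))      # [[256. 256.]]
print(to_eval_resolution_255(pt, 854, 480))  # [[255. 255.]] -> up to 1 px shift
```

Since the pts_within_{1, 2, 4, 8, 16} thresholds are measured in pixels at 256x256, a systematic sub-pixel shift of this size can plausibly account for differences of a few tenths of a point in delta_avg and AJ.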

serycjon commented 2 months ago

Hi, thanks for the very fast response. Good job replicating their results. Do you have any idea why the results are different? Are the methods non-deterministic (does running the evaluation twice give slightly different results), or do you think it is caused by different software or hardware versions?

tnarek commented 1 month ago

Hi @serycjon, the process should be deterministic. I think the evaluation code in some of these repos has undergone a few changes (such as scaling the points differently), which may cause slight discrepancies in the results.