facebookresearch/co-tracker

CoTracker is a model for tracking any point (pixel) on a video.
https://co-tracker.github.io/

evaluation results on BADJA don't match the paper. #46

Closed AssafSinger94 closed 11 months ago

AssafSinger94 commented 1 year ago

Hi, when trying to evaluate the model on BADJA, I am getting different results than those reported in the paper. The results are as follows (I added the average results at the end of the dictionary; a sketch of how I computed them follows below):

{
    "bear": 88.57142639160156,
    "bear_accuracy": 20.357141494750977,
    "camel": 90.35369873046875,
    "camel_accuracy": 22.186494827270508,
    "cows": 86.89839935302734,
    "cows_accuracy": 31.283422470092773,
    "dog": 54.59769821166992,
    "dog-agility": 6.896551609039307,
    "dog-agility_accuracy": 0.0,
    "dog_accuracy": 4.597701072692871,
    "horsejump-high": 62.25165557861328,
    "horsejump-high_accuracy": 17.218544006347656,
    "horsejump-low": 62.30366897583008,
    "horsejump-low_accuracy": 27.74869155883789,
    "avg": 64.55329983575004,
    "avg acc 3px": 17.627427918570383,
    "time": 576.1103093624115
}
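For reference, this is how I computed the two averages at the end of the dictionary (a minimal sketch; the variable names are mine, not from the repo):

```python
# Reproduce the "avg" and "avg acc 3px" entries from the per-video results above.
results = {
    "bear": 88.57142639160156, "bear_accuracy": 20.357141494750977,
    "camel": 90.35369873046875, "camel_accuracy": 22.186494827270508,
    "cows": 86.89839935302734, "cows_accuracy": 31.283422470092773,
    "dog": 54.59769821166992, "dog_accuracy": 4.597701072692871,
    "dog-agility": 6.896551609039307, "dog-agility_accuracy": 0.0,
    "horsejump-high": 62.25165557861328, "horsejump-high_accuracy": 17.218544006347656,
    "horsejump-low": 62.30366897583008, "horsejump-low_accuracy": 27.74869155883789,
}
videos = ["bear", "camel", "cows", "dog", "dog-agility",
          "horsejump-high", "horsejump-low"]

# Segment-based accuracy, averaged over the 7 videos.
avg = sum(results[v] for v in videos) / len(videos)                        # 64.553...
# 3px accuracy, averaged over the 7 videos.
avg_acc_3px = sum(results[v + "_accuracy"] for v in videos) / len(videos)  # 17.627...
```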

I was able to evaluate the model on TAP-Vid DAVIS properly. For BADJA, I ran the following command:

python ./cotracker/evaluation/evaluate.py \
    --config-name eval_badja \
    exp_dir=./eval_outputs_badja \
    dataset_root=<path_to_BADJA_dir>

Could you please assist me with this matter? In addition, I see that the "extra_videos" referred to in BADJA are not being evaluated; they are explicitly ignored during dataset creation. Could you please explain why they are not evaluated?

Thank you for your help! Assaf

nikitakaraevv commented 1 year ago

Hi @AssafSinger94, the numbers reported in the paper are 63.6 and 18.0, whereas you got 64.6 and 17.6: your segment-based accuracy is slightly better and your 3px accuracy slightly worse. This small difference could be due to different versions of some libraries, especially since BADJA is just a small set of 7 short videos. As for the extra videos: in this evaluation we follow PIPs, so these videos are not included, to keep the numbers consistent.

AssafSinger94 commented 1 year ago

Thank you for your reply, @nikitakaraevv! One more thing I wanted to ask: I see that you always sample the query points at frame 0 of each trajectory, even though a few of the trajectories (not many) are occluded on frame 0. Is that how the query points are supposed to be sampled on BADJA? Isn't it supposed to work like TAP-Vid with query-mode='first', where you sample the first non-occluded frame of the trajectory? Perhaps I misunderstood something in the paper.
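To make the question concrete, here is a minimal sketch of the sampling I would expect in 'first' mode (the function, array names, and shapes are my own illustration, not the repo's code):

```python
import numpy as np

def sample_first_visible_queries(trajs, visibles):
    """Build one query (t, x, y) per trajectory at its first visible frame,
    instead of always querying at frame 0.

    trajs:    (T, N, 2) float array of (x, y) positions per frame
    visibles: (T, N) bool array, True where the point is not occluded
    """
    T, N, _ = trajs.shape
    queries = []
    for n in range(N):
        visible_frames = np.nonzero(visibles[:, n])[0]
        if len(visible_frames) == 0:
            continue  # trajectory is never visible: nothing to query
        t0 = visible_frames[0]  # first non-occluded frame
        x, y = trajs[t0, n]
        queries.append((t0, x, y))
    return np.array(queries)
```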

Thank you very much for your assistance and responsiveness! Assaf

nikitakaraevv commented 1 year ago

@AssafSinger94 You're right, it is supposed to function like in TAP-Vid. The results for this benchmark should be slightly better after fixing this bug.

nikitakaraevv commented 11 months ago

We do not evaluate CoTracker on BADJA in the new version of the paper, because BADJA is only a subset of DAVIS.

LHY-HongyangLi commented 9 months ago

Hi @nikitakaraevv, I tried to reproduce the performance of CoTracker v2 on DAVIS First in the "glob. 5×5" query mode, using the checkpoint you provided. But I got the following results, which do not match the numbers you posted in Table 3. Is there anything wrong?

evaluate_result {
    'occlusion_accuracy': 0.8830991955685764,
    'pts_within_1': 0.41818242347516404,
    'jaccard_1': 0.27439414118807914,
    'pts_within_2': 0.6585306168489217,
    'jaccard_2': 0.4944025070918661,
    'pts_within_4': 0.8213521434366008,
    'jaccard_4': 0.67601993042582,
    'pts_within_8': 0.900469386477676,
    'jaccard_8': 0.7759953203865924,
    'pts_within_16': 0.9396396614828214,
    'jaccard_16': 0.8167342923773883,
    'average_jaccard': 0.607509238293949,
    'average_pts_within_thresh': 0.7476348463442369
}
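For what it's worth, the two summary numbers are internally consistent with the per-threshold values: average_jaccard and average_pts_within_thresh are the means over the five pixel thresholds. A quick check (my own snippet, not code from the repo):

```python
# Verify that the summary metrics are means over thresholds {1, 2, 4, 8, 16}.
r = {
    'jaccard_1': 0.27439414118807914, 'jaccard_2': 0.4944025070918661,
    'jaccard_4': 0.67601993042582, 'jaccard_8': 0.7759953203865924,
    'jaccard_16': 0.8167342923773883,
    'pts_within_1': 0.41818242347516404, 'pts_within_2': 0.6585306168489217,
    'pts_within_4': 0.8213521434366008, 'pts_within_8': 0.900469386477676,
    'pts_within_16': 0.9396396614828214,
}
thresholds = [1, 2, 4, 8, 16]
avg_jaccard = sum(r[f'jaccard_{t}'] for t in thresholds) / len(thresholds)
avg_pts = sum(r[f'pts_within_{t}'] for t in thresholds) / len(thresholds)
print(avg_jaccard, avg_pts)  # 0.6075... 0.7476..., matching the dict above
```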