Hon-Wong / Elysium

[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
https://hon-wong.github.io/Elysium/

Evaluation of LaSOT #14

Open yangchris11 opened 2 weeks ago

yangchris11 commented 2 weeks ago

Just want to confirm my understanding is right.

In the evaluation of SOT performance of Elysium from otb.py: https://github.com/Hon-Wong/Elysium/blob/5e6d14ed6939cde3cbffaba5424d5d929c38e492/eval/otb.py#L172-L198

the final metrics are averaged across a total of 98,036 sequences (8 frames each). The result I got matches (and is slightly higher than) the paper's reported numbers.

auc:  tensor([57.9670])
prec_score:  tensor([62.4371])
norm_prec_score:  tensor([53.4776])

To me, the SUC and Precision on LaSOT computed this way may be heavily influenced by the varying number of frames across sequences in the LaSOT test set (longer sequences dominate the average).

Is this a fair comparison against other VOT trackers, given that they evaluate on entire sequences and then average across the 280 sequences?
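To make the concern concrete, here is a minimal sketch (hypothetical scores, plain Python) contrasting the two averaging schemes: averaging over fixed-length clips weights each clip equally, so sequences with more frames contribute more clips and dominate the mean, while the standard per-sequence protocol counts each sequence once.

```python
def clip_level_mean(per_frame_scores_by_seq, clip_len=8):
    # Split every sequence into fixed-length clips and average clip scores;
    # longer sequences contribute more clips, so they dominate the mean.
    clip_scores = []
    for scores in per_frame_scores_by_seq:
        for i in range(0, len(scores), clip_len):
            clip = scores[i:i + clip_len]
            clip_scores.append(sum(clip) / len(clip))
    return sum(clip_scores) / len(clip_scores)

def sequence_level_mean(per_frame_scores_by_seq):
    # Standard protocol: average per-sequence means, one vote per sequence.
    seq_means = [sum(s) / len(s) for s in per_frame_scores_by_seq]
    return sum(seq_means) / len(seq_means)

# Toy case: a short sequence scoring 1.0 vs a long sequence scoring 0.0.
scores = [[1.0] * 8, [0.0] * 80]
print(clip_level_mean(scores))      # ~0.091: the long sequence dominates
print(sequence_level_mean(scores))  # 0.5: each sequence counts equally
```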

Hon-Wong commented 2 weeks ago

Very good question!

Sorry that we did not give enough instructions. I just uploaded the script for merging clips and updated the README with better instructions.

Please merge the short clips into one video per sequence before evaluating. Also, make sure to remove the overlapping frames (the first frame of each clip except the first clip), so that Elysium is evaluated in a way consistent with other VOT trackers. To do this, just run eval/merge_result.py following the instructions. You can also find the merged result here.
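The merging step described above can be sketched as follows (the actual logic lives in eval/merge_result.py; the function name here is hypothetical). Since the first frame of each clip duplicates the last frame of the previous clip, it is dropped for every clip except the first:

```python
def merge_clips(clips):
    """Merge per-clip prediction lists for one video into a single list,
    dropping the overlapping first frame of every clip after the first."""
    merged = list(clips[0])
    for clip in clips[1:]:
        merged.extend(clip[1:])  # skip the frame shared with the previous clip
    return merged

# Each clip overlaps the previous one by a single frame:
clips = [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
print(merge_clips(clips))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```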

You are expected to get something like:

auc:  tensor([58.7632])
prec_score:  tensor([64.0076])
norm_prec_score:  tensor([54.4493])

If you're still having trouble reproducing this result, feel free to ask me for more specific instructions so that I can update the guidance in the README.

yangchris11 commented 2 weeks ago

Thank you for the merge file.

However, I would like to raise another issue I found in otb.py. I may be missing something, so please correct me if I am wrong.

In https://github.com/Hon-Wong/Elysium/blob/d5919041fc0d2e5a53dfb012b3b526f6113cf0c8/eval/otb.py#L183-L184 the tlbr_to_tlwh function is https://github.com/Hon-Wong/Elysium/blob/d5919041fc0d2e5a53dfb012b3b526f6113cf0c8/eval/otb.py#L131-L138

However, I think the original tlbr boxes have already gone through a scaling process in the earlier lines https://github.com/Hon-Wong/Elysium/blob/d5919041fc0d2e5a53dfb012b3b526f6113cf0c8/eval/otb.py#L173-L176, which makes the .clamp(1, 100) questionable: the boxes are in pixel coordinates at that point, so they should not be clamped to (1, 100).
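A minimal sketch of what I mean (plain Python on a single box, not the repo's tensor code): clamping to (1, 100) only makes sense for boxes normalized to a 0-100 coordinate space; once the boxes have been rescaled to pixels, the clamp silently caps every coordinate and side length at 100 pixels.

```python
def tlbr_to_tlwh_clamped(tlbr):
    # Convert (x1, y1, x2, y2) to (x, y, w, h), then clamp to (1, 100)
    # as otb.py does. Correct for 0-100 normalized boxes, destructive
    # for pixel-space boxes.
    x1, y1, x2, y2 = tlbr
    w, h = x2 - x1, y2 - y1
    clamp = lambda v: min(max(v, 1.0), 100.0)
    return [clamp(v) for v in (x1, y1, w, h)]

# A pixel-space box: every value above 100 gets capped.
print(tlbr_to_tlwh_clamped([120.0, 80.0, 620.0, 480.0]))
# → [100.0, 80.0, 100.0, 100.0]
```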

If I remove the .clamp(1, 100) clipping, I get

auc:  tensor([30.8999])
prec_score:  tensor([23.0065])
norm_prec_score:  tensor([27.7496])

from running otb.py with the merged JSON.