Open yangchris11 opened 1 month ago
Very good question!
Sorry that we did not give enough instructions. I just uploaded the script for merging clips and updated the readme for better instructions.
Please merge the short clips into one video before evaluating. Also make sure to remove any overlapping frames (the first frame of each clip except the first clip), so that you can evaluate Elysium in a way consistent with other VOT trackers. To do this, just run eval/merge_result.py following the instructions. You can also find the merged result here.
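For intuition, the overlap removal described above can be sketched as follows. This is a simplified illustration, not the actual code in eval/merge_result.py; the `merge_clips` helper and its list-of-lists input format are assumptions for the example.

```python
# Hypothetical sketch of the overlap removal described above: keep the first
# clip whole, and drop the first (duplicated) frame of every later clip.

def merge_clips(clips):
    """Concatenate per-clip frame lists, dropping the first frame of
    every clip except the first to avoid duplicated boundary frames."""
    merged = []
    for i, clip in enumerate(clips):
        merged.extend(clip if i == 0 else clip[1:])
    return merged

# Example: three 8-frame clips, each overlapping the previous by one frame
clips = [list(range(0, 8)), list(range(7, 15)), list(range(14, 22))]
print(merge_clips(clips))  # frames 0..21, no duplicates
```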
You are expected to get something like:
auc: tensor([58.7632])
prec_score: tensor([64.0076])
norm_prec_score: tensor([54.4493])
If you're still having trouble reproducing the same result, feel free to ask me for more specific instructions so that I can update the guidance in the readme.
Thank you for the merge file.
However, I would like to raise another issue I found in otb.py. I may be missing something, so please kindly correct me if I am wrong.
In
https://github.com/Hon-Wong/Elysium/blob/d5919041fc0d2e5a53dfb012b3b526f6113cf0c8/eval/otb.py#L183-L184
the `tlbr_to_tlwh` function is defined here:
https://github.com/Hon-Wong/Elysium/blob/d5919041fc0d2e5a53dfb012b3b526f6113cf0c8/eval/otb.py#L131-L138
However, I think the original tlbr has already gone through a scaling process in the earlier lines
https://github.com/Hon-Wong/Elysium/blob/d5919041fc0d2e5a53dfb012b3b526f6113cf0c8/eval/otb.py#L173-L176
which puts it in pixel coordinates. That makes the `.clamp(1, 100)` look wrong: the boxes are no longer in a 0–100 range, so they should not be clamped to (1, 100).
If I remove the `.clamp(1, 100)` clipping, I get
auc: tensor([30.8999])
prec_score: tensor([23.0065])
norm_prec_score: tensor([27.7496])
from running otb.py with the merged JSON.
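To make the concern concrete, here is a standalone toy version of a tlbr-to-tlwh conversion (the signature and clamping are my own illustration, not the repo's actual `tlbr_to_tlwh`). Clamping to (1, 100) is harmless for coordinates on a 0–100 scale, but once the box is in pixel coordinates it silently truncates anything beyond 100 pixels:

```python
import torch

# Standalone illustration (NOT the repo's code): convert a [x1, y1, x2, y2]
# box to [x, y, w, h], optionally applying a (1, 100) clamp like the one
# discussed above.

def tlbr_to_tlwh_demo(tlbr, clamp=True):
    tlwh = torch.cat([tlbr[:2], tlbr[2:] - tlbr[:2]], dim=0)
    # With pixel coordinates, this clamp truncates every value above 100
    return tlwh.clamp(1, 100) if clamp else tlwh

# A box in pixel coordinates on a 1280x720 frame
box = torch.tensor([200.0, 150.0, 900.0, 600.0])
print(tlbr_to_tlwh_demo(box, clamp=False))  # tensor([200., 150., 700., 450.])
print(tlbr_to_tlwh_demo(box, clamp=True))   # tensor([100., 100., 100., 100.])
```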
Just want to confirm my understanding is right.
In the SOT evaluation of Elysium in otb.py (https://github.com/Hon-Wong/Elysium/blob/5e6d14ed6939cde3cbffaba5424d5d929c38e492/eval/otb.py#L172-L198), the final metrics are averaged across a total of 98036 sequences (8 frames each). The result I got matches (and is slightly higher than) the paper's reported numbers.
To me, the SUC and precision on LaSOT evaluated this way may be heavily influenced by the differing number of frames across sequences in the LaSOT test set (longer sequences dominate such an average).
Is this a fair comparison against other VOT trackers, which evaluate on the entire sequence and then average across the 280 sequences?
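A toy example of the weighting difference (my own sketch, not the repo's code): pooling all 8-frame clips into one global mean weights each sequence by its length, while the per-sequence protocol gives every sequence equal weight.

```python
# Clip-level pooling vs. per-sequence averaging, as discussed above.

def clip_level_mean(per_frame_scores):
    # Pool all frames (clips) from all sequences, then take one global mean
    all_frames = [s for seq in per_frame_scores for s in seq]
    return sum(all_frames) / len(all_frames)

def sequence_level_mean(per_frame_scores):
    # Standard protocol: mean per sequence first, then mean over sequences
    per_seq = [sum(seq) / len(seq) for seq in per_frame_scores]
    return sum(per_seq) / len(per_seq)

# Two sequences: a long, easy one and a short, hard one
scores = [[0.9] * 900, [0.3] * 100]
print(clip_level_mean(scores))      # ~0.84, dominated by the long sequence
print(sequence_level_mean(scores))  # ~0.6
```

The gap (roughly 0.84 vs 0.6 here) only grows as sequence lengths become more skewed, which is why the two protocols are not directly comparable.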