SwinTransformer / Video-Swin-Transformer

This is an official implementation for "Video Swin Transformer".
https://arxiv.org/abs/2106.13230
Apache License 2.0

Reproducing results #18

Open zehzhang opened 3 years ago

zehzhang commented 3 years ago

Hi,

Thanks for the great work.

I'm having the same issue as #5 even when I tested the models with the same val split.

I played with Swin-T and Swin-B, and both of them gave 0.4%~0.5% lower top-1 accuracy than reported. The results are still close, but I just want to make sure I am not doing anything wrong.

Would you confirm that the models and the split files uploaded are the correct ones?

Also, if anyone has successfully reproduced the results, please kindly comment here on whether there is anything else needed besides downloading the models and configs and running the test scripts.
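For reference, here is roughly how I invoke the provided test script (a minimal sketch assuming the mmaction2-style `tools/test.py` entry point; the config and checkpoint paths below are placeholders for whichever model you downloaded):

```python
# Rough sketch: single-GPU testing via the repo's test script.
# The config/checkpoint paths are placeholders, not verified file names.
import subprocess

config = "configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py"
checkpoint = "checkpoints/swin_tiny.pth"

# Equivalent to: python tools/test.py <CONFIG> <CHECKPOINT> --eval top_k_accuracy
subprocess.run(
    ["python", "tools/test.py", config, checkpoint, "--eval", "top_k_accuracy"],
    check=True,
)
```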

Thanks,

hust-nj commented 3 years ago

Hi, before training and testing, the Kinetics-400 training and validation datasets we use are preprocessed by resizing each video so its height is 256 pixels; this may cause a small difference. You can contact us at v-jianing@microsoft.com to discuss more details.
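For example, a minimal sketch of that resizing step (assuming ffmpeg is installed; directory names are placeholders):

```python
# Resize every video so its height is 256 px, keeping the aspect ratio
# (width is computed automatically and rounded to an even number by -2).
import subprocess
from pathlib import Path

src_dir = Path("kinetics400/val")      # placeholder input directory
dst_dir = Path("kinetics400/val_256")  # placeholder output directory
dst_dir.mkdir(parents=True, exist_ok=True)

for video in src_dir.glob("*.mp4"):
    subprocess.run(
        ["ffmpeg", "-i", str(video),
         "-vf", "scale=-2:256",  # width auto (even), height 256
         "-c:a", "copy",         # keep the audio stream unchanged
         str(dst_dir / video.name)],
        check=True,
    )
```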

zehzhang commented 3 years ago

> Facing the same problem. I used the annotation files in this repo and tested the provided pretrained models without any extra modification. Swin-T and Swin-B achieved 78.4% and 80.1% top-1 accuracy on K400, which seems slightly worse than the reported results. @zehzhang Have you resolved this issue? If so, could you please share your solution?

Thanks for confirming the problem. I got a similar decrease with Swin-T (-0.4% top-1 acc) and Swin-B pretrained on ImageNet-21k (-0.5% top-1 acc). I'm reaching out to the other first co-author (referred to by @hust-nj) and hopefully will figure out what is going on soon. I will keep this thread updated.

hust-nj commented 3 years ago

After a careful comparison, we found that the performance gap is due to a slight difference in the data. Our Kinetics-400 data at 256 resolution was obtained (with broken videos removed) from the non-local networks release, which has also been used in many other lines of work.

More details and the data download links can be found here: https://github.com/youngwanLEE/VoV3D/blob/main/DATA.md#kinetics-400 and https://github.com/facebookresearch/video-nonlocal-net/issues/67
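If you want to check whether your copy of the data matches ours, here is a quick sketch for diffing two val annotation lists (the file names and the "<video_path> <label>" line format are assumptions, so adjust them to your setup):

```python
# Report clips that appear in one val list but not the other.
def read_ids(path):
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

ours = read_ids("kinetics400_val_list_videos.txt")  # placeholder file name
theirs = read_ids("nonlocal_val_list.txt")          # placeholder file name

print(f"only in this repo's list:   {len(ours - theirs)}")
print(f"only in the non-local list: {len(theirs - ours)}")
```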

dragen1860 commented 2 years ago

@zehzhang Hi, thanks for raising this issue. Did you also train Video Swin from scratch, without ImageNet-21k pretraining? If so, did the accuracy drop severely? Thank you.