changsn / STViT-R

This is an official implementation for "Making Vision Transformers Efficient from A Token Sparsification View".

Question regarding reproducing the Top-1 accuracy of Swin-Small on ImageNet #6

Open jameslahm opened 3 months ago

jameslahm commented 3 months ago

Thanks for your great work! I just noticed that USE_LAYER_SCALE is actually not used during training, as shown in https://github.com/changsn/STViT-R/blob/d1532e8b74a72c714669bc7201e7fee2089718c4/models/build.py#L15-L35. Therefore, the config for Swin-Small provided in #5 is by default the same as yours. Is the line use_layer_scale=config.USE_LAYER_SCALE missing here? Thanks.
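For concreteness, the suspected fix would look roughly like the following in models/build.py. This is only a sketch: every argument except use_layer_scale is illustrative, and the actual build_model() constructor call passes many more parameters.

    # Hypothetical sketch of the fix discussed above; argument names other than
    # use_layer_scale are placeholders for the real constructor arguments.
    model = SwinTransformer(
        img_size=config.DATA.IMG_SIZE,
        embed_dim=config.MODEL.SWIN.EMBED_DIM,
        depths=config.MODEL.SWIN.DEPTHS,
        num_heads=config.MODEL.SWIN.NUM_HEADS,
        use_layer_scale=config.USE_LAYER_SCALE,  # the line suspected to be missing
    )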

changsn commented 3 months ago

Hi, OMG, you are right. I have checked the code and checkpoint again: use_layer_scale is not passed to the model. That means all your hyper-parameters are the same as mine, yet there is still a 0.3% accuracy gap. Can I ask what your global batch size is when you run the code? Mine is 64x16. The learning rate is scaled according to the global batch size, so a different batch size may hurt the result.
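For reference, Swin-style codebases typically scale the learning rate linearly with the global batch size. A minimal sketch of that convention follows; the base LR of 5e-4 and the divisor of 512 are assumptions taken from the upstream Swin Transformer repo and are not confirmed for this one.

    # Linear LR scaling as in Swin-style training code (assumed convention).
    base_lr = 5e-4                              # common Swin default, assumed here
    global_batch = 64 * 16                      # changsn's setup: 64 per GPU x 16 GPUs = 1024
    scaled_lr = base_lr * global_batch / 512.0  # -> 1e-3
    # jameslahm's setup: 128 per GPU x 8 GPUs = 1024, i.e. the same global batch,
    # so under this convention both runs would use the same scaled learning rate.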

jameslahm commented 3 months ago

Thanks for your reply! My batch size is 128*8, following the default training command:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345  main.py \
--cfg configs/swin_small_patch4_window7_224.yaml --data-path <imagenet-path> --batch-size 128 
changsn commented 3 months ago

I have just updated the log files here: https://github.com/changsn/STViT-R/tree/main/log. I hope they can be of more help to you.

jameslahm commented 3 months ago

Thanks a lot! I will check the difference between the log files. BTW, would you mind giving me some guidance about the semantic segmentation task in #7? I'd appreciate it very much.