OscarXZQ / weight-selection


Any special settings to reproduce the accuracy on CIFAR-100? #5

Closed osiriszjq closed 2 months ago

osiriszjq commented 2 months ago

Thanks for your impressive work! I'm also interested in studying the initialization of ViTs. However, when I clone your repo and run these two commands,

python3 weight_selection.py \
--output_dir /path/to/weight_selection/ \
--model_type vit \
--pretrained_model vit_small_patch16_224_in21k
python main.py \
--model vit_tiny  --warmup_epochs 50 --epochs 300 \
--batch_size 64 --lr 2e-3 --update_freq 1 --use_amp true \
--initialize /path/to/weight_selection \
--data_path /path/to/data/ \
--data_set CIFAR100 \
--output_dir /path/to/results/

I can only get 77.6% instead of the 81.4% reported in your paper. I think the only thing I changed was the number of GPUs, from 8 to 1, which should not make such a big difference. Are there any special settings in your training? I noticed that your code mentions a weight decay scheduler, but no specific value is given. Could this cause the difference? If you have any thoughts about this, that would be a great help.

OscarXZQ commented 2 months ago

Hi, thanks for your interest in our work. The effective batch size is num_of_GPUs × batch_size × update_freq. Therefore, when you reduce the number of GPUs from 8 to 1, you also need to raise the batch size to 64 * 8 = 512 to match our setting. Feel free to let me know if this fixes the problem.

So an equivalent single-GPU command should be:

python main.py \
--model vit_tiny  --warmup_epochs 50 --epochs 300 \
--batch_size 512 --lr 2e-3 --update_freq 1 --use_amp true \
--initialize /path/to/weight_selection \
--data_path /path/to/data/ \
--data_set CIFAR100 \
--output_dir /path/to/results/
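
To spell out the arithmetic, here is a small sanity-check sketch (the function name is illustrative, not part of the repo):

    # Effective batch size per optimizer step.
    def effective_batch_size(num_gpus, batch_size, update_freq):
        return num_gpus * batch_size * update_freq

    assert effective_batch_size(8, 64, 1) == 512   # paper setting (8 GPUs)
    assert effective_batch_size(1, 512, 1) == 512  # equivalent single-GPU setting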
osiriszjq commented 2 months ago

Oh! Actually, I had already changed the batch size to 512; that was a typo in my first message. One problem is that when I run this command

python main.py \
--model vit_tiny  --warmup_epochs 50 --epochs 300 \
--batch_size 512 --lr 2e-3 --update_freq 1 --use_amp true \
--initialize /path/to/weight_selection \
--data_path /path/to/data/ \
--data_set CIFAR100 \
--output_dir /path/to/results/

after one epoch of training, it raises an error that says

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

So I just commented out line 132 in engine.py, which is

    metric_logger.synchronize_between_processes()

I guess that with a single process there is no need to synchronize? Do you have any other suggestions? In any case, the accuracy with random initialization is also about 4% lower than in your paper, so your method still gives roughly a 9% improvement. I just want to find a reliable training setup for ViTs on small datasets to compare against.
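
For reference, instead of commenting the line out entirely, I believe a guard like the following should also work for single-process runs (an untested sketch, assuming the repo's metric_logger and the standard torch.distributed API):

    # engine.py, around line 132: only synchronize when a distributed
    # process group has actually been initialized.
    import torch.distributed as dist

    if dist.is_available() and dist.is_initialized():
        metric_logger.synchronize_between_processes()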

OscarXZQ commented 2 months ago

Hi,

Assuming everything is correct on both ends, one explanation for the difference between 8 GPUs and 1 GPU is that the strength of mixup and cutmix differs, since mixup and cutmix are applied on a per-GPU basis. Using 1 GPU therefore increases the effective strength of mixup/cutmix, which might cause the performance drop.

If using 8 GPUs is not possible on your end, one trick could be to reduce the strength of mixup and cutmix, e.g. by adding flags like --mixup 0.3 --cutmix 0.3. I did not tune mixup and cutmix while developing this project (I just used the same settings as ConvNeXt), but it might help in your case.
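
For example, the full single-GPU command with the reduced augmentation strength would look like this (0.3 is just an untuned starting point, not a recommended value):

python main.py \
--model vit_tiny  --warmup_epochs 50 --epochs 300 \
--batch_size 512 --lr 2e-3 --update_freq 1 --use_amp true \
--mixup 0.3 --cutmix 0.3 \
--initialize /path/to/weight_selection \
--data_path /path/to/data/ \
--data_set CIFAR100 \
--output_dir /path/to/results/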

osiriszjq commented 2 months ago

Thanks! When I use 8 GPUs I can now get the same performance. I never thought this could make such a big difference.