apple / ml-cvnets

CVNets: A library for training computer vision networks
https://apple.github.io/ml-cvnets

too sensitive to hyperparameters? #20

Open ck-amrahd opened 2 years ago

ck-amrahd commented 2 years ago

Hi, I trained your code on ImageNet-1k from scratch with your config file (mobilevit-small), with only one change: a new batch size of 32/GPU, giving an effective batch size of 32*4. I get a top-1 accuracy of 74.23 on the ImageNet validation set. I suppose this also changes the warmup iterations, which I didn't adjust. However, I got the exact reported accuracy for MobileNetV2 with the same change [just changing the batch size, without changing the warmup iterations]. Is it that transformers are more susceptible to hyperparameter changes, or did you just not notice such issues? Thank you.

sacmehta commented 2 years ago

Hi,

Thanks for your interest in our work.

Both CNNs and Transformers are sensitive to batch size. The sensitivity of CNNs to batch size has been reported in many previous works; see, for instance, the Group Normalization paper.

We did not try MobileViT with smaller batch sizes. We used standard settings (effective batch size of 1024 images and 300 epochs) for training MobileViT networks.

An effective batch size of 1024 can be achieved with your current setup (32 images per GPU × 4 GPUs) by accumulating gradients over 8 batches, since 32 × 4 × 8 = 1024:

To enable gradient accumulation with a frequency of 8, you can use --common.accum-freq 8

I would encourage you to train MobileViT with the gradient accumulation option.
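
In case it helps, gradient accumulation here just means summing gradients over several small batches and stepping the optimizer once, so 32 images/GPU × 4 GPUs × 8 accumulation steps behaves like a single 1024-image batch for the optimizer. A minimal generic PyTorch sketch of the idea (an illustration only, not CVNets' internal implementation; model, train_loader, criterion, and optimizer are placeholders):

    def train_one_epoch(model, train_loader, criterion, optimizer, accum_freq=8):
        # accum_freq=8 mirrors --common.accum-freq 8: gradients from 8 consecutive
        # mini-batches are summed before a single optimizer step.
        model.train()
        optimizer.zero_grad()
        for step, (images, targets) in enumerate(train_loader):
            loss = criterion(model(images), targets)
            # Scale the loss so the accumulated gradient matches the mean
            # gradient of the larger effective batch.
            (loss / accum_freq).backward()
            if (step + 1) % accum_freq == 0:
                optimizer.step()
                optimizer.zero_grad()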

ck-amrahd commented 2 years ago

Hi @sacmehta, thank you for the reply. I will run another experiment with your suggestion. Also, I suppose I will have to increase the warmup iterations as well. What value do you suggest for that? Thank you.

ck-amrahd commented 2 years ago

Hi @sacmehta, I ran the training with --common.accum-freq 8 and the results are still similar. I got around 74.9% top-1 accuracy, which is far lower than what is reported in the paper. I am using the following parameters:

    image_augmentation:
      random_resized_crop:
        enable: true
        interpolation: "bilinear"
      random_horizontal_flip:
        enable: true
    sampler:
      name: "variable_batch_sampler"
      vbs:
        crop_size_width: 256
        crop_size_height: 256
        max_n_scales: 5
        min_crop_size_width: 160
        max_crop_size_width: 320
        min_crop_size_height: 160
        max_crop_size_height: 320
        check_scale: 32
    loss:
      category: "classification"
      classification:
        name: "label_smoothing"
        label_smoothing_factor: 0.1
    optim:
      name: "adamw"
      weight_decay: 0.01
      no_decay_bn_filter_bias: false
      adamw:
        beta1: 0.9
        beta2: 0.999
    scheduler:
      name: "cosine"
      is_iteration_based: false
      max_epochs: 300
      warmup_iterations: 25000
      warmup_init_lr: 0.0002
      cosine:
        max_lr: 0.002
        min_lr: 0.0002
    model:
      classification:
        name: "mobilevit"
        classifier_dropout: 0.1
        n_classes: 1000
        mit:
          mode: "small"
          ffn_dropout: 0.0
          attn_dropout: 0.0
          dropout: 0.1
          number_heads: 4
          no_fuse_local_global_features: false
          conv_kernel_size: 3
        activation:
          name: "swish"
      normalization:
        name: "batch_norm_2d"
        momentum: 0.1
      activation:
        name: "swish"
      layer:
        global_pool: "mean"
        conv_init: "kaiming_normal"
        linear_init: "trunc_normal"
        linear_init_std_dev: 0.02
    ema:
      enable: false
      momentum: 0.0005
    ddp:
      enable: true
      rank: 0
      world_size: -1
      dist_port: 30786
    stats:
      name: [ "loss", "top1", "top5" ]
      checkpoint_metric: "top1"
      checkpoint_metric_max: true

I don't want to enable ema because that slows down training. Do you think I need to change something?
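
For reference, my reading of the scheduler block above is: the learning rate ramps linearly from warmup_init_lr (0.0002) to max_lr (0.002) over warmup_iterations updates, then decays along a cosine curve to min_lr (0.0002) for the rest of training. A rough sketch of that behaviour (my own interpretation, not CVNets' scheduler code; total_steps stands for the total number of scheduler updates):

    import math

    def lr_at(step, total_steps, warmup_iters=25000,
              warmup_init_lr=2e-4, max_lr=2e-3, min_lr=2e-4):
        # Linear warmup followed by cosine decay, using the values from the
        # scheduler section of the config above.
        if step < warmup_iters:
            return warmup_init_lr + (max_lr - warmup_init_lr) * step / warmup_iters
        progress = (step - warmup_iters) / max(1, total_steps - warmup_iters)
        return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))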

sacmehta commented 2 years ago

Which version of the cvnets library are you using? If you are using the latest release, then I would suggest first trying the cvnets_v0.1 library.

Also, I see that you are using 25000 warm-up updates, but in our paper, we used 3000.

If you are using the current version, then make sure that:

ck-amrahd commented 2 years ago

Hi @sacmehta, I am using the old version with its default parameters. I used 25,000 warmup iterations because, in your setup, the learning-rate scheduler has seen roughly 1024 * 3000 images by the time it reaches the maximum learning rate, and I scaled that to my effective batch size of 128 so that my model sees the same number of images when the learning rate peaks [train_iteration increases once per batch in your code].
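
Concretely, the scaling I applied keeps the number of images seen during warmup constant, i.e. new_warmup = ref_warmup * ref_batch / new_batch. A tiny helper showing the arithmetic (names are mine, just for illustration):

    def scaled_warmup_iters(ref_warmup, ref_batch, new_batch):
        # Keep ref_warmup * ref_batch == new_warmup * new_batch, i.e. the same
        # number of training images are seen before the learning rate peaks.
        return round(ref_warmup * ref_batch / new_batch)

    print(scaled_warmup_iters(3000, 1024, 128))  # -> 24000, which I rounded up to 25000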

sacmehta commented 2 years ago

Hi @ck-amrahd, thanks for confirming that. After you add --common.accum-freq 8, you do not need to scale the warm-up iterations, because the effective batch size is 1024.

If you instead set --common.accum-freq 1, then you need to scale both the warm-up iterations and the learning rate.

Could you please provide the log file that includes the dataset details, model details, scheduler, etc., along with the logs up to 10 epochs?
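
To be explicit about the scaling mentioned above: a common heuristic is the linear scaling rule, where the learning rate scales in proportion to the effective batch size (this is a general rule of thumb, not something specific to CVNets):

    def scaled_lr(ref_lr, ref_batch, new_batch):
        # Linear-scaling heuristic: the learning rate shrinks or grows in
        # proportion to the effective batch size.
        return ref_lr * new_batch / ref_batch

    print(scaled_lr(0.002, 1024, 128))  # -> 0.00025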

ck-amrahd commented 2 years ago

Thank you @sacmehta. Sure, I can provide the log details. I will try again without changing the warmup iterations and with an accumulation frequency of 8.

sacmehta commented 2 years ago

@ck-amrahd You can share your logs from previous runs too.

ck-amrahd commented 2 years ago

Thank you @sacmehta. Please find the logs in the attached file: tb_logs.zip

sacmehta commented 2 years ago

Could you please also share the model, optimizer, sampler, dataset, etc. logs that are printed at the beginning of training?

ck-amrahd commented 2 years ago

@sacmehta I didn't save them. I will save them from the next run and upload them. Thank you.

ck-amrahd commented 2 years ago

Hi @sacmehta, here's the log file for a few runs. This time it didn't converge (the loss went to NaN): log.zip
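
For debugging runs like this, a small guard that stops training as soon as the loss stops being finite can help pinpoint the exact step of the blow-up. A generic PyTorch snippet (not CVNets code; loss is assumed to be a scalar tensor):

    import torch

    def check_finite(loss, step):
        # Fail fast with the step index instead of continuing to train with
        # NaN/Inf gradients.
        if not torch.isfinite(loss):
            raise RuntimeError(f"Loss became non-finite ({loss.item()}) at step {step}")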

sacmehta commented 2 years ago

It seems that the LR is too high. Could you try 30,000 warm-up iterations instead of 3,000?

Also, it would be great if you could first try to reproduce the MobileViT-XXS model. With that model, you should be able to use a batch size of 128 (instead of 32).
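
Assuming the target effective batch size stays at 1024 as above, a per-GPU batch of 128 on 4 GPUs would only need an accumulation frequency of 2. The arithmetic, as a small helper (my own convenience function, not part of CVNets):

    import math

    def accum_freq_for(target_effective_batch, per_gpu_batch, num_gpus):
        # Smallest accumulation frequency that reaches the target effective batch size.
        return math.ceil(target_effective_batch / (per_gpu_batch * num_gpus))

    print(accum_freq_for(1024, 32, 4))   # -> 8 (the setting suggested earlier)
    print(accum_freq_for(1024, 128, 4))  # -> 2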

ck-amrahd commented 2 years ago

Hi @sacmehta, that's what I did in the earlier experiment (the comment above with 25k warmup iterations), and at that point you suggested I try 3,000 warmup iterations instead.