OpenGVLab / VideoMamba

VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0

ImageNet training diverges after some epochs #34

Closed TalalWasim closed 1 month ago

TalalWasim commented 2 months ago

Hi,

I was trying to reproduce the ImageNet results for the tiny variant. I am following the same training script as yours, but the loss diverges after a few epochs. Here is the relevant section of the log file:

{"train_lr": 0.007261435543935553, "train_loss": 4.564484205765602, "test_loss": 1.9566744327545167, "test_acc1": 54.23800136169434, "test_acc5": 79.6760025579834, "epoch": 60, "n_parameters": 7148008} {"train_lr": 0.007237022892527909, "train_loss": 4.546962348792033, "test_loss": 1.8673380441963672, "test_acc1": 56.38000125915527, "test_acc5": 80.9260028843689, "epoch": 61, "n_parameters": 7148008} {"train_lr": 0.007212255813388488, "train_loss": 6.306850623339415, "test_loss": 6.906790274381637, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 62, "n_parameters": 7148008} {"train_lr": 0.007187137022506653, "train_loss": 6.908678848200884, "test_loss": 6.906302499771118, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 63, "n_parameters": 7148008} {"train_lr": 0.007161669274440874, "train_loss": 6.908334856996169, "test_loss": 6.910482823848724, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 64, "n_parameters": 7148008} {"train_lr": 0.00713585536201674, "train_loss": 6.908083065006977, "test_loss": 6.909442615509033, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 65, "n_parameters": 7148008} {"train_lr": 0.007109698116020586, "train_loss": 6.907925617236358, "test_loss": 6.907901990413666, "test_acc1": 0.10000000186920166, "test_acc5": 0.5000000067520142, "epoch": 66, "n_parameters": 7148008} {"train_lr": 0.0070832004048892935, "train_loss": 6.907739882667859, "test_loss": 6.908897644281387, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 67, "n_parameters": 7148008} {"train_lr": 0.0070563651343954125, "train_loss": 6.9076923833061485, "test_loss": 6.908295905590057, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 68, "n_parameters": 7148008}

Did you notice something similar? Can you provide your log file for the training of the tiny variant for comparison?

Kind regards,

Andy1621 commented 2 months ago

Hi! You can find the log here.

For the unstable training, you can try to handle it in a few ways (see the command sketch after the list):

  1. Use float32 instead of bfloat16;
  2. Use a smaller learning rate;
  3. Adjust the optimizer hyperparameters, like --opt-eps 1e-6 --opt-betas 0.9 0.98.
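
A minimal sketch of what options 2 and 3 could look like on the command line, assuming an 8-GPU torchrun launch of a DeiT-style main.py; only --opt-eps and --opt-betas come from this thread, while the script name, model name, and the --lr value are placeholders to check against the repo's training script:

    # Hedged example combining options 2 and 3 above. Only --opt-eps and
    # --opt-betas come from the suggestion; the launcher, script name, model
    # name, and --lr value are assumptions, not the repo's exact interface.
    torchrun --nproc_per_node=8 main.py \
        --model videomamba_tiny \
        --lr 5e-3 \
        --opt-eps 1e-6 \
        --opt-betas 0.9 0.98
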
TalalWasim commented 2 months ago

Hi,

Can you tell me the exact hyperparameters you used to produce the performance of the given checkpoint? I am training on the same configuration with 8 GPUs for consistency, to reproduce the results.

Andy1621 commented 2 months ago

Yes, but Mamba is unstable and the loss sometimes became NaN in my experiments as well. When I encounter the problem, I often reload the best checkpoint and change the random seed, as sketched below.
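
In practice this just means restarting the same command from the best checkpoint with a different seed; a rough sketch, assuming DeiT-style --resume and --seed arguments (the flag names, script name, and checkpoint path are assumptions to verify against the script's argparse):

    # Hedged example: resume from the best checkpoint with a different random seed.
    # --resume, --seed, and the checkpoint path are assumed, not confirmed by this thread.
    torchrun --nproc_per_node=8 main.py \
        --model videomamba_tiny \
        --resume output/checkpoint_best.pth \
        --seed 1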

TalalWasim commented 2 months ago

I see. Alright, I will do as you suggested and update here with the results.

Andy1621 commented 2 months ago

I have tried all the strategies I mentioned before. The first and third options lead to similar final results, while a smaller learning rate may be more helpful for larger models.

TalalWasim commented 2 months ago

One last quick question: to enable fp32 training, do I just remove the --bf16 argument, or is there something else as well?

Andy1621 commented 2 months ago

Yes, just remove --bf16 and also add --no_amp.
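
For example, assuming a torchrun launch of a DeiT-style main.py (only --bf16 and --no_amp come from this thread; everything else is a placeholder):

    # bf16 mixed-precision run (the original setup)
    torchrun --nproc_per_node=8 main.py --model videomamba_tiny --bf16
    # full fp32 run: drop --bf16 and disable AMP entirely
    torchrun --nproc_per_node=8 main.py --model videomamba_tiny --no_amp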

TalalWasim commented 2 months ago

Thank you. I really appreciate your fast response.

TalalWasim commented 1 month ago

Can you tell me what learning rates you used for the other variants? Or are they the same as the one in the training script?

Andy1621 commented 1 month ago

Yes, they are the same as in my training script. But on some GPUs it may lead to unstable training, and you need to adjust the hyperparameters :)