Hi! You can find the log here.
For the unstable training, there are a few things you can try. One is to make the optimizer more conservative:
--opt-eps 1e-6 --opt-betas 0.9 0.98
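These flags typically map straight onto the AdamW constructor through timm's optimizer factory. A minimal sketch of the equivalent setup (plain torch.optim.AdamW; the lr and weight_decay values below are placeholders, not the ones from the training script):

```python
import torch

model = torch.nn.Linear(192, 1000)  # placeholder for the actual model

# Equivalent of --opt-eps 1e-6 --opt-betas 0.9 0.98.
# A larger eps keeps the denominator sqrt(v) + eps away from zero,
# and a smaller beta2 makes the second-moment estimate react faster
# to variance spikes; both tend to damp the steps that cause loss blow-ups.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,            # placeholder
    eps=1e-6,           # default is 1e-8
    betas=(0.9, 0.98),  # default beta2 is 0.999
    weight_decay=0.05,  # placeholder
)
```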
Hi,
Can you tell me the exact hyperparameters you used to produce the performance of the given checkpoint? I am training with the same configuration on 8 GPUs for consistency, to reproduce the results.
Yes, but Mamba training is unstable and the loss may also go to NaN in my experiments. When I encounter this problem, I often reload the best checkpoint and change the random seed.
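Roughly, the recovery step looks like this (a minimal sketch; the checkpoint key names are assumptions and should be matched to the actual training script):

```python
import torch

def resume_with_new_seed(model, optimizer, ckpt_path, new_seed):
    # Reload the last good checkpoint; the 'model'/'optimizer'/'epoch'
    # keys are assumed, adjust them to your script's checkpoint format.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])

    # A different seed changes the data order and stochastic ops
    # (dropout, augmentation), so training takes a different path
    # past the point where the loss diverged.
    torch.manual_seed(new_seed)
    torch.cuda.manual_seed_all(new_seed)
    return ckpt.get("epoch", 0) + 1  # epoch to resume from
```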
I see. Alright, I will do as you suggested and update here with the results.
I have tried all the strategies I mentioned before. The first and third lead to similar results in the end, while smaller learning rates may be more helpful for larger models.
One last quick question: to train in fp32, do I just remove the --bf16 argument, or is there something else as well?
Yes, just remove --bf16, but also add --no_amp.
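Presumably these two flags gate mixed precision in the usual DeiT-style way; a minimal sketch of what the fp32 setting corresponds to (the loop wiring is an assumption, not this repo's exact code):

```python
import torch
from contextlib import nullcontext

bf16 = False   # --bf16 removed
no_amp = True  # --no_amp added

# With --no_amp, the forward/backward pass runs outside autocast,
# i.e. entirely in fp32; with --bf16 it would run under bfloat16 autocast.
amp_ctx = (
    torch.autocast("cuda", dtype=torch.bfloat16)
    if (bf16 and not no_amp)
    else nullcontext()
)

# inside the training loop:
# with amp_ctx:
#     loss = criterion(model(images), targets)
```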
Thank you. I really appreciate your fast response.
Can you tell me what learning rates you used for the other variants? Or are they the same as in the training script?
Yes, it is the same as in my training script. But on some GPUs it may lead to unstable training, and you will need to change the hyperparameters :)
Hi,
I was trying to reproduce the ImageNet results for the tiny variant. I am following the same training script as yours, but the loss diverges after a few epochs. Here is the relevant section of the log file:
{"train_lr": 0.007261435543935553, "train_loss": 4.564484205765602, "test_loss": 1.9566744327545167, "test_acc1": 54.23800136169434, "test_acc5": 79.6760025579834, "epoch": 60, "n_parameters": 7148008} {"train_lr": 0.007237022892527909, "train_loss": 4.546962348792033, "test_loss": 1.8673380441963672, "test_acc1": 56.38000125915527, "test_acc5": 80.9260028843689, "epoch": 61, "n_parameters": 7148008} {"train_lr": 0.007212255813388488, "train_loss": 6.306850623339415, "test_loss": 6.906790274381637, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 62, "n_parameters": 7148008} {"train_lr": 0.007187137022506653, "train_loss": 6.908678848200884, "test_loss": 6.906302499771118, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 63, "n_parameters": 7148008} {"train_lr": 0.007161669274440874, "train_loss": 6.908334856996169, "test_loss": 6.910482823848724, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 64, "n_parameters": 7148008} {"train_lr": 0.00713585536201674, "train_loss": 6.908083065006977, "test_loss": 6.909442615509033, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 65, "n_parameters": 7148008} {"train_lr": 0.007109698116020586, "train_loss": 6.907925617236358, "test_loss": 6.907901990413666, "test_acc1": 0.10000000186920166, "test_acc5": 0.5000000067520142, "epoch": 66, "n_parameters": 7148008} {"train_lr": 0.0070832004048892935, "train_loss": 6.907739882667859, "test_loss": 6.908897644281387, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 67, "n_parameters": 7148008} {"train_lr": 0.0070563651343954125, "train_loss": 6.9076923833061485, "test_loss": 6.908295905590057, "test_acc1": 0.10000000122070313, "test_acc5": 0.5000000061035156, "epoch": 68, "n_parameters": 7148008}
Did you notice something similar? Could you provide your log file for the training of the tiny variant, for comparison?
Kind regards,