Open stevehuanghe opened 3 years ago
Hi,
thanks for taking an interest in this work.
The training hyper-parameters for stam_16 are: batch size 64, AdamW optimizer with weight decay 1e-3, 100 epochs with a cosine annealing schedule and learning rate warm-up over the first 10 epochs, a base learning rate of 1e-5, and model EMA.
For stam_64, the settings are the same except for batch size 16 and learning rate 2.5e-6.
The models were trained on a single 8xV100 machine.
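For anyone who wants to reproduce the schedule, here is a minimal sketch of the settings above in plain Python. The numbers come from this reply. Everything else is an assumption rather than the repository's actual code: the names `CONFIGS` and `lr_at_epoch` are hypothetical, the warm-up is taken to be linear, the cosine curve is taken to decay to zero, and the learning rate is updated once per epoch.

```python
import math

# Values quoted in the reply; the dictionary layout and key names are illustrative.
CONFIGS = {
    "stam_16": {"batch_size": 64, "base_lr": 1e-5},
    "stam_64": {"batch_size": 16, "base_lr": 2.5e-6},
}
EPOCHS = 100
WARMUP_EPOCHS = 10
WEIGHT_DECAY = 1e-3  # passed to the AdamW optimizer


def lr_at_epoch(epoch, base_lr, epochs=EPOCHS, warmup_epochs=WARMUP_EPOCHS, min_lr=0.0):
    """Return the learning rate for a 0-indexed epoch.

    Assumptions (not confirmed by the reply): linear warm-up,
    cosine decay down to min_lr, one update per epoch.
    """
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up phase completed so far, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        schedule = [lr_at_epoch(e, cfg["base_lr"]) for e in range(EPOCHS)]
        print(name, "peak lr:", max(schedule), "final lr:", schedule[-1])
```

In a PyTorch training loop, the value returned by `lr_at_epoch` could be written into each `param_group["lr"]` of a `torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=1e-3)` optimizer at the start of every epoch. Note that the two configurations keep the ratio of learning rate to batch size constant (1e-5 / 64 equals 2.5e-6 / 16).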
Hope you find this useful.
Hello,
This work is really inspiring, and thanks for sharing the code. Could you please also share the training hyper-parameters (e.g., learning rate, optimizer, warmup lr, warmup epochs, etc.)? I would really like to train the model myself to get a deeper understanding of it.
Thanks, Steve