Open · StepNeverStop opened this issue 1 month ago
Hello, thank you for this detailed report.
That's very useful to know; I will update the repo with the possibility of using a manually defined softplus function.
On CUDA devices I didn't have any problem training mamba.py models, but yes, as you said, this could differ on other devices, and one of the goals of this repo is to allow Mamba training on non-CUDA devices, so it's kind of a big deal.
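A rough sketch of how such an option could look, purely as an illustration; the MambaSoftplusConfig, softplus_fn, and compute_delta names are placeholders, not the repo's actual API:

```python
from dataclasses import dataclass
from typing import Callable

import torch
import torch.nn.functional as F


@dataclass
class MambaSoftplusConfig:
    # Hypothetical config entry: any callable with a softplus-like signature can
    # be plugged in, e.g. a hand-written version for devices where the built-in
    # one misbehaves.
    softplus_fn: Callable[[torch.Tensor], torch.Tensor] = F.softplus


def compute_delta(dt: torch.Tensor, cfg: MambaSoftplusConfig) -> torch.Tensor:
    # The selective-scan discretization applies softplus so that delta stays
    # strictly positive.
    return cfg.softplus_fn(dt)


cfg = MambaSoftplusConfig()            # default: torch.nn.functional.softplus
delta = compute_delta(torch.randn(8), cfg)
```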
When I was doing experiments, I ran into the loss becoming NaN. The same problem has also been reported for mamba-ssm. After debugging step by step, I found that it was mainly caused by an incorrect result when the softplus operation is executed on some devices (macOS 14.4.1 with an M3 Pro in my case).
As we all know, softplus should output a value strictly greater than 0, but in my experiments it can actually output a negative value, at this line. The error is very sporadic; it may be alleviated by adjusting the learning rate, initialization, and so on, but the incorrect computation on some devices means the problem cannot be eradicated that way. The outputs I printed show softplus returning negative values.
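A minimal check along these lines (a sketch rather than the exact snippet; the input range is illustrative, and the negative outputs only show up sporadically on the affected backends):

```python
import torch
import torch.nn.functional as F

# Use the Apple MPS backend when available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# Illustrative inputs spanning large negative to large positive values.
x = (torch.rand(1_000_000, device=device) - 0.5) * 100.0

y = F.softplus(x)

# Mathematically softplus(x) = log(1 + exp(x)) > 0 for every finite x, so for
# inputs in this range (well away from underflow) any non-positive output
# points to a numerical problem on this device.
print("min softplus output:", y.min().item())
print("non-positive outputs:", (y <= 0).sum().item())
```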
This may be due to differences in how PyTorch computes on different devices, or it may be a bug in the backend itself. There is currently no perfect solution, unless you manually rewrite the softplus operation. At least it can be guaranteed that this problem does not occur when running experiments under Linux with CUDA.
Writing a softplus function by hand and trying it out, the results look much more normal.
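A hand-written softplus in this spirit might look like the following sketch, using the standard stable log1p(exp(x)) form; the threshold of 20 mirrors PyTorch's default and is an assumption here:

```python
import torch


def manual_softplus(x: torch.Tensor, beta: float = 1.0, threshold: float = 20.0) -> torch.Tensor:
    """Hand-written softplus: (1 / beta) * log(1 + exp(beta * x)).

    Inputs with beta * x above `threshold` are returned as-is, since softplus
    is numerically indistinguishable from the identity there; the clamp keeps
    exp() from overflowing inside the other branch.
    """
    scaled = beta * x
    stable = torch.log1p(torch.exp(torch.clamp(scaled, max=threshold))) / beta
    return torch.where(scaled > threshold, x, stable)
```

Replacing the calls to torch.nn.functional.softplus in the model (for example where delta is computed) with a function like this is enough to check whether the NaN losses disappear.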