xlzhou01 opened this issue 1 month ago
It looks like you're encountering a compatibility issue with bfloat16 on your current CUDA device. As the error suggests, switching the data type to float16 should resolve this problem. If you're using Mamba, make sure your environment is set up to support the necessary CUDA features. If you have further questions or need assistance, feel free to reach out!
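For context, switching the dtype in a standard PyTorch Lightning setup comes down to the `precision` flag on the `Trainer`. Here is a minimal sketch; the model and datamodule names are placeholders, not SPMamba's actual entry point:

```python
from pytorch_lightning import Trainer

# "16-mixed" uses float16 autocast, which works on GPUs without bfloat16
# support; "bf16-mixed" requires newer (Ampere-class or later) hardware.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    precision="16-mixed",
)
# trainer.fit(model, datamodule=datamodule)  # placeholders for the actual objects
```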
I set it to float16 (precision="16-mixed"), and then the following error occurred:
Monitored metric val_loss/dataloader_idx_0 = nan is not finite. Previous best value was inf. Signaling Trainer to stop.
Epoch 0, global step 13900: 'val_loss/dataloader_idx_0' reached inf (best inf), saving model to '/data/SPMamba/Experiments/checkpoint/SPMamba-Libri2Mix/epoch=0.ckpt' as top 5
I am using the noisy Libri2Mix subset, and I tried again:
I also tried running the clean subset of Libri2Mix later and ran into the same issue. Could it be related to the change in precision?
You might need to adjust the value of 'eps' used in the paper to match the precision you are working with. When using float16 (precision='16-mixed'), the limited numerical range can sometimes lead to instability, such as NaNs or Infs in the loss. Consider increasing 'eps' slightly to maintain numerical stability during training.
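To illustrate where 'eps' enters, here is a generic SI-SNR loss sketch (the exact loss used in SPMamba may differ; this is only meant to show why float16 needs a larger constant):

```python
import torch

def si_snr_loss(est, ref, eps=1e-4):
    """Negative SI-SNR over the last (time) dimension.

    eps=1e-8 is a common default for float32, but under float16 such a small
    constant underflows to zero, so a larger value (e.g. 1e-4) is safer.
    """
    # Zero-mean both signals.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference.
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    energy = torch.sum(ref ** 2, dim=-1, keepdim=True) + eps
    target = dot / energy * ref
    noise = est - target
    # Ratio of target energy to residual energy, stabilized by eps.
    ratio = (torch.sum(target ** 2, dim=-1) + eps) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -10.0 * torch.log10(ratio)
```

Independently of 'eps', another common workaround is to cast the estimates and references to float32 inside the loss (`est.float()`, `ref.float()`), so the sums and division are done at full precision even when the rest of the model runs in float16.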
File "/home/.conda/envs/spmba/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 234, in init raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.') RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.