xlzhou01 opened this issue 1 month ago
It looks like you're encountering a compatibility issue with bfloat16 on your current CUDA device. As the error suggests, switching the data type to float16 should resolve this problem. If you're using Mamba, make sure your environment is set up to support the necessary CUDA features. If you have further questions or need assistance, feel free to reach out!
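For context, switching the dtype in a standard PyTorch Lightning setup comes down to the `precision` flag on the `Trainer`. Here is a minimal sketch; the model and datamodule names are placeholders, not SPMamba's actual entry point:

```python
from pytorch_lightning import Trainer

# "16-mixed" uses float16 autocast, which works on GPUs without bfloat16
# support; "bf16-mixed" requires newer (Ampere-class or later) hardware.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    precision="16-mixed",
)
# trainer.fit(model, datamodule=datamodule)  # placeholders for the actual objects
```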
I set it to float16 (precision="16-mixed"), and then the following error occurred:
Monitored metric val_loss/dataloader_idx_0 = nan is not finite. Previous best value was inf. Signaling Trainer to stop.
Epoch 0, global step 13900: 'val_loss/dataloader_idx_0' reached inf (best inf), saving model to '/data/SPMamba/Experiments/checkpoint/SPMamba-Libri2Mix/epoch=0.ckpt' as top 5
I am using the noisy Libri2Mix subset, and I tried again:
I also tried running the clean subset of Libri2Mix later and ran into the same issue. Could it be related to the change in precision?
You might need to adjust the value of 'eps' used in the paper to match the precision you are working with. When using float16 (precision='16-mixed'), the limited numerical range can sometimes lead to instability, such as NaNs or Infs in the loss. Consider increasing 'eps' slightly to maintain numerical stability during training.
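To illustrate where 'eps' enters, here is a generic SI-SNR loss sketch (the exact loss used in SPMamba may differ; this is only meant to show why float16 needs a larger constant):

```python
import torch

def si_snr_loss(est, ref, eps=1e-4):
    """Negative SI-SNR over the last (time) dimension.

    eps=1e-8 is a common default for float32, but under float16 such a small
    constant underflows to zero, so a larger value (e.g. 1e-4) is safer.
    """
    # Zero-mean both signals.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference.
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    energy = torch.sum(ref ** 2, dim=-1, keepdim=True) + eps
    target = dot / energy * ref
    noise = est - target
    # Ratio of target energy to residual energy, stabilized by eps.
    ratio = (torch.sum(target ** 2, dim=-1) + eps) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -10.0 * torch.log10(ratio)
```

Independently of 'eps', another common workaround is to cast the estimates and references to float32 inside the loss (`est.float()`, `ref.float()`), so the sums and division are done at full precision even when the rest of the model runs in float16.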
File "/home/.conda/envs/spmba/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 234, in init raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.') RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.