BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it combines the best of RNN and transformer: great performance, fast inference, VRAM savings, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0

How to solve the error "CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`" reported while using RWKV #253


songjie1121 commented 2 months ago

I'm planning to apply RWKV in my ASR model, but once I use the RWKV module it raises this error, and only after the program has been training for some time. Is there any related idea or solution? The RWKV code itself is Ali's earlier source code, used without modification; it is simply called as a drop-in replacement for the attention module, roughly as in the sketch below.
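
A minimal sketch of what "replacing attention with RWKV" might look like. The import path and the `args`/`layer_id` constructor signature are assumptions based on the RWKV-LM training code (`RWKV_Tmix_x060_state` is mentioned later in this thread); adapt them to the actual source in use:

```python
import torch
import torch.nn as nn

# Hypothetical import: the real module path depends on which RWKV-LM
# release the code was copied from.
from rwkv_model import RWKV_Tmix_x060_state  # assumption, not a pip package

class ASREncoderLayer(nn.Module):
    """Encoder layer where self-attention is swapped for an RWKV time-mix block."""
    def __init__(self, d_model, layer_id, args):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # RWKV time-mix stands in for multi-head self-attention over the sequence.
        self.time_mix = RWKV_Tmix_x060_state(args, layer_id)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        x = x + self.time_mix(self.ln1(x))  # replaces the attention sublayer
        x = x + self.ffn(self.ln2(x))
        return x
```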

BlinkDL commented 2 months ago

please try updating your PyTorch and CUDA to the latest versions :) also it seems your GPU might be unstable. you can reboot the node
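
A quick sanity check along those lines: print the toolchain versions and stress the fp32 GEMM path named in the error in isolation. If this standalone loop also crashes with CUBLAS_STATUS_EXECUTION_FAILED, the driver/toolkit install or the GPU itself is the likely culprit rather than the model code (my reading of the advice above, not an official diagnostic):

```python
import torch

# Versions BlinkDL suggests updating.
print("torch:", torch.__version__, "| cuda:", torch.version.cuda,
      "| device:", torch.cuda.get_device_name(0))

# Repeated large fp32 matmuls; on most setups these dispatch to cublasSgemm,
# the call that fails in the reported error.
x = torch.randn(4096, 4096, device="cuda", dtype=torch.float32)
for _ in range(100):
    y = x @ x
    torch.cuda.synchronize()  # surface asynchronous CUDA errors immediately
print("SGEMM stress test passed")
```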

songjie1121 commented 2 months ago

Thank you very much for your reply. I have replaced the attention with the RWKV_Tmix_x060_state from the official RWKV-6 code. However, strangely, the validation loss suddenly increases during training. Attached are the loss curve and the RWKV_Tmix_x060_state configuration used. In addition, I found that if the program is interrupted during training and resumed from a checkpoint, the memory usage doubles. Could the author suggest possible solutions to these problems? Looking forward to your answer very much. Thank you!!

[WeChat screenshots: loss curve; RWKV_Tmix_x060_state configuration]
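
On the doubled memory usage when resuming: one common cause (an assumption on my part, not confirmed in this thread) is loading a GPU-saved checkpoint with a plain `torch.load`, which materializes the checkpoint tensors on the GPU alongside the freshly built model. Loading to CPU first avoids holding two GPU copies; the checkpoint keys and variable names below are illustrative:

```python
import torch

# Sketch of a resume path that avoids a second GPU copy of the weights.
# `model`, `optimizer`, and the checkpoint layout are assumptions; adapt
# them to the trainer actually in use.
ckpt = torch.load("checkpoint.pt", map_location="cpu")  # keep tensors on CPU
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
model.cuda()                # move weights to the GPU once, after loading
del ckpt                    # drop the CPU copy
torch.cuda.empty_cache()    # release any cached allocations
```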