Open songjie1121 opened 2 months ago
Please try updating your PyTorch and CUDA to the latest versions :) It also looks like your GPU might be unstable; try rebooting the node.
Thank you very much for your reply. I replaced the attention module with RWKV_Tmix_x060_state from the official RWKV-6 repository. Strangely, the validation loss suddenly spikes during training; attached are the loss curve and the RWKV_Tmix_x060_state configuration I used. In addition, I found that if training is interrupted and resumed from a checkpoint, memory usage doubles. Could the author suggest possible solutions to these problems? Looking forward to your answer. Thank you!!
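One common cause of doubled GPU memory on resume is `torch.load` restoring the checkpoint tensors directly onto the GPU, so they coexist with the already-allocated model parameters. A minimal sketch of a resume helper that avoids this, assuming a standard checkpoint dict with `"model"` and `"optimizer"` keys (the function name and checkpoint layout here are hypothetical, not from the RWKV code):

```python
import torch


def resume(model, optimizer, path):
    # Load checkpoint tensors onto the CPU first; load_state_dict then
    # copies them into the existing GPU parameters in place, instead of
    # materializing a second full copy of the weights on the GPU.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    del ckpt  # drop the host-side copy as soon as the states are restored
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return model, optimizer
```

If the doubling persists even with CPU-side loading, it may instead come from the optimizer state being re-created on top of the restored one, which is worth checking in the training script.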
I'm planning to apply RWKV in my ASR model, but once I use RWKV's module it produces this error, and only after the program has been training for some time. Is there any related idea or solution? The RWKV code is Ali's original source, used without modification, and is simply called in place of the attention module.
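For reference, swapping attention for RWKV usually means replacing only the token-mixing submodule of each encoder layer, keeping the residual and feed-forward structure intact. A minimal sketch of such a pluggable layer, assuming the mixer maps `(B, T, C)` tensors to `(B, T, C)` (the class and argument names are illustrative, not from the RWKV or ASR codebase):

```python
import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    """Pre-norm encoder layer with a pluggable token mixer: pass in
    self-attention or an RWKV time-mix block (e.g. RWKV_Tmix_x060_state),
    as long as it maps (B, T, C) -> (B, T, C)."""

    def __init__(self, dim, mixer):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.mixer = mixer  # attention or RWKV time-mix, interchangeable
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        x = x + self.mixer(self.ln1(x))  # token mixing across time steps
        x = x + self.ffn(self.ln2(x))    # channel mixing per position
        return x
```

If the error appears only after some training time, it is also worth confirming that the RWKV block's custom CUDA kernel is compiled for the same dtype and sequence-length limits the ASR pipeline actually feeds it.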