Another problem occurs only on specific datasets, such as solar, exchange, and electricity (the code runs normally on the m4 and traffic datasets). The command is "python bin/train_model.py -c configs/train_tsdiff/train_solar.yaml", and the error message is confusing:
Traceback (most recent call last):
  File "bin/train_model.py", line 278, in ...
cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())
Did I make a mistake in the settings? This problem occurs when the epoch count reaches around 49, so I suspect something is wrong with the sampling.
In bin/train_model.py, line 214, in the pl.Trainer() setup: maybe "devices=1" should be changed to something like "devices=[int(config["device"][-1])]" to select the right GPU id.
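A minimal sketch of the proposed change, assuming `config` is the loaded YAML config and `config["device"]` is a device string such as "cuda:1" (both assumptions; the exact Trainer arguments in the repository may differ):

```python
import pytorch_lightning as pl

# Hypothetical stand-in for the YAML config loaded by train_model.py
config = {"device": "cuda:1"}

# Current: devices=1 requests one GPU, which defaults to GPU 0
# trainer = pl.Trainer(accelerator="gpu", devices=1)

# Proposed: parse the GPU index out of the device string so training runs
# on the GPU named in the config (assumes a single-digit index like "cuda:1")
gpu_id = int(config["device"][-1])
trainer = pl.Trainer(accelerator="gpu", devices=[gpu_id])
```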
You're right. We always tested on a machine with a single GPU, so this got overlooked. Thanks!
> Did I make a mistake in the settings? This problem occurs when the epoch count reaches around 49, so I suspect something is wrong with the sampling.
I am actually not sure about this; I did not face such an issue. Do you have an MWE (minimal working example) by any chance? If not, I will try to start a training job on my end.
cc @marcelkollovieh
@zzkkzz can you share the exact command/config that you're running?
I ran this job and it works for me.
python bin/train_model.py -c configs/train_tsdiff/train_solar.yaml
Output:
Epoch 445/999 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/-- 0:00:04 • -:--:-- 22.72it/s
Thank you for your reply! My second problem was about PyTorch 1.13.1 and the anaconda environment; I solved it by installing PyTorch 1.12.1 separately. The problem arises because installing PyTorch 1.13.1 automatically downloads nvidia_cublas_cu11, nvidia_cuda_nvrtc_cu11, nvidia_cuda_runtime_cu11, and nvidia_cudnn_cu11, which conflict with the CUDA toolkit already in the environment. Installing PyTorch 1.12.1 separately when creating the environment avoids those conflicts. Thank you again for your work!
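For reference, a sketch of this workaround; the environment name, Python version, and cu116 wheel index are all illustrative assumptions, so match the CUDA build to your driver:

```bash
# Create the environment first, then pin PyTorch 1.12.1 explicitly so pip
# never resolves 1.13.1 together with its bundled nvidia_*_cu11 wheels.
conda create -n tsdiff python=3.8
conda activate tsdiff
pip install "torch==1.12.1+cu116" --extra-index-url https://download.pytorch.org/whl/cu116
# ...then install the remaining project dependencies as usual.
```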
Great! Closing this issue. Please open a new one if you face other problems. :)
Reopening to keep track of the GPU ID issue.
> In bin/train_model.py, line 214, in the pl.Trainer() setup: maybe "devices=1" should be changed to something like "devices=[int(config["device"][-1])]" to select the right GPU id.