amazon-science / unconditional-time-series-diffusion

Official PyTorch implementation of TSDiff models presented in the NeurIPS 2023 paper "Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting"
Apache License 2.0

GPU usage #3

Closed zzkkzz closed 8 months ago

zzkkzz commented 10 months ago

In `bin/train_model.py`, line 214, in the `pl.Trainer()` setup: maybe `devices=1` should be changed to something like `devices=[int(config["device"][-1])]` to select the right GPU ID.
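For reference, a minimal sketch of that parsing logic. `parse_devices` is a hypothetical helper (not part of the repo), assuming `config["device"]` is a string like `"cuda:1"`:

```python
# Hypothetical helper (not in the repo): turn a config device string
# such as "cuda:1" into the `devices` list that pl.Trainer expects.
# Splitting on ":" is slightly safer than indexing the last character,
# since it also handles multi-digit GPU IDs like "cuda:10".
def parse_devices(device: str) -> list:
    return [int(device.split(":")[-1])]

# Usage sketch:
# trainer = pl.Trainer(accelerator="gpu", devices=parse_devices(config["device"]), ...)
```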

zzkkzz commented 10 months ago

Another problem occurs only on specific datasets, such as solar, exchange, and electricity (the code runs normally on the m4 and traffic datasets). The command is `python bin/train_model.py -c configs/train_tsdiff/train_solar.yaml`, and the error message is confusing:

```
Traceback (most recent call last):
  File "bin/train_model.py", line 278, in <module>
    main(config=config, log_dir=args.out_dir)
  File "bin/train_model.py", line 217, in main
    trainer.fit(model, train_dataloaders=data_loader)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 295, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/x22221259/code/workspace/TSDiff/src/uncond_ts_diff/model/callback.py", line 258, in on_train_epoch_end
    forecasts_pytorch = list(forecast_it)
  File "/x22221259/code/workspace/TSDiff/src/uncond_ts_diff/predictor.py", line 25, in predict
    yield from self.forecast_generator(
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/gluonts/model/forecast_generator.py", line 156, in __call__
    outputs = predict_to_numpy(prediction_net, inputs)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/gluonts/torch/model/predictor.py", line 38, in _
    return prediction_net(*args).cpu().numpy()
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/x22221259/code/workspace/TSDiff/src/uncond_ts_diff/sampler/observation_guidance.py", line 165, in forward
    pred = self.guide(observation, observation_mask, features, base_scale)
  File "/x22221259/code/workspace/TSDiff/src/uncond_ts_diff/sampler/observation_guidance.py", line 229, in guide
    return self._reverse_diffusion(
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/x22221259/code/workspace/TSDiff/src/uncond_ts_diff/sampler/observation_guidance.py", line 216, in _reverse_diffusion
    seq = self.model.p_sample(seq, t, i, features)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/x22221259/code/workspace/TSDiff/src/uncond_ts_diff/model/diffusion/_base.py", line 172, in p_sample
    predicted_noise = self.backbone(x, t, features)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/x22221259/code/workspace/TSDiff/src/uncond_ts_diff/arch/backbones.py", line 156, in forward
    x = self.input_init(input)  # B, L, C
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/x22221259/anaconda/envs/tsdiff/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())
```

Did I make a mistake in my settings? The problem occurs when training reaches around epoch 49, so I suspect something is wrong with the sampling.
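(Aside: cuBLAS errors like this are raised asynchronously by default, so the reported stack frame may not be the real culprit. A common first diagnostic step, not specific to this repo, is to re-run with synchronous kernel launches:)

```shell
# CUDA_LAUNCH_BLOCKING=1 forces each kernel launch to synchronize, so the
# Python stack trace points at the op that actually failed rather than a
# later, unrelated call.
CUDA_LAUNCH_BLOCKING=1 python bin/train_model.py -c configs/train_tsdiff/train_solar.yaml
```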

abdulfatir commented 10 months ago

> In `bin/train_model.py`, line 214, in the `pl.Trainer()` setup: maybe `devices=1` should be changed to something like `devices=[int(config["device"][-1])]` to select the right GPU ID.

You're right. We always tested on a machine with a single GPU, so this got overlooked. Thanks!

> Did I make a mistake in my settings? The problem occurs when training reaches around epoch 49, so I suspect something is wrong with the sampling.

I am actually not sure about this. I did not face such an issue. Do you have a MWE by any chance? If not, I will try to start a training job on my end.

cc @marcelkollovieh

abdulfatir commented 10 months ago

@zzkkzz can you share the exact command/config that you're running?

abdulfatir commented 10 months ago

I ran this job and it works for me.

```
python bin/train_model.py -c configs/train_tsdiff/train_solar.yaml
```

Output:

```
Epoch 445/999 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/-- 0:00:04 • -:--:-- 22.72it/s
```

zzkkzz commented 10 months ago

Thank you for your reply! My second problem was related to PyTorch 1.13.1 and the Anaconda environment, and I solved it by installing PyTorch 1.12.1 separately. The issue arises because installing PyTorch 1.13.1 automatically downloads nvidia_cublas_cu11, nvidia_cuda_nvrtc_cu11, nvidia_cuda_runtime_cu11, and nvidia_cudnn_cu11, which conflict with the local CUDA toolkit. Installing PyTorch 1.12.1 separately when creating the environment avoids those conflicts. Thank you again for your work!
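(For anyone hitting the same conflict, a rough sketch of this workaround. The environment name and Python version are assumptions based on the paths in the traceback; install the repo's remaining requirements afterwards as usual:)

```shell
# Pin torch to 1.12.1 when creating the env. The 1.12.x pip wheels do not
# depend on the nvidia-*-cu11 packages, so the bundled CUDA libraries
# cannot conflict with a locally installed CUDA toolkit; torch >= 1.13
# pulls those packages in automatically.
conda create -n tsdiff python=3.8 -y
conda activate tsdiff
pip install torch==1.12.1
# ...then install the remaining project dependencies as usual.
```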

abdulfatir commented 10 months ago

Great! Closing this issue. Please open a new one, if you face other problems. :)

abdulfatir commented 10 months ago

Reopening to keep track of the GPU ID issue.