LibCity / Bigscity-LibCity

LibCity: An Open Library for Urban Spatial-temporal Data Mining
https://libcity.ai/
Apache License 2.0

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #326

Closed: Mardok95 closed this issue 1 year ago.

Mardok95 commented 1 year ago

Hi, I'm using LibCity on a Google Cloud VM running Debian 10 with CUDA Toolkit 11.0 and an NVIDIA T4. I ran this command: python run_model.py --task traj_loc_pred --model LSTPM --dataset gowalla. After the encoding phase, I get this message:

2023-01-28 10:11:20,536 - INFO - start train
2023-01-30 19:44:32,921 - INFO - ==>Train Epoch:   0 Loss:373.83639 learning_rate:0.0001
2023-01-30 19:44:32,921 - INFO - start evaluate
Traceback (most recent call last):
  File "run_model.py", line 38, in <module>
    train=args.train, other_args=other_args)
  File "/home/adimartino95/Bigscity-LibCity/libcity/pipeline/pipeline.py", line 57, in run_model
    executor.train(train_data, valid_data)
  File "/home/adimartino95/Bigscity-LibCity/libcity/executor/traj_loc_pred_executor.py", line 43, in train
    avg_eval_acc, avg_eval_loss = self._valid_epoch(eval_dataloader, self.model)
  File "/home/adimartino95/Bigscity-LibCity/libcity/executor/traj_loc_pred_executor.py", line 142, in _valid_epoch 
    scores = model.predict(batch)
  File "/home/adimartino95/Bigscity-LibCity/libcity/model/trajectory_loc_prediction/LSTPM.py", line 168, in predict
    return self.forward(batch)
  File "/home/adimartino95/Bigscity-LibCity/libcity/model/trajectory_loc_prediction/LSTPM.py", line 114, in forward
    sequence_emb, (h2, c2) = self.lstmcell_history(sequence_emb, (h2, c2))
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/rnn.py", line 582, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Any suggestion?
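A common first step with asynchronous CUDA errors like this one is to re-run with the environment variable CUDA_LAUNCH_BLOCKING=1. It makes kernel launches synchronous, so the traceback points at the operation that actually failed instead of a later call (here, the LSTM in LSTPM.forward). The sketch below re-runs the command from the report with that variable set; CMD, debug_env, and rerun_with_blocking are hypothetical helper names, not part of LibCity.

```python
import os
import subprocess
import sys

# Command from the report above (hypothetical wrapper; adjust arguments as needed).
CMD = [sys.executable, "run_model.py", "--task", "traj_loc_pred",
       "--model", "LSTPM", "--dataset", "gowalla"]


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA kernel launches.

    The input mapping is not modified; only the returned copy gets the flag.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_blocking(cmd=CMD):
    """Re-run the command with CUDA_LAUNCH_BLOCKING=1 (slower; for debugging only)."""
    return subprocess.run(cmd, env=debug_env(), check=False)
```

A related check is to set torch.backends.cudnn.enabled = False before training. If the error goes away, the failure is specific to the cuDNN kernels rather than to the model code or the data.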

WenMellors commented 1 year ago

Does the PyTorch version match the CUDA version?

Mardok95 commented 1 year ago

The Python version is 3.7.3, the PyTorch version is 1.7.1+cu110 (from print(torch.__version__)), and the CUDA version is 11.0.228.
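The match can be checked mechanically: a PyTorch wheel tag such as +cu110 names the CUDA version the wheel was built against, which can be compared with the major.minor part of the toolkit version. The helpers below are an illustrative sketch, not part of PyTorch or LibCity, and assume the +cuXYZ tag format used by the official pip wheels.

```python
import re


def cuda_tag_of(torch_version):
    """Extract the CUDA version a PyTorch wheel was built for.

    '1.7.1+cu110' -> '11.0'; returns None for CPU-only or untagged builds.
    """
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version: cu110 -> 11.0, cu102 -> 10.2, cu92 -> 9.2
    return f"{int(digits[:-1])}.{digits[-1]}"


def versions_match(torch_version, cuda_version):
    """True if the wheel's CUDA tag agrees with the toolkit's major.minor version."""
    tag = cuda_tag_of(torch_version)
    if tag is None:
        return False
    major_minor = ".".join(cuda_version.split(".")[:2])
    return tag == major_minor
```

With the versions from this thread, versions_match("1.7.1+cu110", "11.0.228") returns True. Keep in mind that the pip wheels bundle their own CUDA runtime and cuDNN libraries, so in practice the installed NVIDIA driver must support the CUDA version in the wheel tag; the locally installed toolkit version matters less.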

WenMellors commented 1 year ago

That seems OK. I'm afraid I have no idea what is causing this error.

Mardok95 commented 1 year ago


Don't worry. I'll try to figure something out. Thank you!