Mael-zys / T2M-GPT

(CVPR 2023) PyTorch implementation of “T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations”
https://mael-zys.github.io/T2M-GPT/
Apache License 2.0

Can this really run with just 32GB VRAM? #45

Open YOUNG-WLI opened 1 year ago

YOUNG-WLI commented 1 year ago

I tried VQ-VAE training with the parameters suggested in the documentation (however, since we are still building the HumanML3D dataset, we used the '--dataname kit' option instead). Training failed with 'RuntimeError: Unable to find a valid cuDNN algorithm to run convolution', which is commonly reported when the GPU runs out of VRAM. The error occurred even though the GPU used for training was an RTX A6000 with 48GB of VRAM, more than the suggested 32GB. What is even more puzzling is that the same error occurs even when the batch size is lowered drastically, all the way down to 1. The detailed error message is as follows:

Traceback (most recent call last):
  File "train_vq.py", line 116, in <module>
    loss.backward()
  File "/root/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
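
A quick way to narrow down whether this is really a memory problem or a CUDA/cuDNN build mismatch is to check which CUDA toolkit the installed PyTorch wheel was built against and whether it can run a small convolution on the GPU at all. A minimal diagnostic sketch (the tensor sizes are arbitrary and just for illustration):

import torch

# CUDA toolkit this PyTorch wheel was built against (e.g. '10.1' vs '11.1') and cuDNN build
print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())
# Compute capability of the installed GPU (an RTX A6000 reports (8, 6), i.e. Ampere)
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

# Try a tiny convolution on the GPU; if this already fails, the problem is the
# CUDA build, not the batch size or available VRAM.
x = torch.randn(1, 3, 64, 64, device="cuda")
conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
print(conv(x).shape)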
xu0166 commented 1 month ago

I encountered the same issue: the default installation pulls in PyTorch built against CUDA 10.1 ("cu101"), which does not support Ampere GPUs such as the RTX A6000, so cuDNN cannot find a usable convolution kernel no matter how small the batch size is. Reinstall with a CUDA 11.1 ("cu111") build using the following command:

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
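
After reinstalling, a short sanity check (same assumptions as the diagnostic sketch above) is to confirm that PyTorch now reports a cu111 build and that a GPU convolution succeeds:

import torch

# Should now report 1.8.1+cu111 / 11.1 and a cuDNN 8.x build
print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())

# The convolution that previously triggered the cuDNN error should now run
x = torch.randn(1, 3, 64, 64, device="cuda")
print(torch.nn.Conv2d(3, 8, kernel_size=3).cuda()(x).shape)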