Training musicgen seems to be incompatible with Nvidia H100 card with CUDA 11.8

sbrother commented 9 months ago

I've been struggling trying to debug RuntimeError: CUDA error: an illegal instruction was encountered for a while, until I tried to train on an A100 card which works. Steps to reproduce

dora run solver=musicgen/debug compression_model_checkpoint=//pretrained/facebook/encodec_32khz transformer_lm.n_q=4 transformer_lm.card=2048

On the H100 box this results in the stacktrace:

Error executing job with overrides: ['solver=musicgen/debug', 'compression_model_checkpoint=//pretrained/facebook/encodec_32khz', 'transformer_lm.n_q=4', 'transformer_lm.card=2048']
Traceback (most recent call last):
  File "/home/sbrother/audiocraft/audiocraft/train.py", line 146, in main
    return solver.run()
  File "/home/sbrother/audiocraft/audiocraft/solvers/base.py", line 497, in run
    self.run_epoch()
  File "/home/sbrother/audiocraft/audiocraft/solvers/musicgen.py", line 572, in run_epoch
    super().run_epoch()
  File "/home/sbrother/audiocraft/audiocraft/solvers/base.py", line 477, in run_epoch
    self.run_stage('train', self.train)
  File "/home/sbrother/venv/lib/python3.9/site-packages/flashy/solver.py", line 199, in run_stage
    metrics = method(*args, **kwargs)
  File "/home/sbrother/audiocraft/audiocraft/solvers/musicgen.py", line 585, in train
    return super().train()
  File "/home/sbrother/audiocraft/audiocraft/solvers/base.py", line 561, in train
    return self.common_train_valid('train')
  File "/home/sbrother/audiocraft/audiocraft/solvers/base.py", line 542, in common_train_valid
    metrics = self.run_step(idx, batch, metrics)
  File "/home/sbrother/audiocraft/audiocraft/solvers/musicgen.py", line 364, in run_step
    loss.backward()
  File "/home/sbrother/venv/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/sbrother/venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal instruction was encountered

While on the A100 box it completes successfully.

Both instances are Ubuntu 22.04, Python 3.9, Torch 2.0.1

The failing H100 instance has CUDA 11.8:

# This fails
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

The succeeding A100 instance has CUDA 11.5:

# This works
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

l11ama commented 9 months ago

I encountered the same issue with H100. It is also present on different versions of CUDA (I tried Cuda 12, 11.8, 11.7) and Pytorch (I tried 2.0.0 and 2.0.1).

l11ama commented 9 months ago

Found a solution. The main issue is that H100 is not supported by the correct standard PyTorch versions. For torch 2.0.1, you will typically see the warning: NVIDIA H100 80GB HBM3 with CUDA capability sm_90 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.

Try using the NGC PyTorch Docker image version 22.10 or later: nvcr.io/nvidia/pytorch:xx.xx-py3.

Install torchaudio from sources, keeping the same version of PyTorch that comes with the container.

pip install --no-deps torchaudio
pip install ninja
git clone https://github.com/pytorch/audio
cd audio/
python setup.py develop --user

Install demucs, encodec and xformers also from sources keeping the correct torch version.

Install audiocraft with other dependencies.

jbm-composer commented 3 months ago

I seem to be having H100 specific problems as well. Is this still potentially an incompatibility? In my case I'm frequently timing out with thread deadlock when running slurm.

Just to note; single-node jobs (with dora run -d) run without issue.

facebookresearch / audiocraft

Training musicgen seems to be incompatible with Nvidia H100 card with CUDA 11.8 #290