sbrother opened this issue 9 months ago
I encountered the same issue with the H100. It is also present across different versions of CUDA (I tried CUDA 12, 11.8, and 11.7) and PyTorch (I tried 2.0.0 and 2.0.1).
Found a solution. The main issue is that the H100 is not supported by current standard PyTorch builds. With torch 2.0.1 you will typically see the warning:

```
NVIDIA H100 80GB HBM3 with CUDA capability sm_90 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
```
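A quick way to confirm this mismatch (my addition, not from the original report) is to print the compute capabilities the installed wheel was built for next to the capability the device itself reports:

```bash
# Sketch: if sm_90 is missing from the arch list while the device reports
# capability (9, 0), the installed PyTorch build cannot target the H100.
python -c "import torch; print(torch.cuda.get_arch_list()); print(torch.cuda.get_device_capability(0))"
```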
The workaround:

1. Use the NGC PyTorch Docker image, version 22.10 or later: `nvcr.io/nvidia/pytorch:xx.xx-py3`.
2. Install `torchaudio` from source, keeping the same version of PyTorch that ships with the container:
   ```bash
   pip install --no-deps torchaudio
   pip install ninja
   git clone https://github.com/pytorch/audio
   cd audio/
   python setup.py develop --user
   ```
3. Install `demucs`, `encodec`, and `xformers`, also from source, keeping the correct `torch` version (see the sketch after this list).
4. Install `audiocraft` with the other dependencies.
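A minimal sketch of steps 1 and 3 (my addition, assuming the upstream `facebookresearch` GitHub repositories for the three packages; the `xx.xx` tag is left as a placeholder since the exact NGC release depends on your setup):

```bash
# Run the NGC container; pick a concrete tag in place of xx.xx.
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:xx.xx-py3

# Inside the container: build the remaining packages from source with
# --no-deps so pip cannot replace the torch build that ships with it.
pip install --no-deps git+https://github.com/facebookresearch/demucs
pip install --no-deps git+https://github.com/facebookresearch/encodec
git clone https://github.com/facebookresearch/xformers
cd xformers
git submodule update --init --recursive
pip install --no-deps -e .
```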
I seem to be having H100-specific problems as well. Is this still potentially an incompatibility? In my case I'm frequently timing out with a thread deadlock when running under Slurm. Just to note: single-node jobs (with `dora run -d`) run without issue.
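Not an answer from the thread, but a standard way to narrow down where a multi-node run hangs is NCCL's debug logging; the launch line below is a placeholder for the actual job command:

```bash
# Assumption: the deadlock happens in NCCL collectives. These standard NCCL
# env vars log each init step and collective, so the hang point shows up in
# the Slurm logs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
srun python train.py   # placeholder for the real dora/Slurm launch
```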
I've been struggling to debug `RuntimeError: CUDA error: an illegal instruction was encountered` for a while, until I tried to train on an A100 card, which works.

Steps to reproduce

On the H100 box this results in the stack trace:
While on the A100 box it completes successfully.
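One debugging step worth noting here (my addition, not from the repro): CUDA reports errors asynchronously, so the Python traceback for an illegal instruction often points at the wrong line. Forcing synchronous launches pins the error to the kernel that actually triggered it; the script name is a placeholder:

```bash
# Force synchronous kernel launches so the error is raised at the real call
# site, at the cost of running slower.
CUDA_LAUNCH_BLOCKING=1 python train.py   # train.py stands in for the repro script
```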
Both instances are Ubuntu 22.04, Python 3.9, PyTorch 2.0.1.
The failing H100 instance has CUDA 11.8:
The succeeding A100 instance has CUDA 11.5:
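For anyone comparing machines the same way, a quick sketch (my addition) of the relevant versions to capture on each box:

```bash
# torch.version.cuda is the toolkit PyTorch was compiled against;
# nvidia-smi's banner reports the driver's supported CUDA version.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
nvidia-smi | head -n 4
nvcc --version
```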