JukeBox Augmentation Triggers CUDNN Error

Gariscat commented 1 year ago

Greetings!

Thank you for releasing this repo. We were trying to run inference with the GPU (Jukebox) version on an EDM dataset of ours. We rented a bare-metal machine on Featurize with an RTX A4000 (16 GB memory). However, it produced the following error, which seems to be related to cuDNN.

We would really appreciate it if you could provide any advice. Thanks again :)

(base) ➜  ~ sudo ./sheetsage.sh -j work/test.wav        
Copying input file work/test.wav to container as ./output/input
Running Sheet Sage via Docker with args: -j /sheetsage/output/input
INFO:root:Loading audio from /sheetsage/output/input
INFO:root:DETECTING_BEATS
INFO:root:EXTRACTING_FEATURES
INFO:root:Feature extraction w/ Jukebox could take several minutes.
  0%|                                                                                                                                              | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/sheetsage/sheetsage/infer.py", line 851, in <module>
    tqdm=tqdm,
  File "/sheetsage/sheetsage/infer.py", line 681, in sheetsage
    audio_path_or_bytes, input_feats, tertiaries_times, chunks_tertiaries, tqdm
  File "/sheetsage/sheetsage/infer.py", line 367, in _extract_features
    fr, feats = extractor(audio_path, offset=offset, duration=duration)
  File "/sheetsage/sheetsage/representations/jukebox.py", line 233, in __call__
    codified_audio = self.codify_audio(audio)
  File "/sheetsage/sheetsage/representations/jukebox.py", line 132, in codify_audio
    return self._codify_audio(audio, tqdm=tqdm)
  File "/sheetsage/sheetsage/representations/jukebox.py", line 126, in _codify_audio
    context_codified = self.vqvae.encode(context)[-1].view(-1).cpu().numpy()
  File "/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/vqvae.py", line 141, in encode
    zs_i = self._encode(x_i, start_level=start_level, end_level=end_level)
  File "/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/vqvae.py", line 132, in _encode
    x_out = encoder(x_in)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/encdec.py", line 80, in forward
    x = level_block(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/jukebox/vqvae/encdec.py", line 26, in forward
    return self.model(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

chrisdonahue commented 1 year ago

Hello. I'm not 100% sure what's causing this. Containerizing Jukebox GPU support via Docker has unfortunately always been quite brittle.

What versions of CUDA and cuDNN are on the host machine? One possible thing to try is upgrading (or downgrading?) these packages. I think I was using CUDA 11 and cuDNN 8 on the host machine when I last ran feature extraction from Jukebox.
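One way to collect the host-side information asked about here is to query the driver through `nvidia-smi`. The sketch below is not part of Sheet Sage: the helper names `parse_gpu_csv` and `query_host_gpus` are hypothetical, and it assumes `nvidia-smi` is on the PATH. It reports only the GPU name and driver version; the CUDA and cuDNN libraries that PyTorch uses inside the Docker image are bundled with the image and may differ from the host's.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse `name, driver_version` rows as printed by nvidia-smi in CSV mode."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split from the right: the GPU name may contain spaces,
        # while the driver version is always the last field.
        name, driver = (field.strip() for field in line.rsplit(",", 1))
        gpus.append({"name": name, "driver_version": driver})
    return gpus


def query_host_gpus():
    """Return name and driver version per GPU, or [] if the query is not possible."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            check=True,
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_host_gpus():
        print(gpu["name"], "driver", gpu["driver_version"])
```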

matthewliuswims commented 8 months ago

I'm getting the exact same error message, and I have CUDA 11 and cuDNN 8 on the host machine (Ubuntu 22), as you can see in the terminal output below. FWIW, other folks seem to have run into this issue.

I'd be happy to make a PR with an updated README if we can figure this error out. I think we're close :crossed_fingers: and I think it'd help lots of other folks to find a solution

matthewliu:~/Desktop/sheetsage$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

matthewliu:~/Desktop/sheetsage$ dpkg -l | grep cudnn
ii  cudnn-local-repo-ubuntu2204-8.9.6.50       1.0-1                                   amd64        cudnn-local repository configuration files
ii  libcudnn8                                  8.9.6.50-1+cuda12.2                     amd64        cuDNN runtime libraries
ii  libcudnn8-dev                              8.9.6.50-1+cuda12.2                     amd64        cuDNN development libraries and headers
ii  libcudnn8-samples                          8.9.6.50-1+cuda12.2                     amd64        cuDNN samples

cc @elloza and @tanchihpin0517 from the other post so we can consolidate the problem+solution in 1 thread!

As an aside, @chrisdonahue, thank you SO much for making this code open source :smile: I can tell you put a lot of time into making the code polished (with the dockerization, comprehensive README, scripts, etc.)

But I did have a high-level question. I read over the paper, and I understood Jukebox to only be used as part of the training step; I wasn't aware of Jukebox being used in the inference step (I very well could be missing something here). Did I miss that aspect in the paper?

chrisdonahue commented 8 months ago

Hi @matthewliuswims . Sorry this is still causing problems - I'm not sure how to replicate / debug. Will happily review a PR though if someone is able to resolve...

Re: high-level question. Our best model takes features computed from Jukebox (intermediate layer activations) as inputs (as opposed to common features like Mel spectrograms). This behavior is enabled with ./sheetsage.sh -j. You should be able to use the Mel spectrogram-based model without Jukebox (though it's not nearly as good).

XaryLee commented 8 months ago

Hi @chrisdonahue. I inspected the Docker image sheetsage-dev you provided and found this line: NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419 which, I guess, requires a GPU with the Tesla architecture. In my testing, the project runs successfully on a Tesla T4 GPU but hits the same issue on an A100 GPU with the Ampere architecture. And according to your paper, the model was trained on a K40 GPU, which also has the Tesla architecture. So I guess this might be the reason? @matthewliuswims @Gariscat Could you kindly tell me the architecture of your GPU? It may help to validate my hypothesis.
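A note on reading the NVIDIA_REQUIRE_CUDA value quoted above: as far as I understand the NVIDIA container runtime documentation, space-separated constraints are ORed and comma-separated constraints are ANDed, so the brand=tesla clause would be one alternative rather than a hard requirement. The following is only a sketch of that reading, with a hypothetical helper name `parse_require`; it is not how the runtime itself is implemented.

```python
import re


def parse_require(value):
    """Split a NVIDIA_REQUIRE_* string into OR-groups of AND-ed constraints.

    Each constraint is returned as a (key, operator, value) tuple.
    """
    groups = []
    for group in value.split():  # space-separated groups: alternatives (OR)
        constraints = []
        for item in group.split(","):  # comma-separated items: all must hold (AND)
            match = re.fullmatch(r"([a-z_]+)(>=|<=|!=|=|>|<)(.+)", item)
            if match is None:
                raise ValueError("unrecognized constraint: %r" % item)
            constraints.append(match.groups())
        groups.append(constraints)
    return groups


if __name__ == "__main__":
    value = "cuda>=11.0 brand=tesla,driver>=418,driver<419"
    for alternative in parse_require(value):
        print(" AND ".join("".join(c) for c in alternative))
```

Under this reading, the first alternative (cuda>=11.0) concerns the CUDA version the host driver supports and does not mention the GPU brand at all.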

matthewliuswims commented 8 months ago

Hi @XaryLee - I'm using a GeForce RTX 3060, which is based on the NVIDIA Ampere architecture, so your hypothesis certainly makes sense and is consistent with everything we've seen. Hmm, in terms of next steps, I'm not sure I have the CUDA expertise to immediately dive into the bowels of this project and make it compatible with the Ampere architecture, but who knows...

> In my testing, the project can run successfully on a Tesla T4 GPU

I'm curious from your own experience, how much better the results (qualitatively) were for you compared to running the model without the -j flag.

XaryLee commented 8 months ago

Thanks for your information @matthewliuswims . To my knowledge, the use of Jukebox features as representation in Sheet Sage significantly improves the quality of results compared to its non-Jukebox version. This improvement is observed in various aspects, including pitch, rhythm, and more. So I think for high-quality transcription, using Jukebox is necessary.

I also attempted to rebuild the Docker image using the Dockerfile provided in the source code, but the build failed. I found that the outdated package versions specified in the Dockerfile might cause compatibility issues with recent GPU architectures. For example, A100 GPUs require a CUDA version >= 11.0 and a torch version >= 1.7. However, in the Dockerfile, the PyTorch version is 1.4 with CUDA 10.4. So using older GPUs may work; alternatively, I would appreciate it if the code could be updated to ensure compatibility with the latest architectures.
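The incompatibility described here can be written down as a small lookup. The tables below list the compute capability of GPUs mentioned in this thread and the first CUDA toolkit release that can natively target each one. These values come from NVIDIA's public specifications, not from the Sheet Sage repository, and the function name `toolkit_supports` is hypothetical. The check ignores PTX forward compatibility, so treat it as a rough guide only.

```python
# First CUDA toolkit release (major, minor) that can natively target
# each compute capability.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing, e.g. Tesla T4
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 1),  # Ampere, e.g. RTX 3060, RTX A4000
}

# Compute capability of GPUs mentioned in this thread.
GPU_CAPABILITY = {
    "Tesla T4": (7, 5),
    "A100": (8, 0),
    "RTX 3060": (8, 6),
    "RTX A4000": (8, 6),
}


def toolkit_supports(gpu_name, cuda_version):
    """Return True if a CUDA toolkit (major, minor) can natively target the GPU."""
    capability = GPU_CAPABILITY[gpu_name]
    # Tuples compare element by element, so (10, 1) < (11, 0).
    return cuda_version >= MIN_CUDA_FOR_CAPABILITY[capability]


if __name__ == "__main__":
    # (10, 1) is an assumed CUDA 10.x toolkit, used only for illustration.
    for name in sorted(GPU_CAPABILITY):
        print(name, toolkit_supports(name, (10, 1)))
```

With any CUDA 10.x toolkit, only the T4 passes this check, which matches the pattern reported in this thread.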

matthewliuswims commented 8 months ago

As per the above, I was able to get further by running this repo on a g4dn.xlarge, which has the Tesla architecture for the GPU 😄. This gets rid of the error in the original post. But the command doesn't produce any kind of successful output (nor does it give any indication that there was an error).

ubuntu@ip-172-31-68-207:~/sheetsage$ ./sheetsage.sh -j happy-birthday-short.mp3
Copying input file happy-birthday-short.mp3 to container as ./output/input
Running Sheet Sage via Docker with args: -j /sheetsage/output/input
INFO:root:Loading audio from /sheetsage/output/input
INFO:root:DETECTING_BEATS
INFO:root:EXTRACTING_FEATURES
INFO:root:Feature extraction w/ Jukebox could take several minutes.
ubuntu@ip-172-31-68-207:~/sheetsage$

It seems like the script never gets past this step. The odd part is that it doesn't seem to actually hang for that long 🤔 and the audio file I gave it is only 7 seconds long. Sorry to bother you again @XaryLee, but since you have actually run the augmentation successfully, I'm curious whether you ran into this same issue.

XaryLee commented 8 months ago

Hi @matthewliuswims. Hmm, the program exiting without reporting any error is indeed unusual. I have never encountered this before. But in my experience, during its initial run the program downloads the Jukebox and Sheet Sage models from the cloud, which are approximately 10 GB and may take some time depending on your server's Internet speed. I recall it took me about 10 minutes, and the time cost is independent of the length of the input song. Perhaps with some patience the program will run successfully. Additionally, I am curious about the number of CPU cores on your server. I am running on an eight-core CPU, and according to one of my research partners, the program cannot run on a four-core CPU, although I have not personally tested this. I hope this helps with the problem.
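When a command returns to the prompt without any message, its exit status is often the only clue. Below is a sketch that assumes the usual shell convention: a status above 128 means the process was terminated by a signal (status minus 128). For example, 137 corresponds to SIGKILL, which is what a process killed by the kernel's out-of-memory handler would report. The helper name `describe_exit` is hypothetical, and whether this applies to the run above is only a guess.

```python
import signal


def describe_exit(status):
    """Turn a process exit status into a short human-readable description."""
    if status == 0:
        return "exited normally"
    if status < 0:
        # Python's subprocess module reports death by signal N as -N.
        signal_number = -status
    elif status > 128:
        # Shells report death by signal N as 128 + N.
        signal_number = status - 128
    else:
        return "exited with error code %d" % status
    try:
        name = signal.Signals(signal_number).name
    except ValueError:
        name = "signal %d" % signal_number
    return "terminated by %s" % name


if __name__ == "__main__":
    for status in (0, 1, 137):
        print(status, describe_exit(status))
```

The status can be read in the shell by running `echo $?` immediately after the command, assuming the wrapper script passes through the container's exit status.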

XaryLee commented 7 months ago

Hi @matthewliuswims. I hope your issue has been resolved. I recently went through the Dockerfile and the environment that Sheet Sage depends on. I then updated the Dockerfile, rebuilt the Docker image, and successfully ran the Jukebox version of Sheet Sage on my A100 machine. In the refreshed Dockerfile, I upgraded several outdated libraries and resolved some version conflicts present in the open-source code. I also revised the shell scripts to accommodate these changes. I have pushed these modifications to my forked repository and submitted a pull request. I would greatly appreciate it if the maintainer, @chrisdonahue, could review it.

chrisdonahue / sheetsage

JukeBox Augmentation Triggers CUDNN Error #27