aedocw / epub2tts

Turn an epub or text file into an audiobook
Apache License 2.0
433 stars 43 forks

Problem with CUDA #168

Closed — friki67 closed this issue 5 months ago

friki67 commented 6 months ago

Hello. I was having a problem with the 239-character limit in Spanish (I've read an issue and a discussion about this in French), so I updated epub2tts from 2.2.14 to 2.3.4 by reinstalling from GitHub.
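(For reference, "reinstalling from GitHub" just means the usual reinstall from my local clone of the repo — roughly the following, as a sketch rather than the exact commands:)

cd epub2tts       # local clone of this repository
git pull          # pull the 2.3.4 code
pip install .     # reinstall the package over the previous version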

Now I'm getting a CUDA-related error when trying an epub conversion.

Using GPU
VRAM: 8506114048
Loading model: /home/ubuntu/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
 > Using model: xtts
[2024-01-06 17:27:14,703] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-06 17:27:15,505] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2024-01-06 17:27:15,507] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2024-01-06 17:27:15,507] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2024-01-06 17:27:15,507] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/epub2tts", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/epub2tts.py", line 724, in main
    mybook.read_book(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/epub2tts.py", line 379, in read_book
    self.model.load_checkpoint(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 783, in load_checkpoint
    self.gpt.init_gpt_for_inference(kv_cache=self.args.kv_cache, use_deepspeed=use_deepspeed)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt.py", line 224, in init_gpt_for_inference
    self.ds_engine = deepspeed.init_inference(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 158, in __init__
    self._apply_injection_policy(config)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 418, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 342, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 586, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 622, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 298, in replace_fn
    new_module = replace_with_policy(child,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 247, in replace_with_policy   
    _container.create_module()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/containers/gpt2.py", line 20, in create_module
    self.module = DeepSpeedGPTInference(_config, mp_group=self.mp_group)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_gpt.py", line 20, in __init__  
    super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 58, in __init__
    inference_module = builder.load()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 458, in load
    return self.jit_load(verbose)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 461, in jit_load
    if not self.is_compatible(verbose):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/ops/op_builder/transformer_inference.py", line 29, in is_compatible  
    sys_cuda_major, _ = installed_cuda_version()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 50, in installed_cuda_version
    raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)

No other changes or updates have been made, on the host or in the container (this is running in an LXC Ubuntu container).

Any easy fix?

And I want to mention that with the latest version, using DeepSpeed and my modest GPU (GTX 1070), the conversion speed ratio is slightly under 1! Amazing!

aedocw commented 6 months ago

Could you run pip show deepspeed and share the version you're using? I'm on 0.12.6 and have had no problems. I saw discussion of setting CUDA_HOME here, but since I haven't had any issues I did not follow those steps, so I can't say whether they're useful. I would start by updating deepspeed if it's not at 0.12.6, and if that doesn't work, figure out what CUDA_HOME should be and make sure that environment variable is exported.
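Something along these lines is what I have in mind — /usr/local/cuda is only the most common location, so treat the path as a placeholder and adjust it to wherever your toolkit actually lives:

pip show deepspeed                  # confirm the installed deepspeed version
ls -d /usr/local/cuda*              # check whether a CUDA toolkit directory exists at all
export CUDA_HOME=/usr/local/cuda    # point DeepSpeed at it (adjust the path as needed)
ls "$CUDA_HOME/bin/nvcc"            # DeepSpeed JIT-compiles its ops, so it needs nvcc under CUDA_HOME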

Please update this issue with what you find in case others hit the same problem; there may also be other folks who have seen this and can offer some guidance.

friki67 commented 6 months ago

Hello. Here is the output of pip show deepspeed:

Name: deepspeed
Version: 0.12.6
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /home/ubuntu/.local/lib/python3.10/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, pynvml, torch, tqdm
Required-by: epub2tts

This is really strange. Yesterday's version was working with DeepSpeed; now it gives this error. If I try export CUDA_HOME=/usr/local/cuda, then I get

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'

So it seems that now I need to install the CUDA toolkit?

aedocw commented 6 months ago

Maybe? I had the CUDA toolkit installed for other things, and unfortunately I don't have a clean environment to test from.

I can't think of anything that changed yesterday that would have triggered this. The only big change was to epub2tts itself, which switched the method it uses to call Coqui TTS if you are using one of the studio voices (basically using the same streaming method it was already using for XTTS).

I wonder if pip install . --upgrade pulled in something new from TTS or one of its requirements? Sorry I can't be of more help with this. If I get a chance I'll see about spinning up a VM on my GPU machine to see what happens when I start clean.

danielw97 commented 6 months ago

Hi, I've been working with this over the last few days after acquiring a new system with a better GPU, and getting set up on WSL as my main environment for running this. I ended up using miniconda to set up my Python environment for epub2tts, because even though I had the CUDA toolkit installed I was getting errors with DeepSpeed, although everything else seemed to work. The nice thing about using Anaconda to set up the CUDA and torch environment is that it seems to handle the library linking and dependency issues and lets things run nicely. Not sure if this is any help, although I'm happy to provide more detail if that would help get you set up.
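If it helps, the rough shape of what I did was something like the following — the package names and channels are from memory, so please double-check them against the NVIDIA and PyTorch docs rather than taking this as exact:

conda create -n epub2tts python=3.10    # fresh environment just for epub2tts
conda activate epub2tts
conda install -c nvidia cuda-toolkit     # pulls nvcc and the CUDA libraries into the env
pip install .                            # install epub2tts (and its deepspeed/TTS deps) from the checkout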

aedocw commented 6 months ago

In case I didn't mention this anywhere else (now that I think about it, I probably did not).

I added a flag, "--no-deepspeed", which disables the use of deepspeed even if it finds that the package is installed in the environment. Could you give that a try and see if you're able to use the GPU without deepspeed? It will help with troubleshooting this. Ultimately we'll figure it out so you get back to what was working (GPU + deepspeed).
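For example, something like this (with "mybook.epub" standing in for your own file, and any other options you normally pass left unchanged):

epub2tts mybook.epub --no-deepspeed    # GPU inference, just without DeepSpeed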

Nikanoru commented 6 months ago

> In case I didn't mention this anywhere else (now that I think about it, I probably did not).
>
> I added a flag, "--no-deepspeed", which disables the use of deepspeed even if it finds that the package is installed in the environment. Could you give that a try and see if you're able to use the GPU without deepspeed? It will help with troubleshooting this. Ultimately we'll figure it out so you get back to what was working (GPU + deepspeed).

Since the other user did not respond to your "try with --no-deepspeed" suggestion, I just tried it and it does indeed seem to work. It's at 25% currently; I will update once the process has finished and I have listened to the file.

Edit: The process finished successfully and the audio file sounds very nice :) Thank you!

friki67 commented 6 months ago

> In case I didn't mention this anywhere else (now that I think about it, I probably did not).
>
> I added a flag, "--no-deepspeed", which disables the use of deepspeed even if it finds that the package is installed in the environment. Could you give that a try and see if you're able to use the GPU without deepspeed? It will help with troubleshooting this. Ultimately we'll figure it out so you get back to what was working (GPU + deepspeed).

Thank you very much. As @Nikanoru said, the flag is working. When I have some spare time, I will try it in a new container and tell you what happens.

friki67 commented 6 months ago

Hello again. I set up a new LXC (LXD) container, granting GPU permissions and so on, installed all the dependencies, and then installed the latest version of epub2tts.

Again, the CUDA error appeared; using --no-deepspeed worked.

Then I installed the CUDA toolkit inside the container, and it worked. The process took 248 min plus multiplexing and so on, and the generated audio is approximately 376 min long.

So my solution was to install the CUDA toolkit in order to use DeepSpeed.

Thank you very much.

Nikanoru commented 5 months ago

> Hello again. I set up a new LXC (LXD) container, granting GPU permissions and so on, installed all the dependencies, and then installed the latest version of epub2tts.
>
> Again, the CUDA error appeared; using --no-deepspeed worked.
>
> Then I installed the CUDA toolkit inside the container, and it worked. The process took 248 min plus multiplexing and so on, and the generated audio is approximately 376 min long.
>
> So my solution was to install the CUDA toolkit in order to use DeepSpeed.
>
> Thank you very much.

Hey, thank you for your message.

Can you go into more detail about "installed CUDA toolkit inside the container"? I am very new to Ubuntu/Linux and I had the same issue as you.

friki67 commented 5 months ago

> Can you go into more detail about "installed CUDA toolkit inside the container"? I am very new to Ubuntu/Linux and I had the same issue as you.

I went to https://developer.nvidia.com/cuda-downloads?target_os=Linux, made my choice (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network), and then followed the instructions:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

Then restart the container (or the computer), and it works.

I think you need the NVIDIA proprietary drivers installed. In my case I have them installed on the host and configured my container so it can use them and the GPU.
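If DeepSpeed still complains about CUDA_HOME after that, it may be worth a quick sanity check that nvcc ended up where DeepSpeed expects it — the paths below are just where the Ubuntu packages normally put things, so adjust if yours differ:

ls -l /usr/local/cuda                   # the toolkit package normally creates this symlink
/usr/local/cuda/bin/nvcc --version      # confirm nvcc is present and reports the 12.3 release
export CUDA_HOME=/usr/local/cuda        # only needed if DeepSpeed still cannot locate the toolkit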

Nikanoru commented 5 months ago

> I went to https://developer.nvidia.com/cuda-downloads?target_os=Linux, made my choice (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network), and then followed the instructions:
>
> wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
> sudo dpkg -i cuda-keyring_1.1-1_all.deb
> sudo apt-get update
> sudo apt-get -y install cuda-toolkit-12-3
>
> Then restart the container (or the computer), and it works.
>
> I think you need the NVIDIA proprietary drivers installed. In my case I have them installed on the host and configured my container so it can use them and the GPU.

Thank you so much! Your instructions work :) I did not even have to reboot my system. DeepSpeed cut my time from 5:09 down to 2:00! Nice :)

aedocw commented 5 months ago

The documentation update should cover this now.