152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)
GNU Affero General Public License v3.0

CUDA out of memory #38

Open KaleidoscopicPrism opened 1 year ago

KaleidoscopicPrism commented 1 year ago

I know I've been a little obnoxious with the issues, but here's another one (which will hopefully be my last).

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
E:\tortoise-tts-fast\tortoise\do_tts.py:42
     39 │ kwargs = nullable_kwargs(args)
     40 │ os.makedirs(args.output_path, exist_ok=True)
     41 │
 ❱   42 │ tts = TextToSpeech(
     43 │     models_dir=args.model_dir,
     44 │     high_vram=args.high_vram,
     45 │     kv_cache=args.kv_cache,

E:\tortoise-tts-fast\tortoise\api.py:388 in __init__
    385 │ if high_vram:
    386 │     self.autoregressive = self.autoregressive.to(self.device)
    387 │     self.diffusion = self.diffusion.to(self.device)
 ❱  388 │     self.clvp = self.clvp.to(self.device)
    389 │     self.vocoder = self.vocoder.to(self.device)

E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:989 in to
 ❱  989 │ return self._apply(convert)

E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply   (frame repeated for each nested submodule)
 ❱  641 │ module._apply(fn)

E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:664 in _apply
    663 │ with torch.no_grad():
 ❱  664 │     param_applied = fn(param)

E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:987 in convert
 ❱  987 │ return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, …)
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.45 GiB already allocated; 0 bytes free; 3.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
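For reference, the allocator hint at the end of that error is controlled by a standard PyTorch environment variable. The 128 MiB value below is only an example, not a maintainer recommendation; on Windows, set it in the same console before launching the script:

set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python tortoise\do_tts.py ...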

Ph0rk0z commented 1 year ago

You are simply out of memory; this isn't so much an issue. Set it to low VRAM mode.

eloop001 commented 1 year ago

This is a dirty fix, I admit. Your problem is related to "batch size", in this case the length of each chunk the input text is split into. If the output quality is not compromised, I suggest you adjust the text_split option when calling tortoise_tts.py; it is declared as text_split: Optional[str] = None with the description "How big chunks to split the text into, in the format <desired_length>,<max_length>".

If you really need longer chunks, because the narration would otherwise sound weird, you can experiment with lower values in api.py, line 174, in def pick_best_batch_size_for_gpu(): set the returned values lower; you can initially set them to 1. Consider the line elif availableGb > 7: return 4. That would be you; set it to return 1. You would probably need to do both adjustments. 4 GB is not a lot. I have to be VERY careful not to run out of memory even when running on the 20 GB A10 or the 40 GB A100 cloud servers. It's the same problem; it just can't handle batch sizes that are too big.
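A minimal sketch of what that lowered function could look like, assuming the shape of pick_best_batch_size_for_gpu() described above (thresholds and line numbers may differ in your copy of api.py):

import torch

def pick_best_batch_size_for_gpu():
    # Sketch only: clamp the batch size to 1 on smaller GPUs to avoid OOM.
    if torch.cuda.is_available():
        _, available = torch.cuda.mem_get_info()
        availableGb = available / (1024**3)
        if availableGb > 14:
            return 16
        elif availableGb > 7:
            return 1  # was "return 4" in the stock function
    return 1  # fallback for smaller GPUs, including a 4 GB card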

NOTE: This is not in any way a criticism of the great work being done, it's a natural consequence of the way this type of TTS works.

KaleidoscopicPrism commented 1 year ago

I appear to be getting the same error after trying the second solution.

KaleidoscopicPrism commented 1 year ago

I'm pretty sure I didn't even apply the solution right, so can you help me with this? I'm not exactly smart.

# Edited body of pick_best_batch_size_for_gpu() in api.py, with the
# returned batch sizes lowered to 1 for smaller GPUs:
if torch.cuda.is_available():
    _, available = torch.cuda.mem_get_info()
    availableGb = available / (1024**3)
    if availableGb > 14:
        return 16
    elif availableGb > 4:
        return 1
    elif availableGb > 3:
        return 1
return 1

eloop001 commented 1 year ago

Hi. In that case, you would want to look at the file tokenizer.py. It's a bit difficult to grasp, but it tries to split the input text into smaller chunks. You don't have to change anything in that file, but it shows that the "desired length" and "max length" variables are set to, I believe, 200 and 300 respectively. When you call /root/tortoise-tts-fast/scripts/tortoise_tts.py you have the option of setting these parameters; look at line 93. You would call --textsplit "200,300". If you set that to, say, 50,80 you can see whether it works, and then increase the numbers.
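For example, the full call could look something like this (the flag spelling follows the comment above; double-check the script's help output, since the option name may differ slightly between versions):

python3 scripts/tortoise_tts.py --preset high_quality --voice daniel --textsplit "50,80" <somefilename.txt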

Alternatively you can set the parameter --low_vram true. It should pass some calculations to the CPU, and then you might be able to render longer sentences. If they get too short, the output may sound weird.

If all else fails, cloud instances are actually fairly cheap. You can rent a VM on Google Cloud with a T4 GPU for 0.88 USD per hour. Set that against your electricity bill for running your own PC, and what your investment in a bigger GPU would be, and it's really worth it. When you sign up, you also get some free credits to use. (I'm not sponsored in any way!) Amazon and others have similar services; I just find GCP the easiest to use, and you pay by the second only while you are using the instance.

Finally I will say, keep away from Azure. The price versus what you get is completely non-transparent, and you will pay three times the price.

eloop001 commented 1 year ago

About batch sizes: I'm running on a cloud-hosted machine at the moment. 48 GB of VRAM is not enough for a batch size of 128 :) A batch size of 64 seems to work well.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:05:00.0 Off |                  Off |
| 35%   67C    P2   297W / 300W | 41996MiB / 49140MiB  |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

KaleidoscopicPrism commented 1 year ago

Could you tell me how I would set low_vram to true? The sentence is only 3 syllables long, and I'm not sure how to go about doing this.

eloop001 commented 1 year ago

Like this: python3 tortoise_tts.py --preset high_quality --voice daniel --low_vram true --vocoder BigVGAN <somefilename.txt

If you open tortoise_tts.py in a text editor you can see all the different options.

Basically it pushes certain calculations to the CPU. The transfer of data between the GPU and CPU does create some overhead, and it's obviously an advantage if the CPU is as powerful as possible. I speculate that the L2 cache size of the CPU is important.
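A minimal sketch of the general pattern (not the repo's actual implementation; the model and device names here are just placeholders):

import torch
import torch.nn as nn

# Keep the model parked on the CPU and only move it to the GPU for the
# duration of a forward pass, then move it back to free VRAM for the
# next model in the pipeline.
def run_on_gpu_then_offload(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    try:
        with torch.no_grad():
            out = model(batch.to(device))
    finally:
        model.to("cpu")           # offload the weights again
        torch.cuda.empty_cache()  # release cached blocks back to the driver
    return out.cpu()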