aedocw / epub2tts

Turn an epub or text file into an audiobook
Apache License 2.0

out of VRAM error when supplying speaker files #171

Closed. rejuce closed this issue 6 months ago.

rejuce commented 6 months ago

I now tried the example usage: epub2tts my-book.epub --start 4 --end 20 --xtts shadow-1.wav,shadow-2.wav,shadow-3.wav, supplying three speaker sample files.

It seems to first load the model, which fills up almost all of my 4 GB of VRAM; computing the speaker latents then requests an additional 104 MB of VRAM, which is no longer available, so it fails.

```
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jk/.cache/torch_extensions/py310_cu121/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.0373845100402832 seconds
[2024-01-07 09:44:56,629] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000}
VRAM: 4294443008
Computing speaker latents...
Reading from 4 to 4
  0%|          | 0/21 [00:00<?, ?it/s]
Requested: 104873984
Free: 56203675
Total: 4294443008
Error: Workspace can't be allocated, no enough memory.
...
Retrying (1 retries left)
Requested: 104873984
Free: 56203675
Total: 1811846896
```
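For reference, the numbers in the log line up with the failure: the requested 104873984 bytes are roughly 100 MB, while only 56203675 bytes (about 54 MB) remain free once the model is loaded. One way to watch how much headroom is left while epub2tts runs, assuming nvidia-smi is available, is to poll it from a second terminal:

```
# Poll GPU memory use once per second while epub2tts runs
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```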

Can this be worked around? Or is it simply not possible to supply custom speaker .wav files with 4 GB of VRAM?

aedocw commented 6 months ago

Maybe you can try supplying just one custom speaker wav? Also, how long were your samples? I don't think it uses more than 30 seconds per sample anyway.
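If your clips are much longer than that, you could rule length out by trimming them to 30 seconds first. A minimal sketch with ffmpeg (the file name is just your sample's name):

```
# Keep only the first 30 seconds of a speaker sample (re-encoded as PCM wav)
ffmpeg -i shadow-1.wav -t 30 shadow-1-short.wav
```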

I think other folks are using this with GPUs that have only 4 GB of RAM, so it should work, but now that I think about it, that was before I added support for DeepSpeed. I don't think you'll have enough RAM for that. The other thing I would suggest is trying to remove deepspeed from the environment.
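If DeepSpeed was installed with pip, uninstalling it from the same environment should be enough, something like:

```
# Remove deepspeed so epub2tts falls back to plain PyTorch inference
pip uninstall deepspeed
```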

I will also take this as an action item for me: add a "--deepspeed false" option so the user can choose NOT to use DeepSpeed even if the package is detected.

aedocw commented 6 months ago

I'm about to merge a small commit (that I DID test, I swear!) which adds the option "--no-deepspeed". If that arg is passed, DeepSpeed will not be used. This will probably help, and it may allow you to use three speaker samples, which does improve the voice output.
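For example, reusing the command from the top of this issue, the new flag would look like this:

```
# Same invocation as before, but with DeepSpeed explicitly disabled
epub2tts my-book.epub --start 4 --end 20 --xtts shadow-1.wav,shadow-2.wav,shadow-3.wav --no-deepspeed
```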

rejuce commented 6 months ago

Yes, that is working, thank you for your work on this great project. Installing DeepSpeed did not seem to make much difference in inference speed for me anyway, on an RTX A2000.

aedocw commented 6 months ago

That's excellent, glad this helped!