aedocw / epub2tts

Turn an epub or text file into an audiobook
Apache License 2.0

Issue on audio ratio for XTTS #243

Closed · llongour closed this issue 4 months ago

llongour commented 4 months ago

Hi! It works great with VITS and MS, but every use of XTTS (custom or studio, like the Damien Black example) crashes.

Computing speaker latents...
Reading from 83 to 83
  0%|                                                                                                                                                                                        | 0/5 [00:00<?, ?it/s]Error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
 * (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)
 ... Retrying (1 retries left)
Error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
 * (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)
 ... Retrying (0 retries left)
  0%|                                                                                                                                                                                        | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/lucas/apps/epub2tts/.venv/bin/epub2tts", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/lucas/apps/epub2tts/.venv/lib/python3.11/site-packages/epub2tts.py", line 945, in main
    mybook.read_book(
  File "/home/lucas/apps/epub2tts/.venv/lib/python3.11/site-packages/epub2tts.py", line 648, in read_book
    f"Something is wrong with the audio ({ratio}): {tempwav}"
                                          ^^^^^
UnboundLocalError: cannot access local variable 'ratio' where it is not associated with a value

Any idea?

aedocw commented 4 months ago

Can you paste the full exact command you called epub2tts with?

Also, MAYBE try again with --minratio 0 added and see if that works (though I don't think it will). I suspect you've found an actual bug, but it might have to do with the specific way you're calling the script.
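
For context, the UnboundLocalError at the bottom of your traceback is probably just a secondary symptom of the isin() errors above it. Here is a rough, self-contained sketch of the pattern (not the actual epub2tts source; the names are made up for illustration): ratio is only assigned once TTS succeeds, so when every retry fails, the error message itself blows up.

def fake_tts(sentence):
    # stand-in for the real XTTS call; here it always fails, like in your log
    raise RuntimeError("isin() received an invalid combination of arguments")

def read_sentence(sentence, retries=2):
    for attempt in range(retries, -1, -1):
        try:
            duration = fake_tts(sentence)
            ratio = duration / len(sentence)  # only bound if TTS succeeds
            break
        except RuntimeError as err:
            print(f"Error: {err}\n ... Retrying ({attempt} retries left)")
    # every retry failed, so 'ratio' was never assigned and this line raises
    # UnboundLocalError, masking the real isin() failure
    print(f"Something is wrong with the audio ({ratio})")

read_sentence("This is a test")

So the thing to chase is the isin() error itself, not the ratio message.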

acerbusace commented 4 months ago

Hi, I'm getting the same error. I'm calling it with the same command structure as the README example.

epub2tts mybook.txt --engine xtts --speaker "Damien Black" --cover mybookcover.png --sayparts

aedocw commented 4 months ago

Maybe this has something to do with CUDA; so far I am unable to reproduce it. Can you share platform details, as well as the output at the start of the run? Here is what I see testing on Windows under WSL with a machine that has an nVidia GPU:

Saving to sample-damien-black.m4b
Total characters: 727
Using GPU
VRAM: 8589410304
Loading model: /home/doc/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
 > Using model: xtts
[2024-05-24 08:33:08,729] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-24 08:33:09,080] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2024-05-24 08:33:09,081] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2024-05-24 08:33:09,081] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2024-05-24 08:33:09,081] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /home/doc/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/doc/.cache/torch_extensions/py310_cu121/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.0545198917388916 seconds
[2024-05-24 08:33:10,075] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000}
VRAM: 8589410304
Computing speaker latents...
Reading from 1 to 2

acerbusace commented 4 months ago

I'm also running it on WSL2 Ubuntu. Originally I had the CUDA toolkit installed via sudo apt install nvidia-cuda-toolkit; however, that caused CUDA errors when using the xtts option. I therefore uninstalled it and followed NVIDIA's instructions to install the latest version (v12.5).

Here are my platform details:

CPU: 13th Gen Intel(R) Core(TM) i9-13900K
GPU: NVIDIA GeForce RTX 4090

Here is the output from the start of the run:

Saving to mybook-damien-black.m4b
Total characters: 1415
Using GPU
VRAM: 25756696576
Loading model: /home/user/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
 > Using model: xtts
[2024-05-24 14:49:04,274] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-05-24 14:49:04,990] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
[2024-05-24 14:49:04,990] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2024-05-24 14:49:04,990] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2024-05-24 14:49:04,991] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /home/user/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/user/.cache/torch_extensions/py310_cu121/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.10877370834350586 seconds
[2024-05-24 14:49:06,051] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000, 'invert_mask': True}
VRAM: 25756696576
Computing speaker latents...
Reading from 1 to 1
  0%|                                                                                                                                            | 0/14 [00:00<?, ?it/s]Error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
 * (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)
 ... Retrying (1 retries left)
Error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
 * (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)
 ... Retrying (0 retries left)
  0%|                                                                                                                                            | 0/14 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/.local/bin/epub2tts", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/epub2tts.py", line 945, in main
    mybook.read_book(
  File "/home/user/.local/lib/python3.10/site-packages/epub2tts.py", line 648, in read_book
    f"Something is wrong with the audio ({ratio}): {tempwav}"
UnboundLocalError: local variable 'ratio' referenced before assignment

aedocw commented 4 months ago

I don't see anything obvious going on here unfortunately. Could you try one more thing? Add --no-deepspeed --debug to your command so we can see which specific sentence it's failing on, AND whether it's choking on something related to DeepSpeed.

acerbusace commented 4 months ago

Unfortunately, it seems to still be the same output...

epub2tts mybook.txt --engine xtts --speaker "Damien Black" --sayparts --no-deepspeed --debug
Namespace(sourcefile='mybook.txt', engine='xtts', xtts=None, openai=None, model='tts_models/en/vctk/vits', speaker='Damien Black', scan=False, start=1, end=999, language='en', minratio=88, skiplinks=False, skipfootnotes=False, sayparts=True, audioformat='m4b', bitrate='69k', debug=True, export=None, no_deepspeed=True, skip_cleanup=False, cover=None)
Language selected: en
in main, Speaker is Damien Black
Section speakers: ['Damien Black']
...
Saving to mybook-damien-black.m4b
Total characters: 1415
Using GPU
VRAM: 25756696576
Loading model: /home/user/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
 > Using model: xtts
VRAM: 25756696576
Computing speaker latents...
Reading from 1 to 1
  0%|                                                                                                                                            | 0/14 [00:00<?, ?it/s]Information.
Error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
 * (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)
 ... Retrying (1 retries left)
Information.
Error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
 * (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)
 ... Retrying (0 retries left)
  0%|                                                                                                                                            | 0/14 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/.local/bin/epub2tts", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/epub2tts.py", line 945, in main
    mybook.read_book(
  File "/home/user/.local/lib/python3.10/site-packages/epub2tts.py", line 648, in read_book
    f"Something is wrong with the audio ({ratio}): {tempwav}"
UnboundLocalError: local variable 'ratio' referenced before assignment

aedocw commented 4 months ago

If it's not something weird coming from the text itself (like somehow it's trying to pass a special character, or pass an empty string to TTS), I'm not sure what this could be.

One more confirmation please, just to be sure Coqui TTS is working OK on its own, try: tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --speaker_idx 'Damien Black' --text "This is a test" --out_path test.wav --language_idx en

If that works, and using super simple text like "sample.txt" from the repo still throws the error, I might be out of ideas.

two-9 commented 4 months ago

Downgrade transformers to 4.40.2

pip install transformers==4.40.2

transformers/generation/utils.py: _prepare_attention_mask_for_generation() is broken as of 4.41.0

This only seems to affect model.inference_stream(). Invoking tts from the command line still works in transformers 4.41.0.
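
For reference, the overload mismatch is easy to see in isolation. A minimal sketch, under the assumption that the broken generation path ends up handing the helper a plain Python int as the pad token (which is consistent with the "(test_elements=int, elements=Tensor)" in the error above):

import torch

tokens = torch.tensor([1, 2, 3])
pad_token_id = 0  # plain int (assumption, for illustration)

# Positional form matches the (Tensor elements, Number test_element) overload and works:
print(torch.isin(tokens, pad_token_id))

# Keyword form like the one in the traceback: 'test_elements' as a keyword only exists
# on the Tensor/Tensor overload, so passing an int here raises the same
# "invalid combination of arguments" TypeError shown in this issue.
try:
    torch.isin(elements=tokens, test_elements=pad_token_id)
except TypeError as err:
    print(err)

Downgrading as suggested above avoids whichever code path produces that argument combination.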

aedocw commented 4 months ago

@acerbusace can you try what @two-9 suggested and see if that solves it? If it does, I'll update requirements.txt and pin to that version of transformers.
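
If it helps anyone double-check which version their environment actually picked up, something like this (run in the same virtualenv epub2tts is installed into) should do it:

import transformers
print(transformers.__version__)  # should report 4.40.2 after the downgrade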

acerbusace commented 4 months ago

Can confirm, downgrading transformers to 4.40.2 fixes the issue!

aedocw commented 4 months ago

Excellent! Thanks for checking that, and thank you a ton to @two-9 for the fix; it probably would have taken me a while to think of googling for tensor errors, haha!