2noise / ChatTTS

A generative speech model for daily dialogue.
https://2noise.com
GNU Affero General Public License v3.0
30.52k stars 3.32k forks

CONFUSING! TELlamaModel error and torchaudio.save runtime error #676

Open ckgithub2019 opened 1 month ago

ckgithub2019 commented 1 month ago

1. This issue happens every time. The README says "Optional: Install TransformerEngine if using NVIDIA GPU (Linux only), The adaptation of TransformerEngine is currently under development and CANNOT run properly now". I'm using a GPU on Linux, so must I install it or not? This is confusing:

use default LlamaModel for importing TELlamaModel error: No module named 'transformer_engine'

2. I tested with a very simple code snippet, but it didn't work. Here is the runtime error:

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
text:  12%|███████████▌ | 46/384(max) [00:00, 118.84it/s]
code:  20%|██████████████████▊ | 406/2048(max) [00:01, 213.68it/s]
/home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py:245: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3697.)
  src = src.T
Traceback (most recent call last):
  File "/home/ck/ai_project/tts_server/basic_test.py", line 13, in <module>
    torchaudio.save(f"basic_output{i}.wav", torch.from_numpy(wavs[i]).unsqueeze(0), 24000)
  File "/home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torchaudio/_backend/utils.py", line 313, in save
    return backend.save(
           ^^^^^^^^^^^^^
  File "/home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py", line 316, in save
    save_audio(
  File "/home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py", line 257, in save_audio
    s.write_audio_chunk(0, src)
  File "/home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torio/io/_streaming_media_encoder.py", line 469, in write_audio_chunk
    self._s.write_audio_chunk(i, chunk, pts)
RuntimeError: Input Tensor has to be 2D.
Exception raised from validate_audio_input at /__w/audio/audio/pytorch/audio/src/libtorio/ffmpeg/stream_writer/tensor_converter.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7def216cbf86 in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7def2167add9 in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x57147 (0x7def20ee0147 in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torio/lib/libtorio_ffmpeg4.so)
frame #3: <unknown function> + 0x57afc (0x7def20ee0afc in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torio/lib/libtorio_ffmpeg4.so)
frame #4: torio::io::TensorConverter::convert(at::Tensor const&) + 0x33 (0x7def20ee2723 in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torio/lib/libtorio_ffmpeg4.so)
frame #5: torio::io::EncodeProcess::process(at::Tensor const&, std::optional<double> const&) + 0xbe (0x7def20ed15ee in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torio/lib/libtorio_ffmpeg4.so)
frame #6: torio::io::StreamingMediaEncoder::write_audio_chunk(int, at::Tensor const&, std::optional<double> const&) + 0xa5 (0x7def20edcd85 in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torio/lib/libtorio_ffmpeg4.so)
frame #7: <unknown function> + 0x3a306 (0x7dee63d09306 in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torio/lib/_torio_ffmpeg4.so)
frame #8: <unknown function> + 0x32bf7 (0x7dee63d01bf7 in /home/ck/anaconda3/envs/tts/lib/python3.11/site-packages/torio/lib/_torio_ffmpeg4.so)
frame #9: python() [0x528767]
<omitting python frames>
frame #12: python() [0x5cbeda]
frame #14: python() [0x5ec6a7]
frame #15: python() [0x5e8240]
frame #16: python() [0x5fd192]
frame #21: <unknown function> + 0x29d90 (0x7def73e29d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: __libc_start_main + 0x80 (0x7def73e29e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: python() [0x5bbac3]

The tested code here:

import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load(compile=False) # Set to True for better performance

texts = ["chat T T S is a text to speech model designed for dialogue applications.", "[uv_break]it supports mixed language input [uv_break]"]

wavs = chat.infer(texts)

for i in range(len(wavs)):
    torchaudio.save(f"basic_output{i}.wav", torch.from_numpy(wavs[i]).unsqueeze(0), 24000)

fumiama commented 1 month ago

I'm using a GPU on Linux, so must I install it or not?

No, you should not install it.
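
For anyone who wants to confirm that the "No module named 'transformer_engine'" line only means the optional package is absent, here is a minimal sketch using the standard library. The helper name `has_module` is hypothetical and this is not ChatTTS's own check; it only reports whether the package can be found in the current environment.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found.

    Hypothetical helper for illustration; not part of ChatTTS.
    """
    return importlib.util.find_spec(name) is not None


if has_module("transformer_engine"):
    print("transformer_engine is installed")
else:
    # Matches the situation in the log above: the package is absent
    # and the log reports that the default LlamaModel is used instead.
    print("transformer_engine is not installed")
```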

I tested with a very simple code snippet, but it didn't work

It's a known problem, see #635

ckgithub2019 commented 1 month ago

It works, thanks. So does that mean every "unsqueeze(0)" should be removed by default? Is the example wrong, or is there some other use for "unsqueeze(0)"?

From the example:

torchaudio.save(f"output_sentencelevel{i}.wav", torch.from_numpy(wavs[i]).unsqueeze(0), 24000)
torchaudio.save(f"output_wordlevel{i}.wav", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)

fumiama commented 1 month ago

it works, thanks. so that means all of "unsqueeze(0)" should be removed by default?

No. In fact, some versions of torchaudio (usually newer ones) raise an error if unsqueeze(0) is missing.
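
Since whether unsqueeze(0) is needed depends on the shape of the array being saved, one possible workaround is to normalize the shape before saving rather than hard-coding the call. The sketch below is only an illustration: the helper name `as_channels_by_samples` is hypothetical, and it assumes `wavs[i]` is a NumPy array that is either 1-D (samples) or already has leading singleton dimensions.

```python
import numpy as np


def as_channels_by_samples(wav: np.ndarray) -> np.ndarray:
    """Return `wav` as a 2-D (channels, samples) array.

    Hypothetical helper for illustration; not part of ChatTTS or torchaudio.
    """
    wav = np.asarray(wav)
    # Drop extra leading singleton dimensions, e.g. (1, 1, N) -> (1, N).
    while wav.ndim > 2 and wav.shape[0] == 1:
        wav = wav[0]
    # A bare 1-D signal becomes a single channel: (N,) -> (1, N).
    if wav.ndim == 1:
        wav = wav[np.newaxis, :]
    if wav.ndim != 2:
        raise ValueError(f"cannot convert shape {wav.shape} to (channels, samples)")
    return wav


# Possible usage with the snippet from this issue (requires torch and torchaudio):
# torchaudio.save(f"basic_output{i}.wav",
#                 torch.from_numpy(as_channels_by_samples(wavs[i])), 24000)
```

With this, the same call should pass a 2-D tensor whether the input is 1-D or already 2-D, which is what the "Input Tensor has to be 2D" check in the traceback asks for.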

kevincobain2000 commented 1 week ago

Tried with soundfile instead, which works:

import soundfile
soundfile.write(output_filename, wavs[0], 24000)

fumiama commented 1 week ago

Yes. There are many alternative ways to save audio.
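
As one more alternative, here is a rough sketch that writes a WAV file with only the Python standard library. It assumes the audio is a mono, 1-D sequence of float samples in the range [-1, 1], which is an assumption about the ChatTTS output and not something confirmed in this thread. The function name `save_wav_pcm16` is hypothetical; the 24000 Hz default matches the sample rate used in the snippets above.

```python
import struct
import wave


def save_wav_pcm16(path, samples, sample_rate=24000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; assumes a 1-D sequence of floats.
    """
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Possible usage: save_wav_pcm16("basic_output0.wav", wavs[0])
```

The per-sample loop is slow for long audio, so soundfile or torchaudio is probably the better choice in practice; this version only avoids extra dependencies.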