SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Error in inference. Audio output with no content, all silence #356

Closed: Bubarinokk closed this issue 2 days ago

Bubarinokk commented 3 weeks ago

warnings.warn( You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.

SWivid commented 3 weeks ago

Will this block the inference? If it's just a warning and inference completes normally, that's the expected case.

Bubarinokk commented 3 weeks ago

yes, it is blocking

SWivid commented 3 weeks ago

Could you provide more info, e.g. a full screenshot of the command line output? The warning here

> warnings.warn( You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.

will not cause an error.

medoderi commented 3 weeks ago

I've encountered the same issue and have attempted every suggested solution, but none have been effective.

```
To create a public link, set `share=True` in `launch()`.
C:\Users\+++-\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\whisper\generation_whisper.py:509: FutureWarning: The input name `inputs` is deprecated. Please make sure to use `input_features` instead.
  warnings.warn(
You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
```

omnific9 commented 3 weeks ago

I have the same problem. I'm not sure if the warning itself is blocking inference, but inference is indeed blocked. Running on Windows 11

D:\github\F5-TTS\venv\lib\site-packages\transformers\models\whisper\generation_whisper.py:509: FutureWarning: The input name `inputs` is deprecated. Please make sure to use `input_features` instead.
  warnings.warn(
You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.


EDIT: I tried the CLI. It works. Looks like the problem lies in the Gradio interface.

medoderi commented 3 weeks ago

> I have the same problem. I'm not sure if the warning itself is blocking inference, but inference is indeed blocked. Running on Windows 11.
>
> EDIT: I tried the CLI. It works. Looks like the problem lies in the Gradio interface.

Any solution?

omnific9 commented 3 weeks ago

TL;DR: If you're running the UI on Windows, always provide a reference text in the Advanced Settings to make it run.

OK. I found the issue. It lies in the transcription.

In this line: https://github.com/SWivid/F5-TTS/blob/b0f482421b03e187ee7ca1893458f383e2c289d3/src/f5_tts/infer/utils_infer.py#L126

I changed whisper-large-v3-turbo to whisper-base, and I was able to get the result after a while. I suspect that with whisper-large-v3-turbo it just took way longer because it's a larger model. So when we see it "stuck", it's actually running the transcription; it was just taking too much time.

Somehow this isn't an issue when running the CLI, but when running in Gradio it's extremely slow.

Even when running whisper-base, it's still very slow compared to what it should be (on a 4090 GPU).

Potentially some conflicts exist between Gradio and the Hugging Face pipeline libraries on Windows? Windows is a weird system.
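
For reference, a minimal sketch of that change (mirroring the pipeline call in utils_infer.py; the surrounding code is abbreviated and may differ from the repo):

```python
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Swap "openai/whisper-large-v3-turbo" for the much smaller "openai/whisper-base"
# so the reference-audio transcription finishes quickly.
asr_pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    torch_dtype=torch.float16,
    device=device,
)
```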


SWivid commented 3 weeks ago

Based on the current info, it seems more like a network issue. The inference is blocked because the fetching process for openai/whisper-large-v3-turbo is stuck; if you Ctrl-C while it hangs, you will probably see something like "connect() xxxxx", which means the process is stuck there.

Some possible solutions:

  1. use a VPN
  2. set `export HF_ENDPOINT=https://hf-mirror.com` in your command line environment
  3. manually download the whisper model and place it under C:\Users\YOURUSERNAME\.cache\huggingface\hub\models--openai--whisper-large-v3-turbo; check online tutorials on how to use a local checkpoint for Hugging Face models (see the sketch after this list)
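
A minimal sketch of option 3, assuming the snapshot has already been downloaded (the local directory below is illustrative, not a repo convention):

```python
from transformers import pipeline

# Hypothetical local directory containing the downloaded whisper snapshot;
# transformers accepts a local path in place of a Hub model id.
local_whisper = r"C:\models\whisper-large-v3-turbo"

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model=local_whisper,
)
```
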
omnific9 commented 3 weeks ago

> Based on the current info, it seems more like a network issue. The inference is blocked because the fetching process for openai/whisper-large-v3-turbo is stuck ...

Nope. I saw the large model fully downloaded. And when it was running, I could hear my GPU humming (the same way it hums when running whisper-base).

SWivid commented 3 weeks ago

@omnific9 would commenting out this line help? https://github.com/SWivid/F5-TTS/blob/f7e248e2ced0f1bc6885093d29893a1e4463bc71/src/f5_tts/infer/utils_infer.py#L127 Or it's probably still some problem with the pipeline; no idea then 😔
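
For clarity, the suggested edit would look roughly like this (a sketch around the pipeline call in utils_infer.py; the surrounding file may differ):

```python
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    # torch_dtype=torch.float16,  # commented out: fall back to the default float32
    device=device,
)
```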

omnific9 commented 3 weeks ago

> @omnific9 would commenting out this line help?
> https://github.com/SWivid/F5-TTS/blob/f7e248e2ced0f1bc6885093d29893a1e4463bc71/src/f5_tts/infer/utils_infer.py#L127

Huh... that worked. So whisper-large doesn't work with float16? Or is this only a problem on Windows?

SWivid commented 3 weeks ago

> So whisper-large doesn't work with float16? Or is this only a problem on Windows?

Maybe the GPU? What GPU device are you using?

omnific9 commented 3 weeks ago

> Maybe the GPU? What GPU device are you using?

RTX 4090

SWivid commented 3 weeks ago

> RTX 4090

The RTX 4090 definitely supports fp16, so it's probably a problem with the platform (Windows/Linux, torch/CUDA versions, transformers pipeline). Dunno; it's not clear based on the current info.

hongyu2024 commented 3 weeks ago

I'm also having the same issue. I'm sure the whisper-large-v3-turbo model has been fully downloaded locally.

armthug213 commented 3 weeks ago

So @SWivid, what is the fix for my issue? Thanks. My issue is audio output without content, silent. I tried using a VPN and it still didn't work.

SWivid commented 3 weeks ago

> @omnific9 would commenting out this line help?
> https://github.com/SWivid/F5-TTS/blob/f7e248e2ced0f1bc6885093d29893a1e4463bc71/src/f5_tts/infer/utils_infer.py#L127

@armthug213 have you tried this?

And what torch and CUDA versions are you using? (This might help figure out a global solution for the issue.) Run `pip show torch` and `nvcc -V`.

armthug213 commented 3 weeks ago

> @omnific9 would commenting out this line help? https://github.com/SWivid/F5-TTS/blob/f7e248e2ced0f1bc6885093d29893a1e4463bc71/src/f5_tts/infer/utils_infer.py#L127

> @armthug213 have you tried this?

@SWivid not yet

> And what torch and CUDA versions are you using? (This might help figure out a global solution for the issue.) Run `pip show torch` and `nvcc -V`.

Here: [screenshot of torch/CUDA version output]

SWivid commented 3 weeks ago

@armthug213 commenting out fp16 for the pipeline will probably work. And thanks for providing the torch/CUDA version info; it seems all right.

One last thing I can think of is the transformers package version: for me, 4.39.3 works well on Linux, and 4.45.2 is also fine on Windows. If that doesn't change anything, I'm at a loss as to what is going wrong, because I cannot reproduce this failure on my side.

armthug213 commented 3 weeks ago

@SWivid @omnific9 just did that and still no change, same empty output!!

[screenshot]

armthug213 commented 3 weeks ago

> One last thing I can think of is the transformers package version: for me, 4.39.3 works well on Linux, and 4.45.2 is also fine on Windows. If that doesn't change anything, I'm at a loss as to what is going wrong, because I cannot reproduce this failure on my side.

@SWivid so how would I tackle this transformers package? Can you walk me through the process? Thanks.

SWivid commented 3 weeks ago

@armthug213 so what is the command line output after you comment out torch_dtype=torch.float16? If you end up with the ASR pipeline failing every time, I would suggest you pass in a ref_text rather than using ASR transcription (because it is not clear how your case can be reproduced).
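
For example, passing the reference text explicitly on the CLI looks roughly like this (flag names as in the repo's inference examples; paths and texts below are placeholders):

```bash
f5-tts_infer-cli \
  --model "F5-TTS" \
  --ref_audio "ref_audio.wav" \
  --ref_text "A transcript of the reference audio." \
  --gen_text "The text you want synthesized."
```

With --ref_text provided, the Whisper ASR step is skipped entirely, so any pipeline or dtype problem in the transcription path no longer matters.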

armthug213 commented 3 weeks ago

> @armthug213 so what is the command line output after you comment out torch_dtype=torch.float16?

[screenshot of command line output]

armthug213 commented 3 weeks ago

@omnific9 did it work for you? If yes, what was the fix, and could you walk me through the fix process? I'm not that techie. I appreciate it, thanks.

medoderi commented 3 weeks ago

Update: I downloaded whisper-large-v3-turbo and the program worked for me, but it is very slow. It took more than 10 minutes for approximately 8 words. When I checked the task manager, I found it was using the CPU, not the GPU! Can the GPU be specified?

I have an Intel Xe GPU and an Intel Arc A350M. The processor is i9, Win11.

> TL;DR: If you're running the UI on Windows, always provide a reference text in the Advanced Settings to make it run.

SWivid commented 3 weeks ago

@medoderi run `gpustat` (needs `pip install gpustat`) or `nvidia-smi` to see the GPU rank, then `CUDA_VISIBLE_DEVICES=0 f5-tts_infer-gradio` if it is rank 0.

281807424 commented 3 weeks ago

[screenshot] A silent output audio too. It shows "Using custom reference text", and CUDA is running. I guess it's not about whisper or the GPU.

hongyu2024 commented 3 weeks ago

A silent output audio. The methods above don't seem to work for me, whether in Gradio or the CLI. Can anyone offer some help? Thanks. @SWivid

```
(.venv) D:\TTS\F5-TTS>f5-tts_infer-cli
Download Vocos from huggingface charactr/vocos-mel-24khz
Using F5-TTS...

vocab : D:\TTS\F5-TTS\.venv\lib\site-packages\f5_tts\infer\examples\vocab.txt
tokenizer : custom
model : C:\Users\Dell\.cache\huggingface\hub\models--SWivid--F5-TTS\snapshots\995ff41929c08ff968786b448a384330438b5cb6\F5TTS_Base\model_1200000.safetensors

Converting audio...
Using custom reference text...
Voice: main
Ref_audio: C:\Users\Dell\AppData\Local\Temp\tmp0xdgir1v.wav
Ref_text: Some call me nature, others call me mother nature.
No voice tag found, using main.
Voice: main
gen_text 0 I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring.
Generating audio in 1 batches...
  0%|          | 0/1 [00:00<?, ?it/s]
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Dell\AppData\Local\Temp\jieba.cache
Loading model cost 1.277 seconds.
Prefix dict has been built successfully.
D:\TTS\F5-TTS\.venv\lib\site-packages\f5_tts\model\modules.py:436: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [02:10<00:00, 130.72s/it]
tests\infer_cli_out.wav

(.venv) D:\TTS\F5-TTS>pip show torch
Name: torch
Version: 2.3.0+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: d:\tts\f5-tts\.venv\lib\site-packages
Requires: filelock, fsspec, jinja2, mkl, networkx, sympy, typing-extensions
Required-by: accelerate, bitsandbytes, ema-pytorch, encodec, f5-tts, torchaudio, torchdiffeq, vocos, x-transformers

(.venv) D:\TTS\F5-TTS>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```

tests\infer_cli_out.wav is a silent file....

SWivid commented 3 weeks ago

@hongyu2024 change https://github.com/SWivid/F5-TTS/blob/61ff2a62d9487e3362ffa5680007e788ad764065/src/f5_tts/infer/utils_infer.py#L324 to `audio, sr = torchaudio.load(ref_audio, backend="soundfile")` and check if that works.

armthug213 commented 3 weeks ago

@281807424 @hongyu2024 @Bubarinokk @omnific9

Has any one of you installed ComfyUI locally? Kindly let me know; I'm doing some investigation.

hongyu2024 commented 3 weeks ago

> Has any one of you installed ComfyUI locally? Kindly let me know; I'm doing some investigation.

No, it's not installed.

armthug213 commented 3 weeks ago

[screenshot of whisper folder search results]

I did a search on my laptop for the term "whisper" and found these two folders. Should there be 2 folders? BTW, I installed ComfyUI locally 2 months ago.

@SWivid @omnific9

SWivid commented 3 weeks ago

@armthug213 does ComfyUI do anything related to this repo or this issue? And for what purpose are you searching for the term "whisper"?

armthug213 commented 3 weeks ago

> @armthug213 does ComfyUI do anything related to this repo or this issue? And for what purpose are you searching for the term "whisper"?

Just trying to find the root of this audio-output-with-no-content issue, in case there is a conflict causing it.

BTW, while installing F5-TTS I faced an issue with PyTorch (I couldn't launch the F5-TTS interface), and I solved it with this video fix: https://www.youtube.com/watch?v=ca34C8ZUI0A

SWivid commented 3 weeks ago

> Just trying to find the root of this audio-output-with-no-content issue, in case there is a conflict causing it.

Yes, it may be the cause. Try using a separate env:

```bash
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n f5-tts python=3.10
conda activate f5-tts
```

hongyu2024 commented 3 weeks ago

> @hongyu2024 change https://github.com/SWivid/F5-TTS/blob/61ff2a62d9487e3362ffa5680007e788ad764065/src/f5_tts/infer/utils_infer.py#L324 to `audio, sr = torchaudio.load(ref_audio, backend="soundfile")` and check if that works.

Thanks... I modified L324 and switched from venv to conda. It still won't work. I'll try a different OS...

[screenshots]

SWivid commented 3 weeks ago

@hongyu2024 so have you already tried dtype=torch.float32?

hongyu2024 commented 3 weeks ago

> @hongyu2024 so have you already tried dtype=torch.float32?

Thank you so much! This is effective; the sound is output (CLI & Gradio).

[screenshot]

Maenod commented 3 weeks ago

> @hongyu2024 so have you already tried dtype=torch.float32?

I tried to change it but I encounter an error :(

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU

SWivid commented 3 weeks ago

> torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU

You haven't provided the GPU info in this issue. If you are using a GPU with relatively limited memory, provide a ref_text rather than using the ASR model to transcribe.

Maenod commented 3 weeks ago

> GPU info

I have an NVIDIA GeForce GTX 1650 with 8 GB.

Thank you a lot :) It works and generates audio, but it shows me some errors in the command prompt, as mentioned below. Is there anything else I need to change?

Error message:

```
Starting app...
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
gen_text 0 Welcome to your home
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\maena\AppData\Local\Temp\jieba.cache
Loading model cost 0.682 seconds.
Prefix dict has been built successfully.
C:\Users\maena\Desktop\F5-TTS\src\f5_tts\model\modules.py:436: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
C:\ProgramData\miniconda3\envs\f5\lib\site-packages\gradio\processing_utils.py:574: UserWarning: Trying to convert audio automatically from float32 to 16-bit int format.
  warnings.warn(warning.format(data.dtype))
```

SWivid commented 3 weeks ago

> Thank you a lot :) It works and generates audio, but it shows me some errors in the command prompt, as mentioned below. Is there anything else I need to change?

If it's working fine, just ignore the warnings.

armthug213 commented 3 weeks ago

Good news :) it finally worked for me ...

> Thank you so much! This is effective; the sound is output (CLI & Gradio).

I used this ☝ and the other edit suggestions mentioned here. Here are screenshots of everything I changed in the utils_infer.py file:

[screenshots of utils_infer.py edits]

I'm really not sure which change exactly made it work, but it's worth knowing.

281807424 commented 3 weeks ago

> Thank you so much! This is effective; the sound is output (CLI & Gradio).

Thanks! It works for me too, though I have no idea why.

medoderi commented 3 weeks ago

> @medoderi run `gpustat` (needs `pip install gpustat`) or `nvidia-smi` to see the GPU rank, then `CUDA_VISIBLE_DEVICES=0 f5-tts_infer-gradio` if it is rank 0.

This is specific to Nvidia; however, I have an Intel Xe GPU and an Intel Arc A350M. Is there a way to switch the interface from the CPU to the GPU?

SWivid commented 3 weeks ago

@medoderi not sure about Intel GPUs; maybe try torch 2.4 and replace .to(device) with .to("xpu").
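
A minimal sketch of that idea, assuming a PyTorch build (2.4+) that ships the prototype Intel GPU (XPU) backend:

```python
import torch

# Prefer Intel's XPU backend when present; builds without Intel GPU support
# do not expose torch.xpu, so guard with hasattr before querying it.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

x = torch.randn(2, 3).to(device)  # any tensor or nn.Module moves the same way
print(x.device)
```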

sankexin commented 2 weeks ago

> Good news :) it finally worked for me ... I'm really not sure which change exactly made it work, but it's worth knowing.

Thanks! It works for me too. You need all three changes:

Modify F5-TTS/src/f5_tts/infer/utils_infer.py:

1. Comment out the dtype argument in the ASR pipeline:

```python
asr_pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    # torch_dtype=dtype,
    device=device,
)
```

2. Force float32:

```python
dtype = torch.float32  # if mel_spec_type == "bigvgan" else None
```

3. Load the reference audio with the soundfile backend:

```python
audio, sr = torchaudio.load(ref_audio, backend="soundfile")
```

suwei999 commented 2 weeks ago

You don't need to change the code for this; just changing the CUDA version fixes the problem.

SiddiumCore commented 1 week ago

> You don't need to change the code for this; just changing the CUDA version fixes the problem.

How do you modify the CUDA version?

fatih-dogmus commented 1 week ago

> You don't need to change the code for this; just changing the CUDA version fixes the problem.

No, it doesn't matter.

[screenshot]