Not generating audio. None at all. Generates graphs, but pitch silent wav file.

BenAAndrew / Voice-Cloning-App

A Python/Pytorch app for easily synthesising human voices

BSD 3-Clause "New" or "Revised" License

1.4k stars, 233 forks

Not generating audio. None at all. Generates graphs, but pitch silent wav file. #69

Open · FlashlightET opened 3 years ago

FlashlightET commented 3 years ago

Neither the app nor the Colab notebook produces audio, but both do produce graphs.

File from the program: https://cdn.discordapp.com/attachments/879773649714962553/884898833803399188/out.wav

File from Colab (Michael Rosen model): https://cdn.discordapp.com/attachments/879773649714962553/884898840522670140/download.wav

Colab seems to be generating actual audio, since it shows a spectrogram. It just doesn't want to turn it into a WAV file?
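For anyone triaging a report like this, it can help to confirm that the output file is truly silent (all samples zero) rather than just very quiet. The following is a minimal sketch using only the Python standard library. It assumes the file is 16-bit PCM, which may not hold for every output, and the function names `peak_amplitude` and `is_silent` are made up for this illustration and are not part of the app.

```python
import array
import sys
import wave


def peak_amplitude(path):
    """Return the largest absolute sample value in a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            # Assumption of this sketch: only 16-bit samples are handled.
            raise ValueError("this sketch only handles 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        # WAV data is little-endian, so swap bytes on big-endian hosts.
        samples.byteswap()

    return max((abs(s) for s in samples), default=0)


def is_silent(path, threshold=0):
    """Treat the file as silent if no sample exceeds the threshold."""
    return peak_amplitude(path) <= threshold
```

Calling `is_silent("out.wav")` on a downloaded file returns `True` only when every sample is zero. Raising `threshold` slightly would also flag files that contain nothing but very low-level noise.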

BenAAndrew commented 3 years ago

The issue here is due to the model/synthesis quality. A typical alignment graph should look more like this.

[Image: example alignment graph]

If you cannot see a clear line forming, then the alignment is so poor that it won't be able to produce audible results. Please post a message in the #help-wanted channel of our Discord if you'd like assistance identifying why the model/synthesis is not good quality.

BenAAndrew commented 2 years ago

To follow up on this issue, here are some common reasons why your voice may be poor: https://benaandrew.github.io/Voice-Cloning-App/training/#verifying-quality

junqilu commented 1 year ago

Thank you for developing this application, and sorry to bother you. I posted my question on Discord but did not get any responses. I ran into the same issue: the alignment graphs look legitimate for both the training and synthesis steps, but the generated audio is silent. I'm using the local .exe for the project, and every step was done locally on my computer.

Following your instructions on verifying the quality, I have:

Train loss: about 0.08 < 0.5
Attention score: about 0.77 > 0.3
Validation score: about 0.73
Alignment graph: looks like a clear line to me
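The numeric checks in the list above can be written down as a small sketch. The two thresholds (train loss below 0.5, attention score above 0.3) are the ones quoted in this comment. No threshold is given here for the validation score, so it is left out. The function name `check_quality` is hypothetical and does not come from the app.

```python
def check_quality(train_loss, attention_score):
    """Compare metrics against the thresholds quoted in this thread.

    Returns a dict mapping each metric name to True if it passes.
    """
    return {
        "train_loss": train_loss < 0.5,            # lower is better
        "attention_score": attention_score > 0.3,  # higher is better
    }


# Values reported in this comment.
print(check_quality(train_loss=0.08, attention_score=0.77))
```

As this report shows, passing both checks does not by itself guarantee audible output.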

Also, I am pretty sure that my training has passed at least 1000 epochs. For the vocoder, I tried both the one provided on your documentation page and LJ_FT_T2_V3 from the HiFi-GAN GitHub page. Both gave silent audio, but with a legitimate-looking alignment graph.

[Image: screenshot 2023-01-11_131826]

Could you suggest what might be the issue here? I'd really appreciate it!

SirBitesalot commented 1 year ago

What GPU are you using? I remember that some GPUs had some weird issue with some other ML stuff.

junqilu commented 1 year ago

I have two GPUs on my computer: one is Intel(R) Iris(R) Plus Graphics and the other is an NVIDIA GeForce GTX 1660 Ti with Max-Q Design. I followed your video tutorial and installed the driver for the NVIDIA GPU, so I have always assumed the NVIDIA GPU is the one the application is using.

SirBitesalot commented 1 year ago

I think it's an issue with the 1660/1660 Ti: https://github.com/pytorch/pytorch/issues/58123

You could try a different version of cuDNN. Or try running the app from source and see whether removing .half() fixes it, as in this issue: https://github.com/NVIDIA/tacotron2/issues/475
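One way to test the .half() theory when running from source is to inspect the raw synthesised samples for NaN or infinite values before they are written to disk. This is only a sketch under assumptions: the thread does not confirm that half-precision inference produces non-finite values on this GPU, and `count_non_finite` is a hypothetical helper, not a function from the app or from PyTorch.

```python
import math


def count_non_finite(samples):
    """Count NaN and infinite values in an iterable of float samples.

    Returns a (nan_count, inf_count) tuple.
    """
    nan_count = 0
    inf_count = 0
    for value in samples:
        if math.isnan(value):
            nan_count += 1
        elif math.isinf(value):
            inf_count += 1
    return nan_count, inf_count
```

If the audio is held in a PyTorch tensor (an assumption about the app's internals), it could be converted first with something like `audio.float().cpu().flatten().tolist()`. A non-zero count with .half() in place and a zero count without it would support the half-precision explanation.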

junqilu commented 1 year ago

Thanks for the suggestions! I will definitely give it a shot. Does this mean my previously trained model is completely useless now, and I should start fresh after making the changes you suggested?

SirBitesalot commented 1 year ago

I think the model should be fine as only the inference uses .half() and it produces a normal graph.

duckfromdiscord commented 1 year ago

I'm getting silence with a normal graph on a 1660 Ti.