fatchord / WaveRNN

WaveRNN Vocoder + TTS
https://fatchord.github.io/model_outputs/
MIT License

Identifying the cause of these artifacts #90

Closed: CorentinJ closed this issue 5 years ago

CorentinJ commented 5 years ago

I've been using the alternative WaveRNN version for ~3 months with another version of Tacotron and it has been working well. However, at some point I made modifications to both my Tacotron model and my dataset, and ever since then I cannot generate good-sounding audio with the alternative model anymore, whereas Griffin-Lim still works fine. My project is quite big, so I don't want to bother you with all the details; I'm only asking if anyone has an idea of where the problem might be.

I'm working on 16kHz audio but I've kept the same 50ms window and 12.5ms step time for generating the mel spectrograms. The upsample factors are (5, 5, 8). I had found voc_target = 8000 and voc_overlap = 800 to yield good outputs in the past.
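For reference, these numbers have to line up: the product of the upsample factors must equal the hop length in samples. A minimal sketch of the check, with illustrative variable names rather than the repo's actual hparams:

```python
import math

# Illustrative consistency check between the STFT parameters and the
# upsample factors (variable names are mine, not the repo's hparams).
sample_rate = 16000
win_ms, hop_ms = 50, 12.5

win_length = int(sample_rate * win_ms / 1000)   # 800 samples
hop_length = int(sample_rate * hop_ms / 1000)   # 200 samples
upsample_factors = (5, 5, 8)

# The upsample factors must multiply out to the hop length, otherwise the
# conditioning frames and the audio samples drift apart during generation.
assert math.prod(upsample_factors) == hop_length, (
    f"upsample product {math.prod(upsample_factors)} != hop length {hop_length}"
)
```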

My generated spectrograms are quite smooth (see the attached spectrogram image). Griffin-Lim still does OK: https://puu.sh/DBv7q.wav, while the alternative WaveRNN has these artifacts: https://puu.sh/DBv7U.wav. The model was trained for 110k steps with a batch size of 100.

Here is another example of bad output: https://puu.sh/DBvcF.wav. I found that going back to the default parameters voc_target = 11000 and voc_overlap = 550 sort of got rid of the loud blowing sounds: https://puu.sh/DBvdF.wav, but it still doesn't sound good and there are still these chirping noises (which really shouldn't be there at all; the input is just speech). Not batching the generation yields more or less the same audio.
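For context, voc_target and voc_overlap control how batched generation slices the utterance into overlapping segments that are later cross-faded back together. A rough illustrative sketch of the slicing (not the repo's actual fold/unfold code):

```python
# Illustrative sketch of how batched generation splits a long utterance into
# segments of `target` samples plus `overlap` samples that will be cross-faded
# when the segments are stitched back together. This mirrors the idea behind
# voc_target / voc_overlap but is not the repo's exact implementation.
def split_for_batched_gen(total_len, target, overlap):
    """Return (start, end) sample ranges for each generated segment."""
    segments = []
    start = 0
    while start < total_len:
        end = min(start + target + overlap, total_len)
        segments.append((start, end))
        start += target
    return segments

# e.g. a 4-second clip at 16 kHz with the old settings:
print(split_for_batched_gen(64000, target=8000, overlap=800))
```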

fatchord commented 5 years ago

@CorentinJ I reckon that spectrogram you posted is not good enough to render good audio - I can't see any distinguishable harmonics/formants.
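A quick way to eyeball this is to plot the mel and look for horizontal striations (harmonics) in voiced regions. A minimal sketch, assuming the mel was dumped to a .npy file (the filename is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Load a [n_mels, n_frames] mel spectrogram dumped to disk (hypothetical path).
mel = np.load("example_mel.npy")

plt.figure(figsize=(10, 4))
plt.imshow(mel, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("frame")
plt.ylabel("mel bin")
plt.colorbar()
plt.title("Voiced regions should show clear horizontal harmonic bands")
plt.show()
```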

CorentinJ commented 5 years ago

Hmm, I suspected so, but I figured that since Griffin-Lim was doing OK, this model would too. Anyway, I think my spectrograms are bad because I changed something in the Tacotron loss, which is outside the scope of this issue. Closing.
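As a baseline, a mel can also be inverted with an off-the-shelf Griffin-Lim, e.g. librosa's, to separate spectrogram quality from vocoder issues. A rough sketch assuming a linear-amplitude mel with the parameters above and an n_fft of 2048 (the repo's mels are log-scaled/normalized, so they would need to be de-normalized first):

```python
import numpy as np
import librosa
import soundfile as sf

# Hypothetical mel dump; must be a linear-amplitude mel, not a normalized log-mel.
mel = np.load("example_mel.npy")

wav = librosa.feature.inverse.mel_to_audio(
    mel,
    sr=16000,
    n_fft=2048,
    hop_length=200,
    win_length=800,
    power=1.0,   # 1.0 for an amplitude mel, 2.0 for a power mel
    n_iter=60,   # Griffin-Lim iterations
)
sf.write("griffinlim_check.wav", wav, 16000)
```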

CorentinJ commented 5 years ago

So I've tried many different things and I just cannot get rid of them. I have recloned your repo from the latest commit and changed as little as possible to be sure, but the problem remains. I am training with a batch size of 100 on ground-truth spectrograms (so I'm not using the ones generated by my Tacotron model; I'm using your melspectrogram function). For some epochs the generated audio is OK, and for some it has the same noise: https://puu.sh/DDH7x.zip. You can see the artifacts in the waveform: they're the segments around samples 50k and 70k with strictly positive values (see the attached waveform image).

Sometimes they're entirely negative (see the second attached waveform image).
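A hypothetical diagnostic for spotting these regions automatically: scan the generated waveform for windows whose samples are almost entirely one-sided (all positive or all negative). The file name below is illustrative:

```python
import numpy as np
import librosa

# Load the generated audio at its native sample rate (hypothetical path).
wav, sr = librosa.load("generated.wav", sr=None)

win = 2048  # analysis window in samples
for start in range(0, len(wav) - win, win):
    chunk = wav[start:start + win]
    pos_ratio = np.mean(chunk > 0)
    # Flag windows that are almost entirely positive or entirely negative.
    if pos_ratio > 0.99 or pos_ratio < 0.01:
        print(f"one-sided segment at samples {start}-{start + win} "
              f"(positive fraction {pos_ratio:.2f})")
```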

You said elsewhere that this can happen during training, but I remember training for less time and with a smaller batch size a while back and not getting these artifacts.

oytunturk commented 5 years ago

How many steps did you train the WaveRNN model for? 110k? More iterations might help.


Reopened #90 https://github.com/fatchord/WaveRNN/issues/90.


CorentinJ commented 5 years ago

209k steps with a batch size of 100; I used to get better results with 150k steps and a batch size of 32.

oytunturk commented 5 years ago

Maybe it requires a different learning rate with larger batch size?
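For example, one common heuristic (a general rule of thumb, not something this repo prescribes) is to scale the learning rate roughly linearly with the batch size relative to a known-good setting:

```python
# Linear learning-rate scaling heuristic (illustrative numbers, not the repo's defaults).
base_lr = 1e-4          # learning rate that worked at the old batch size
base_batch_size = 32
new_batch_size = 100

scaled_lr = base_lr * new_batch_size / base_batch_size
print(f"suggested lr for batch size {new_batch_size}: {scaled_lr:.2e}")
```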


CorentinJ commented 5 years ago

No, I think that's fine. Anyway, the artifacts nearly disappeared with enough training steps (~300k), just more than I had expected. Possibly I was doing something wrong by applying pre-emphasis to the spectrograms but not to the target audio.
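For anyone hitting the same pitfall, a sketch of what keeping pre-emphasis consistent looks like (the coefficient and function names are illustrative): if the spectrograms are computed from pre-emphasized audio, the vocoder's target waveform should be pre-emphasized too, and the generated output de-emphasized afterwards.

```python
import numpy as np
from scipy.signal import lfilter

PREEMPH = 0.97  # typical pre-emphasis coefficient (illustrative)

def pre_emphasis(wav):
    # y[n] = x[n] - 0.97 * x[n-1]
    return lfilter([1.0, -PREEMPH], [1.0], wav)

def de_emphasis(wav):
    # Inverse filter: y[n] = x[n] + 0.97 * y[n-1]
    return lfilter([1.0], [1.0, -PREEMPH], wav)

# Round trip should recover the original signal (placeholder audio).
wav = np.random.randn(16000).astype(np.float32)
assert np.allclose(de_emphasis(pre_emphasis(wav)), wav, atol=1e-4)
```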