NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License
2.26k stars 529 forks

Model can approximately fit but has consistent artifact on every audio synthesis #202

Open adrienchaton opened 4 years ago

adrienchaton commented 4 years ago

Hello everyone,

I am experimenting with training WaveGlow on a singing-voice dataset. The loss decreases without any noticeable issue, although it hardly gets lower than about -4 (after 4000 epochs on a reduced dataset of 2500 samples). Overall, the model can approximately fit the data, but there is a strong artifact, like a fluctuation, even though the waveform and frequencies can be decently correct locally. I attach some training reconstructions to show what it sounds like:

train_rec.zip

Any guess as to where that could be coming from?

I wondered whether it is due to the squeeze/unsqueeze of the audio, which could cause phase alignment issues. But since the squeeze dimension is small (less than a millisecond), I imagine it would rather create very high-frequency artifacts.
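For scale, the arithmetic behind the "less than a msec" estimate can be sketched as follows. The values `N_GROUP = 8` and `SAMPLE_RATE = 22050` are assumptions based on WaveGlow's published default configuration, not on anything stated in this thread, and `squeeze` / `unsqueeze` are hypothetical plain-Python stand-ins for the tensor reshaping done inside the model.

```python
# Assumed values (WaveGlow's default config; the thread does not state them).
SAMPLE_RATE = 22050  # samples per second
N_GROUP = 8          # consecutive samples folded into one "squeezed" time step


def squeeze(audio, n_group):
    """Group consecutive samples: [T] -> [T // n_group][n_group].

    Trailing samples that do not fill a whole group are dropped.
    """
    usable = len(audio) - len(audio) % n_group
    return [audio[i:i + n_group] for i in range(0, usable, n_group)]


def unsqueeze(groups):
    """Inverse of squeeze: flatten the groups back into one sample list."""
    return [sample for group in groups for sample in group]


# Duration covered by one group, in milliseconds.
group_ms = 1000.0 * N_GROUP / SAMPLE_RATE

# Rate at which group boundaries repeat, in Hz.
boundary_hz = SAMPLE_RATE / N_GROUP

if __name__ == "__main__":
    print(f"one group spans {group_ms:.3f} ms")
    print(f"group boundaries repeat at {boundary_hz:.2f} Hz")
```

Under these assumptions, an artifact tied to group boundaries would be expected around the boundary rate and its harmonics, not as a slow fluctuation, which is in line with the reasoning above.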

Could there be an issue related to the conditioning information?

Thanks!

rafaelvalle commented 4 years ago

Is this VocalSet? :-) How does it sound with the pre-trained model?

adrienchaton commented 4 years ago

Yes it is :)

I did not try your pretrained model for Mel-Spectrogram inversion of speech.

I use VocalSet for experimenting with models for musical sound synthesis, as I find it of good quality and "challenging". It covers a large tessitura as well as diverse techniques, so a model that performs well on it usually does a good job with strings, winds, brass, etc.

rafaelvalle commented 4 years ago

Let us know how it sounds on VocalSet.

adrienchaton commented 4 years ago

Unfortunately, my experiments with WaveGlow on VocalSet have not been conclusive.

I was interested in descriptor-based synthesis: training the model with audio descriptor envelopes instead of Mel-Spectrograms, so that synthesis can then be controlled from a set of acoustic descriptors.
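As an illustration of what a descriptor envelope can look like as a conditioning signal, here is a minimal sketch. The thread does not say which descriptors were used, so the two chosen here (RMS level and zero-crossing rate) are assumptions, as are the function name `frame_descriptors` and the `win_length=1024` / `hop_length=256` framing, which only mirrors the frame rate of a typical mel-spectrogram front end.

```python
import math


def frame_descriptors(audio, win_length=1024, hop_length=256):
    """Return one (rms, zero_crossing_rate) pair per analysis frame.

    Hypothetical helper: frames are taken every `hop_length` samples, so the
    output has the same time resolution as a spectrogram with that hop size.
    """
    if not audio:
        return []
    envelopes = []
    last_start = max(len(audio) - win_length, 0)
    for start in range(0, last_start + 1, hop_length):
        frame = audio[start:start + win_length]
        # Root-mean-square level of the frame (a loudness proxy).
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        zcr = crossings / max(len(frame) - 1, 1)
        envelopes.append((rms, zcr))
    return envelopes


if __name__ == "__main__":
    sr = 22050  # assumed sample rate
    tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
    env = frame_descriptors(tone)
    print(len(env), "frames, first frame:", env[0])
```

Each frame yields two numbers where a mel-spectrogram yields dozens of channels, which makes the conditioning far more compact but also leaves much more of the waveform unspecified.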

Of course, one issue might be that audio descriptors are a weaker and more ambiguous conditioning signal than spectrograms.

Still, they are interesting as control variables, and I was hoping they might also be faster to learn, being a more compact form of conditioning information. So far that has not proved to be the case.

I tried modifying WaveGlow in a couple of ways, adding activation normalization and using a more expressive affine coupling transform (Flow++), but after 60-80 hours of training (single Titan V) none of these trials were conclusive.
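For readers unfamiliar with activation normalization, here is a minimal sketch of the general technique as introduced with Glow: a per-channel scale and bias, initialized from the first batch so that each channel starts with zero mean and unit variance. The author's implementation is not shown in the thread, so the class name `ActNorm`, the plain-list data layout, and the `eps` value are all assumptions made for illustration.

```python
import math


class ActNorm:
    """Per-channel affine layer y = (x + bias) * scale with data-dependent init.

    Hypothetical pure-Python sketch; x is a list of channels, each a list of
    values over time. In a real model, scale and bias would be trainable.
    """

    def __init__(self, eps=1e-6):
        self.eps = eps
        self.scale = None
        self.bias = None

    def initialize(self, x):
        # Choose bias and scale so each channel of this first batch
        # comes out with zero mean and (almost exactly) unit variance.
        self.bias, self.scale = [], []
        for channel in x:
            mean = sum(channel) / len(channel)
            var = sum((v - mean) ** 2 for v in channel) / len(channel)
            self.bias.append(-mean)
            self.scale.append(1.0 / math.sqrt(var + self.eps))

    def forward(self, x):
        if self.scale is None:
            self.initialize(x)
        y = [
            [(v + b) * s for v in channel]
            for channel, b, s in zip(x, self.bias, self.scale)
        ]
        # Each of the T time steps is scaled by the same per-channel factors,
        # so log|det J| = T * sum_c log|scale_c|.
        n_steps = len(x[0])
        log_det = n_steps * sum(math.log(abs(s)) for s in self.scale)
        return y, log_det

    def inverse(self, y):
        return [
            [v / s - b for v in channel]
            for channel, b, s in zip(y, self.bias, self.scale)
        ]
```

The log-determinant term is what enters the flow's likelihood, in the same way as the log-determinants of the coupling layers and invertible 1x1 convolutions.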

Your code is nicely done, thank you for that. Flows are appealing, although still hard to train. Advances in GAN-based sound synthesis may be better suited to prototyping experiments that are less well covered than spectrogram inversion. I'm experimenting with modifying MelGAN, and it seems encouraging!