bshall / UniversalVocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"
https://bshall.github.io/UniversalVocoding/
MIT License

Generate audio from mag spectrogram #3

Open tunnermann opened 5 years ago

tunnermann commented 5 years ago

Hey, thanks for your work in this project, it is really good.

I'm trying to use this vocoder to generate wavs from magnitude spectrograms produced by another neural network. Griffin-Lim gives me decent audio, but it sounds a bit robotic, so I think your vocoder would improve it a lot.

The biggest difference between the parameters of the two networks is n_fft: my spectrograms use 1024 and your network uses 2048. If I use your pre-trained model and change only n_fft, the resulting audio is sped up a bit and the voice gets really high-pitched.

I tried retraining the network, changing only n_fft, but the results were not good; there was a lot of noise.

Any leads on what I might try next?
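
For reference, the Griffin-Lim baseline I mention is along these lines (a rough sketch with librosa; the hop and window lengths are assumptions, not values from this repo):

```python
import librosa

# Rough Griffin-Lim baseline for a linear magnitude spectrogram with n_fft=1024.
# hop_length and win_length here are assumptions, not this repo's defaults.
def griffin_lim_baseline(mag, hop_length=256, win_length=1024):
    # mag: (1 + n_fft // 2, frames) linear magnitude spectrogram
    return librosa.griffinlim(mag, n_iter=60, hop_length=hop_length, win_length=win_length)
```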

bshall commented 5 years ago

Hi @tunnermann, no problem.

I've just done a bit of testing. Passing a mel spectrogram with num_fft = 1024 to the pre-trained model does result in some distortion of the audio. However, when I changed num_fft in the config.json and retrained the model from scratch, I got fairly good results. Here are some samples: samples.zip.

Did you do anything else besides changing the one line in config.json?

Also, I'd be happy to share the weights for this model with you if you'd like?

tunnermann commented 5 years ago

@bshall Thanks for your reply.

I did retrain the model with the new n_fft and got good results generating audio from wav files. Maybe my problem is in converting my spectrograms into mel spectrograms and feeding them to the network. I will investigate it further and also retrain the network directly with the generated spectrograms instead of spectrograms derived from the ground-truth audio.
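
For anyone else hitting this, the conversion in question is roughly the following (a sketch using librosa; n_fft=1024 is from my setup above, the other parameters are assumptions, and the final log/normalization step still has to match this repo's own preprocessing):

```python
import librosa
import numpy as np

# Sketch: project a linear magnitude spectrogram (n_fft=1024) onto mel bands.
# sample_rate, num_mels and fmin are assumed values; the log scaling below is
# illustrative and must be replaced by whatever the repo's preprocessing does.
def linear_to_mel(mag, sample_rate=16000, n_fft=1024, num_mels=80, fmin=40):
    # mag: (1 + n_fft // 2, frames) linear magnitude spectrogram
    mel_basis = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin)
    mel = np.dot(mel_basis, mag)                    # (num_mels, frames)
    return 20 * np.log10(np.maximum(1e-5, mel))     # crude dB scaling
```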

Thanks again.

bshall commented 5 years ago

Yeah, that sounds like a reasonable approach. Let me know how it goes or if I can help at all. You can also try fine-tuning the model on the generated spectrograms. That might make experimenting a little faster.
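
If it helps, the fine-tuning idea is roughly this pattern (a generic PyTorch sketch with a placeholder model, checkpoint, and data, not this repo's actual training script):

```python
import torch
from torch import nn

# Generic fine-tuning sketch, NOT this repo's training loop: the model, loss and
# batch are placeholders that only illustrate loading pre-trained weights and
# continuing training at a lower learning rate on generated spectrograms.
class ToyVocoder(nn.Module):
    def __init__(self, num_mels=80, num_classes=256):
        super().__init__()
        self.rnn = nn.GRU(num_mels, 64, batch_first=True)
        self.out = nn.Linear(64, num_classes)

    def forward(self, mels):
        hidden, _ = self.rnn(mels)
        return self.out(hidden)

model = ToyVocoder()
# model.load_state_dict(torch.load("pretrained.pt", map_location="cpu"))  # resume from a checkpoint
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # smaller LR than training from scratch

mels = torch.randn(4, 40, 80)              # stand-in for generated mel frames
targets = torch.randint(0, 256, (4, 40))   # stand-in for quantised audio targets
loss = nn.functional.cross_entropy(model(mels).transpose(1, 2), targets)
loss.backward()
optimizer.step()
```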

Approximetal commented 4 years ago

Hi @bshall @tunnermann, I've run into the same problem. When I use different parameters to extract the mel spectrograms and retrain the model, the loss stops around 2.9 and the result has loud noise. What can I do to adjust the model to get better performance? Here are my config parameters and audio samples. I use several datasets, including multiple languages.

    "preprocessing": {
        "sample_rate": 16000,
        "num_fft": 1024,
        "num_mels": 80,
        "fmin": 40,
        "preemph": 0.97,
        "min_level_db": -100,
        "hop_length": 256,
        "win_length": 1024,
        "bits": 9,
        "num_evaluation_utterances": 10
    },
    "vocoder": {
        "conditioning_channels": 128,
        "embedding_dim": 256,
        "rnn_channels": 896,
        "fc_channels": 512,
        "learning_rate": 1e-4,
        "schedule": {
            "step_size": 20000,
            "gamma": 0.5
        },
        "batch_size": 256,
        "checkpoint_interval": 10000,
        "num_steps": 5000000,
        "sample_frames": 40,
        "audio_slice_frames": 8
    }

audio_samples.zip

bshall commented 4 years ago

Hi @Approximetal,

My guess is that a hop-length of 256 is too large for a sample rate of 16kHz. At this hop-length, each frame is 16ms of audio. Most TTS and vocoder implementations that I've seen use either 12.5ms or 10ms. The ones that use a hop-length of 256 typically have audio at a sample rate of 22050 Hz.

The ZeroSpeech2019 dataset is only recorded at 16kHz, so my default was a hop-length of 200 (12.5ms).
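
As a quick sanity check on the numbers (plain arithmetic, not code from this repo):

```python
# Frame shift implied by a hop length at a given sample rate.
def frame_shift_ms(hop_length, sample_rate):
    return 1000 * hop_length / sample_rate

print(frame_shift_ms(256, 16000))   # 16.0  -> the config above
print(frame_shift_ms(200, 16000))   # 12.5  -> this repo's default
print(frame_shift_ms(256, 22050))   # ~11.6 -> typical 22050 Hz setups
```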

Hope that helps!