fatihkiralioglu commented 4 years ago

Hi, I'm trying to train 16kHz models for both WaveGlow and Tacotron2. For the 16kHz Tacotron2 I used win_length=800 and hop_length=200, and it produced good results with the pretrained 22k WaveGlow model. To get better results I want to train a 16kHz WaveGlow model, and I assume the same values of 800 and 200 should be used for WaveGlow training. If I use these parameters instead of 1024 and 256, can I still use the pretrained 22k WaveGlow model for warm start? I have some reservations because the pretrained 22k WaveGlow model was trained with win_length=1024 and hop_length=256. Thanks.
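To make the comparison between the two parameter sets concrete, here is a minimal sketch of the frame-timing arithmetic. The helper names `frame_ms` and `scale_to_rate` are made up for illustration and are not part of the WaveGlow or Tacotron2 code; the numbers are the ones quoted in this thread.

```python
# Reference values quoted in this thread for the pretrained 22.05 kHz model.
REF_RATE, REF_WIN, REF_HOP = 22050, 1024, 256


def frame_ms(n_samples, sampling_rate):
    """Duration in milliseconds of n_samples at the given sampling rate."""
    return 1000.0 * n_samples / sampling_rate


def scale_to_rate(n_samples, src_rate, dst_rate):
    """Sample count covering roughly the same duration at dst_rate."""
    return round(n_samples * dst_rate / src_rate)


if __name__ == "__main__":
    for name, rate, win, hop in [
        ("22.05 kHz, 1024/256", REF_RATE, REF_WIN, REF_HOP),
        ("16 kHz, 800/200", 16000, 800, 200),
        ("16 kHz, 1024/256", 16000, 1024, 256),
    ]:
        print(f"{name}: window {frame_ms(win, rate):.1f} ms, "
              f"hop {frame_ms(hop, rate):.1f} ms")
    # Window/hop at 16 kHz that match the reference durations most closely.
    print("duration-matched window:", scale_to_rate(REF_WIN, REF_RATE, 16000))
    print("duration-matched hop:", scale_to_rate(REF_HOP, REF_RATE, 16000))
```

With 800/200 at 16kHz the window and hop are 50 ms and 12.5 ms, close to the roughly 46.4 ms and 11.6 ms of the 22.05kHz reference, which may be why the pretrained vocoder still gives usable audio. The layer shapes differ, though, so matching durations does not by itself make the weights reusable.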

ashish-roopan commented 4 years ago

Someone please answer this question. I trained the model after loading the pretrained weights, but after 14K steps the audio is full of noise.

mychiux413 commented 4 years ago

I got the same issue.

I used waveglow_256channels_universal_v5.pt as the pretrained model
I used LJSpeech + VCTK in 16kHz for training data with trimmed silence.

The v5 model should have been trained on mel specs computed with:

"sampling_rate": 22050,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"mel_fmin": 0.0,
"mel_fmax": 8000.0

my mel spec was:
```
"sampling_rate": 16000,
"filter_length": 768,
"hop_length": 192,
"win_length": 768,
"mel_fmin": 0.0,
"mel_fmax": 8000.0
```
Before training, I used the v5 model (22k pretrained) to run inference on my mel spec. The speech was still audible (even for male speakers' specs); of course the pitch is shifted down if I choose to write the output at 16kHz.

After training from the pre-trained model, the loss quickly dropped to about -5.0 within a few steps and stayed around -5.5 over my 25k steps, but all the audio inferred from the 25k-step checkpoint was full of noise (almost no sound).

If I trained without the pre-trained model, the loss dropped very slowly, and the inference results were also full of noise.
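The size of that pitch shift can be estimated with a few lines of arithmetic. This is a sketch under the assumption that the vocoder produces samples meant to be played at 22050 Hz and the file is written at 16000 Hz instead; the function names are hypothetical.

```python
import math


def pitch_shift_semitones(playback_rate, intended_rate):
    """Pitch change, in semitones, when audio meant for intended_rate
    is played back at playback_rate (negative means lower pitch)."""
    return 12.0 * math.log2(playback_rate / intended_rate)


def frequency_after_playback(freq_hz, playback_rate, intended_rate):
    """Frequency actually heard for a component that should be freq_hz."""
    return freq_hz * playback_rate / intended_rate


if __name__ == "__main__":
    shift = pitch_shift_semitones(16000, 22050)
    print(f"pitch shift: {shift:.2f} semitones")
    print(f"a 220.5 Hz tone is heard at "
          f"{frequency_after_playback(220.5, 16000, 22050):.1f} Hz")
```

Under that assumption the shift is roughly five and a half semitones downward, which is consistent with the description of a clearly lower but still intelligible voice.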

mychiux413 commented 4 years ago

Maybe we could modify the code as in #88, then try again.

ashish-roopan commented 4 years ago

So after training the pre-trained model for 25k steps, you are still getting noisy output? I faced the same issue; the output I got from inference with waveglow_256channels_universal_v5.pt was at least audible. I also got a similar loss, around -6.

ashish-roopan commented 4 years ago

#88 may work.

mychiux413 commented 4 years ago

After #88, training 16kHz with the pre-trained model is no longer possible, because WaveGlow.upsample depends on win_length/hop_length.
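A rough illustration of why the checkpoint stops fitting: assuming the upsampling layer is a 1-D transposed convolution whose kernel size is win_length and whose stride is hop_length (an assumption based on the comment above, not verified against #88), its weight tensor changes shape as soon as win_length changes. The helpers below are hypothetical and only do the shape arithmetic, without importing PyTorch.

```python
def upsample_weight_shape(n_mel_channels, kernel_size):
    """Weight shape of a 1-D transposed convolution with groups=1:
    (in_channels, out_channels, kernel_size)."""
    return (n_mel_channels, n_mel_channels, kernel_size)


def upsampled_length(n_frames, kernel_size, stride):
    """Output length of a 1-D transposed convolution with no padding,
    no output padding and dilation 1."""
    return (n_frames - 1) * stride + kernel_size


if __name__ == "__main__":
    pretrained = upsample_weight_shape(80, 1024)   # win_length=1024
    retargeted = upsample_weight_shape(80, 800)    # win_length=800
    print("pretrained:", pretrained, "16 kHz config:", retargeted)
    print("shapes match:", pretrained == retargeted)
    # 80 mel frames with win_length=800, hop_length=200
    print("samples from 80 frames:", upsampled_length(80, 800, 200))
```

A strict state-dict load would reject the mismatched tensor, so a warm start would need either the original 1024/256 settings or a load that skips the upsampling weights.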

ashish-roopan commented 4 years ago

Yes, I also faced the same issue, so I trained the model from scratch. After 100K steps the audio quality is not improving much. The generated audio has audible speech but also some noise. Do you know how many steps are required to get results similar to the official model?

ashish-roopan commented 4 years ago

Have you tried #99? Can we train 16kHz with the pre-trained model using this code?

HiiamCong commented 4 years ago

Hi, I currently have a problem with 16kHz WaveGlow training. My Tacotron2 model is OK (tested with the pre-trained WaveGlow model). I'm trying to train WaveGlow from scratch. I used the WaveGlow code at the master branch with the config.json below.

{
    "train_config": {
        "fp16_run": true,
        "output_directory": "checkpoints",
        "epochs": 100000,
        "learning_rate": 1e-4,
        "sigma": 1.0,
        "iters_per_checkpoint": 2000,
        "batch_size": 12,
        "seed": 1234,
        "checkpoint_path": "",
        "with_tensorboard": false
    },
    "data_config": {
        "training_files": "train_files.txt",
        "segment_length": 16000,
        "sampling_rate": 16000,
        "filter_length": 800,
        "hop_length": 200,
        "win_length": 800,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0
    },
    "waveglow_config": {
        "n_mel_channels": 80,
        "n_flows": 12,
        "n_group": 8,
        "n_early_every": 4,
        "n_early_size": 2,
        "WN_config": {
            "n_layers": 8,
            "n_channels": 256,
            "kernel_size": 3
        }
    }
}

I have trained for 236k steps and every output audio is silent. Hope you guys can shed some light :( Output audio: https://drive.google.com/drive/folders/1hqVHOVoZISP3-BxvJG8n3MCfG6LGF0te?usp=sharing
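Before a long run, a few cheap consistency checks on a config like the one posted above can rule out obvious mismatches. The sketch below is a hypothetical helper, not part of the WaveGlow repository, and the n_group check assumes that audio samples are grouped in blocks of n_group. The posted values pass all of these checks, so they would not explain the silent output by themselves.

```python
def check_audio_config(data_config, n_group):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    sampling_rate = data_config["sampling_rate"]
    if data_config["mel_fmax"] > sampling_rate / 2:
        problems.append("mel_fmax exceeds the Nyquist frequency")
    if data_config["win_length"] > data_config["filter_length"]:
        problems.append("win_length is larger than filter_length")
    if data_config["hop_length"] > data_config["win_length"]:
        problems.append("hop_length is larger than win_length")
    # Assumption: samples are grouped in blocks of n_group during training.
    if data_config["segment_length"] % n_group != 0:
        problems.append("segment_length is not a multiple of n_group")
    return problems


def frames_per_segment(data_config):
    """Number of whole hops that fit into one training segment."""
    return data_config["segment_length"] // data_config["hop_length"]


if __name__ == "__main__":
    posted = {
        "segment_length": 16000, "sampling_rate": 16000,
        "filter_length": 800, "hop_length": 200, "win_length": 800,
        "mel_fmin": 0.0, "mel_fmax": 8000.0,
    }
    print("problems:", check_audio_config(posted, n_group=8))
    print("frames per segment:", frames_per_segment(posted))
```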

STASYA00 commented 4 years ago

Did anyone manage to solve this issue? I'm also training on a 16kHz dataset. To check the model, I trained it on just 12 samples (1 batch) with different parameters, starting from the pretrained model. The first one:

"segment_length": 16000,
"sampling_rate": 16000,
"filter_length": 800,
"hop_length": 200,
"win_length": 800,

"learning_rate": 1e-5

After 500 epochs the loss starts to increase, and all the inferences (500, 1000, ... 5000) give only noise in the output. The second one:

"segment_length": 16000,
"sampling_rate": 16000,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,

"learning_rate": 1e-5

This gives audible speech after 500 epochs, but there's a lot of noise and it's too fast.

The question is: why does the loss increase? Why does the quality on the training set stay the same and not improve, even though the samples have been seen many times? And how can I remove the noise and normalize the audio speed?
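On the speed question, one possible reading (an assumption, since the post does not say at which rate the output file was written) is a mismatch between the rate of the training data and the rate used when saving the generated audio. The hypothetical helper below computes the resulting playback speed factor from the hop lengths and rates involved.

```python
def speed_factor(data_hop, data_rate, model_hop, write_rate):
    """How much faster (>1) or slower (<1) generated audio plays compared
    with the original recording.

    data_hop / data_rate:   hop length and sampling rate used to compute mels
    model_hop:              samples the vocoder emits per mel frame
    write_rate:             sampling rate used when saving the output file
    """
    return (data_hop * write_rate) / (data_rate * model_hop)


if __name__ == "__main__":
    # Mels from 16 kHz audio with hop 256, output saved at 22050 Hz.
    print(f"saved at 22050 Hz: {speed_factor(256, 16000, 256, 22050):.3f}x")
    # Same mels, output saved at 16000 Hz.
    print(f"saved at 16000 Hz: {speed_factor(256, 16000, 256, 16000):.3f}x")
```

If the file was in fact saved at 22050 Hz, writing it at 16000 Hz instead would bring the speed back to normal; if it was already saved at 16000 Hz, the cause lies elsewhere.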

xDuck commented 4 years ago

Was anyone able to figure this out? I also tried training 16k from scratch and had the same experience as @mychiux413.

adrianastan commented 4 years ago

You can find a model trained from scratch on 21 hours of multispeaker 16kHz data (544000 training steps) here: http://adrianastan.com/models/ . Not as good as the NVIDIA release, but it does the job.

The config is as follows:

{
    "train_config": {
        "fp16_run": true,
        "output_directory": "checkpoints_swara",
        "epochs": 100000,
        "learning_rate": 1e-4,
        "sigma": 1.0,
        "iters_per_checkpoint": 2000,
        "batch_size": 8,
        "seed": 1234,
        "checkpoint_path": "",
        "with_tensorboard": false
    },
    "data_config": {
        "training_files": "train_SWARA.txt",
        "segment_length": 16000,
        "sampling_rate": 16000,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },

    "waveglow_config": {
        "n_mel_channels": 80,
        "n_flows": 12,
        "n_group": 8,
        "n_early_every": 4,
        "n_early_size": 2,
        "WN_config": {
            "n_layers": 8,
            "n_channels": 256,
            "kernel_size": 3
        }
    }
}

Perhaps you can warmstart your model from it.

xprilion commented 3 years ago

Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

naba89 commented 3 years ago

> Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

Can you also share your config, please?

Merlin-721 commented 3 years ago

> Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

I get the following exception when loading the model: No module named 'waveglow'
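An error of this kind typically appears when a checkpoint contains a pickled model object rather than only a state dict: pickle stores a reference to the module that defined the class, so that module has to be importable at load time. The sketch below reproduces the mechanism with the standard library only. The module name `fake_waveglow_module` and its `Model` class are made-up stand-ins, and whether the shared checkpoint was saved this way is an assumption.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Hypothetical stand-in for the module that defined the model class.
MODULE_NAME = "fake_waveglow_module"
MODULE_SOURCE = "class Model:\n    def __init__(self):\n        self.n_flows = 12\n"


def demo():
    """Return (load_failed, error_message, restored_n_flows)."""
    with tempfile.TemporaryDirectory() as repo_dir:
        with open(os.path.join(repo_dir, MODULE_NAME + ".py"), "w") as f:
            f.write(MODULE_SOURCE)

        # Saving side: the module is importable, so pickling works.
        sys.path.insert(0, repo_dir)
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        payload = pickle.dumps(module.Model())

        # Loading side without the directory on sys.path: unpickling fails.
        sys.path.remove(repo_dir)
        del sys.modules[MODULE_NAME]
        load_failed, message = False, ""
        try:
            pickle.loads(payload)
        except ModuleNotFoundError as err:
            load_failed, message = True, str(err)

        # Putting the directory back on sys.path lets the load succeed.
        sys.path.insert(0, repo_dir)
        importlib.invalidate_caches()
        restored = pickle.loads(payload)
        sys.path.remove(repo_dir)
        del sys.modules[MODULE_NAME]
        return load_failed, message, restored.n_flows


if __name__ == "__main__":
    print(demo())
```

By analogy, making a module or package named `waveglow` importable before loading (for example by adding the directory that contains it to `sys.path`) may resolve the exception, provided it defines the same classes the checkpoint refers to.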

NVIDIA / waveglow

Training waveglow model for 16kHz #215
