NVIDIA / nv-wavenet

Reference implementation of real-time autoregressive wavenet inference
BSD 3-Clause "New" or "Revised" License

Error of mismatching sampling rate with tacotron2 and nv-wavenet #34

Closed yhgon closed 6 years ago

yhgon commented 6 years ago

For Tacotron 2 training with the LJ Speech dataset we use 22 kHz sampling, but the nv-wavenet PyTorch implementation only supports 16 kHz sampling.

yhgon commented 6 years ago

Detailed error information. Case 1: the input dataset is 22 kHz wav files, while the config.json option is set for 16 kHz sampling. This produces the error below:

Traceback (most recent call last):
  File "train.py", line 197, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 132, in train
    for i, batch in enumerate(train_loader):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 55, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/scratch/github/NVIDIA/nv-wavenet/pytorch/mel2samp_onehot.py", line 79, in __getitem__
    sampling_rate, self.sampling_rate))
IndexError: tuple index out of range

output directory checkpoints-2018-0601-lj
Epoch: 0
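A side note on the Case 1 traceback: the IndexError at mel2samp_onehot.py line 79 likely masks the intended sampling-rate error. A str.format call with more `{}` placeholders than arguments raises IndexError ("tuple index out of range" on Python 3.6), so a malformed message in the sampling-rate check would produce exactly this trace. A minimal sketch; the function name and message below are hypothetical, not the repo's actual code:

```python
# Hypothetical reconstruction: if the error message's format string has
# more "{}" placeholders than arguments, str.format raises IndexError,
# hiding the intended sampling-rate mismatch ValueError.
def check_sampling_rate(sampling_rate, target_rate):
    if sampling_rate != target_rate:
        # one placeholder too many -> IndexError instead of ValueError
        raise ValueError("{} SR doesn't match target {} SR {}".format(
            sampling_rate, target_rate))

try:
    check_sampling_rate(22050, 16000)
except IndexError:
    print("IndexError raised instead of the intended ValueError")
```

So the "tuple index out of range" here is best read as "the sampling rates do not match", not as an indexing bug in the dataset itself.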

Case 2: the input dataset is 22 kHz wav files and the config.json option is set for 22 kHz sampling. This produces the error below:

output directory checkpoints-2018-0601-lj
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 197, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 140, in train
    y_pred = model(x)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/github/NVIDIA/nv-wavenet/pytorch/wavenet.py", line 107, in forward
    assert(cond_input.size(2) >= forward_input.size(1))
AssertionError
rafaelvalle commented 6 years ago

@yhgon The tacotron2 repo supports any sampling rate. Just update sampling_rate and the other parameters accordingly to match your wavenet: https://github.com/NVIDIA/tacotron2/blob/master/hparams.py

yhgon commented 6 years ago

@rafaelvalle The current PyTorch nv-wavenet implementation doesn't support the Param2 case below; it only supports the Param1 case, as you mentioned. I worry about voice quality: I think Param2 would give the best quality, but the current implementation cannot run that configuration, as shown in the Case 2 error. What training configuration do you recommend for Tacotron 2 + nv-wavenet in the down-sampling case? And what is your opinion on the quality loss in the Param1 and Param3 cases?

Down-sampling for both (training):

| Param1 | tacotron2 | nv-wavenet |
| --- | --- | --- |
| sampling_rate | 16K | 16K |
| segment_length | 16K | 16K |
| filter_length | 800 | 800 |
| hop_length | 200 | 200 |
| win_length | 800 | 800 |
| mel_channels | 80 | 80 |

Matching all parameters:

| Param2 | tacotron2 | nv-wavenet |
| --- | --- | --- |
| sampling_rate | 22K | 22K |
| segment_length | 22K | 22K |
| filter_length | 1024 | 1024 |
| hop_length | 256 | 256 |
| win_length | 1024 | 1024 |
| mel_channels | 80 | 80 |

Down-sampling during inference:

| Param3 | tacotron2 | nv-wavenet |
| --- | --- | --- |
| sampling_rate | 22K | 16K |
| segment_length | 22K | 16K |
| filter_length | 1024 | 800 |
| hop_length | 256 | 200 |
| win_length | 1024 | 800 |
| mel_channels | 80 | 80 |

rafaelvalle commented 6 years ago

One should match all parameters. nv-wavenet parameters can be set in this file: https://github.com/NVIDIA/nv-wavenet/blob/master/pytorch/config.json The most relevant params are: stride, win_length, sampling_rate, upsamp_window, upsamp_stride.
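One way to read "match all parameters": the ConvTranspose1d upsampling parameters should mirror the STFT analysis parameters so that the conditioning frames line up with the audio samples. A rough sanity-check sketch (this helper is my own assumption, not part of the repo):

```python
# Hypothetical consistency check: upsamp_stride should equal hop_length and
# upsamp_window should equal win_length, so the upsampled conditioning
# aligns one-to-one with the waveform samples.
def check_config(cfg):
    data, wn = cfg["data_config"], cfg["wavenet_config"]
    assert wn["upsamp_stride"] == data["hop_length"], \
        "upsamp_stride should match hop_length"
    assert wn["upsamp_window"] == data["win_length"], \
        "upsamp_window should match win_length"

# Example with the values used later in this thread:
check_config({
    "data_config": {"hop_length": 256, "win_length": 1024},
    "wavenet_config": {"upsamp_stride": 256, "upsamp_window": 1024},
})
```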

zhf459 commented 6 years ago

@yhgon When you train with LJSpeech, nv-wavenet requires assert(cond_input.size(2) >= forward_input.size(1)), which means the length of nv-wavenet's upsampled conditioning output has to be >= segment_length (T). Suppose the condition has size (k, condition_channels), win_length is the kernel size, and hop_length is the stride. Then, with no padding, you have to satisfy (k - 1) * stride + kernel_size >= segment_length; see torch.nn.ConvTranspose1d.
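This constraint can be checked numerically without PyTorch: with no padding, ConvTranspose1d produces (k - 1) * stride + kernel_size output steps from k input frames. For segment_length 22050, hop_length 256, and win_length 1024, a sketch (the frame count of 87 is an illustrative assumption, not taken from the repo):

```python
# Output length of ConvTranspose1d with no padding:
#   L_out = (L_in - 1) * stride + kernel_size
def upsampled_length(n_frames, kernel_size, stride):
    return (n_frames - 1) * stride + kernel_size

segment_length = 22050   # from data_config
kernel_size = 1024       # upsamp_window == win_length
stride = 256             # upsamp_stride == hop_length

# e.g. 87 mel frames: (87 - 1) * 256 + 1024 = 23040 >= 22050, so the
# assertion in wavenet.py would pass
n_frames = 87
assert upsampled_length(n_frames, kernel_size, stride) >= segment_length
```

If the mel frames cover fewer samples than segment_length (for example because the config's sampling_rate disagrees with the audio), the inequality fails and the AssertionError from Case 2 appears.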

yhgon commented 6 years ago

@rafaelvalle @zhf459 Thanks for your comments. I found the cause of the problem: when I set sampling_rate to 22050 instead of 22000, it works well.

{
    "train_config": {
        "output_directory": "checkpoints-2018-0604-lj-22k",
        "epochs": 100000,
        "learning_rate": 1e-3,
        "iters_per_checkpoint": 1000,
        "batch_size": 12,
        "seed": 1234,
        "checkpoint_path": ""
    },

    "data_config": {
        "training_files": "lj_train_files.txt",
        "segment_length": 22050,
        "mu_quantization": 256,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "sampling_rate": 22050
    },

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },

    "wavenet_config": {
        "n_in_channels": 256,
        "n_layers": 16,
        "max_dilation": 128,
        "n_residual_channels": 64,
        "n_skip_channels": 256,
        "n_out_channels": 256,
        "n_cond_channels": 80,
        "upsamp_window": 1024,
        "upsamp_stride": 256
    }
}
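For anyone hitting the same mismatch: LJSpeech ships at 22050 Hz, and the native rate of a wav file can be verified with the standard library before editing the config. A small sketch (the file path in the comment is a placeholder):

```python
import wave

# Read the sampling rate stored in a wav file's header (stdlib only).
def wav_sampling_rate(path):
    with wave.open(path, "rb") as w:
        return w.getframerate()

# Example: compare against the sampling_rate in config.json, e.g.
#   wav_sampling_rate("LJ001-0001.wav") == 22050
```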
rafaelvalle commented 6 years ago

@yhgon please close the issue if it is resolved.

yhgon commented 6 years ago

Resolved with the right config.