NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[HIFIGAN] How to train a model of 44100 sampling rate? #1277

Open godspirit00 opened 1 year ago

godspirit00 commented 1 year ago

I tried to set the relevant arguments in train.py to --sampling_rate 44100 --filter_length 2048 --hop_length 512 --win_length 2048, but got the following error:

train.py:412: UserWarning: Using a target size (torch.Size([24, 80, 8])) that is different to the input size (torch.Size([24, 80, 16])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss_mel = F.l1_loss(y_mel, y_g_hat_mel) * 45
Traceback (most recent call last):
  File "train.py", line 507, in <module>
    main()
  File "train.py", line 412, in main
    loss_mel = F.l1_loss(y_mel, y_g_hat_mel) * 45
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 3230, in l1_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/functional.py", line 75, in broadcast_tensors
    return _VF.broadcast_tensors(tensors)  # type: ignore[attr-defined]
RuntimeError: The size of tensor a (16) must match the size of tensor b (8) at non-singleton dimension 2

So how can I train a model at a 44100 Hz sampling rate? Thank you.

itamar-dw commented 1 year ago

You need to set the sampling rate also when creating the mel-spectrogram features from raw audio. These were probably created using a sampling rate of 22050 Hz, so you get a factor of 2 in the number of windows (8 vs 16, as seen in the error message).
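The factor-of-2 mismatch can be sketched numerically. The number of mel frames is roughly n_samples // hop_length, so audio that is actually at 22050 Hz produces half as many frames as the 44100 Hz config expects for the same clip duration. The 8192-sample segment length and the non-centered frame-count formula below are illustrative assumptions, not values taken from this repo's config:

```python
# Sketch (assumption: non-centered STFT, frames ~ n_samples // hop_length).
def mel_frames(n_samples: int, hop_length: int) -> int:
    """Approximate number of STFT/mel frames for a non-centered transform."""
    return n_samples // hop_length

hop = 512                       # --hop_length passed on the command line
duration_s = 8192 / 44100       # one training segment's duration (assumed)

# Audio really sampled at 44.1 kHz vs. features built from 22.05 kHz audio:
frames_44k = mel_frames(int(duration_s * 44100), hop)
frames_22k = mel_frames(int(duration_s * 22050), hop)

print(frames_44k, frames_22k)   # 16 8 -- the mismatch in the traceback
```

Regenerating the mel features from audio resampled to 44100 Hz (with the matching hop/win/filter lengths) makes both tensors the same length, and the L1 loss no longer broadcasts.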