MiniXC / simple_hifigan

RuntimeError #1

Closed KevinWang676 closed 1 year ago

KevinWang676 commented 1 year ago

Hi, I got the following error: RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 926, 80, 2]. How can I fix it? Thanks.

MiniXC commented 1 year ago

Hi! Could you create a minimal example of the code you were running? Were you using wav_to_mel or wavs_to_mel? When using wavs_to_mel, make sure the audio has a sampling rate of 22050 Hz and shape [batch_size, samples].

Edit: If you're using wav_to_mel with audio loaded using torchaudio, you have to drop the 0th (channel) dimension like so:

mel_spectrogram = synthesiser.wav_to_mel(
    some_audio[0], some_sample_rate
)
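
The indexing above works because torchaudio.load returns a tensor of shape [channels, samples], so some_audio[0] selects the first channel as a 1D tensor. A minimal sketch of that shape handling, assuming torch is installed; first_channel is a hypothetical helper that is not part of simple_hifigan, and the zeros tensor only stands in for audio loaded from a file:

```python
import torch


def first_channel(waveform: torch.Tensor) -> torch.Tensor:
    """Return a 1D [samples] tensor from [channels, samples] or [samples] input.

    Hypothetical helper for illustration; not part of simple_hifigan.
    """
    if waveform.dim() == 1:
        return waveform  # already [samples]
    if waveform.dim() == 2:
        return waveform[0]  # drop the channel dimension, keep channel 0
    raise ValueError(f"expected 1D or 2D audio, got shape {tuple(waveform.shape)}")


# Stand-in for `waveform, sample_rate = torchaudio.load(path)`:
# one channel, one second of silence at 22050 Hz.
loaded = torch.zeros(1, 22050)
mono = first_channel(loaded)
print(tuple(loaded.shape), "->", tuple(mono.shape))
```

The 1D result is what would then be passed as the first argument to wav_to_mel in the snippet above.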

KevinWang676 commented 1 year ago

Hi, thanks for your reply. My code is

from simple_hifigan import Synthesiser

synthesiser = Synthesiser()

# create a mel spectrogram
mel_spectrogram = synthesiser.wav_to_mel(
    "audio.wav", 24000
)

audio = synthesiser(mel_spectrogram)

Is this the correct usage? The sample rate of my audio is 24000 Hz. Thanks!

MiniXC commented 1 year ago

That should work. I will check; it may be a problem in the library. This hasn't been officially released yet, so sorry about any bugs like this, and thanks for reporting!

KevinWang676 commented 1 year ago

Thanks for checking it. I didn't load my .wav file using torchaudio. The "audio.wav" in my code is just a path. Can that be a problem?

MiniXC commented 1 year ago

The path alone should work fine. I had only tested the method with audio loaded using torchaudio, not with a path, but this has now been fixed in 7d27c8396c8832cb53ecb35b3250a72e4cf3055f.

MiniXC commented 1 year ago

I also added your example to the tests directory (https://github.com/MiniXC/simple_hifigan/tree/main/tests). In the future I should add some proper tests, but for now it's good to have at least one working example there.

KevinWang676 commented 1 year ago

Thanks for fixing it. You mentioned that a 22050 Hz sampling rate would work. Would other sampling rates like 24000 Hz also work?

MiniXC commented 1 year ago

Actually, any sampling rate should work, but if it's too far from 22050 Hz there could be issues. When loading from a file, you don't need to specify the sampling rate, since that information is stored in the file and wav_to_mel accounts for this.
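
To illustrate that the sampling rate is stored in the file itself, here is a small sketch using only Python's standard wave module. It is not how wav_to_mel reads files internally (that is not shown in this thread); wav_sample_rate and the demo file are made up for the example:

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Read the sampling rate from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Write a short silent 24000 Hz mono file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)      # mono
    wav_file.setsampwidth(2)      # 16-bit PCM
    wav_file.setframerate(24000)  # rate recorded in the header
    wav_file.writeframes(b"\x00\x00" * 240)  # 240 silent frames

print(wav_sample_rate(demo_path))
```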

MiniXC commented 1 year ago

Also note that the synthesised audio will always have a sampling rate of 22050 Hz, because that is the rate HiFi-GAN was trained on. If you need the audio at any other sampling rate, you'd have to resample afterwards.
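
As a rough sketch of what resampling afterwards involves, the pure-Python function below resamples by linear interpolation. It is deliberately naive (no anti-aliasing filter) and resample_linear is a made-up name for illustration; for real audio a library resampler such as torchaudio.functional.resample is the better choice:

```python
def resample_linear(samples, orig_sr, target_sr):
    """Naively resample a list of floats by linear interpolation."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples or orig_sr == target_sr:
        return list(samples)
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # sample at or before pos
        hi = min(lo + 1, last)     # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


HIFIGAN_SR = 22050  # output rate stated in this thread
one_second = [0.0] * HIFIGAN_SR  # stand-in for synthesised audio
resampled = resample_linear(one_second, HIFIGAN_SR, 24000)
print(len(resampled))
```

One second of audio at 22050 Hz becomes 24000 samples at 24000 Hz, so the duration is unchanged and only the number of samples per second differs.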

KevinWang676 commented 1 year ago

Got it, thank you so much! I have another question about HiFi-GAN: what is it used for? Is it meant to denoise audio or to improve audio quality?

MiniXC commented 1 year ago

HiFi-GAN is a vocoder: it converts mel spectrograms to audio waveforms. You can find more information on vocoders here and on HiFi-GAN here.

KevinWang676 commented 1 year ago

Thanks, I'll go through the materials.