Xiaobin-Rong / gtcrn

The official implementation of GTCRN, an ultra-lite speech enhancement model.
MIT License

inference error #32

Status: Closed (closed by yujinqiu 3 months ago)

yujinqiu commented 3 months ago

With the mix.wav included in the project, I get enh.wav without any error. But when I switch to my own voice file, I get a RuntimeError:

Traceback (most recent call last):
  File "/Volumes/Coding/gtcrn/infer.py", line 18, in <module>
    input = torch.stft(torch.from_numpy(mix), 512, 256, 512, torch.hann_window(512).pow(0.5), return_complex=False)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/denoise/lib/python3.10/site-packages/torch/functional.py", line 693, in stft
    input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/denoise/lib/python3.10/site-packages/torch/nn/functional.py", line 4369, in _pad
    return torch._C._nn.reflection_pad1d(input, pad)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (256, 256) at dimension 2 of input [1, 230400, 2]

ffprobe mix.wav

Input #0, wav, from 'mix.wav':
  Duration: 00:00:09.77, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s

ffprobe elevenlabs16k.wav (my voice file)

Input #0, wav, from 'elevenlabs16k.wav':
  Metadata:
    encoder         : Lavf61.1.100
  Duration: 00:00:14.40, bitrate: 512 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 2 channels, s16, 512 kb/s

Here is my voice file, elevenlabs16k.wav.zip. I hope it helps to debug the issue. Unzip it first (GitHub does not allow uploading WAV files directly).

Xiaobin-Rong commented 3 months ago

@yujinqiu Is this possibly because your audio file is stereo? The error reports an input of shape [1, 230400, 2], i.e. two channels, and the model only supports mono input.
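
For anyone hitting the same error, the channel count can also be checked from Python before running inference. This is a minimal sketch using only the standard-library `wave` module; the helper name `count_channels` is hypothetical and not part of the repo, and it assumes an uncompressed PCM WAV file like the two shown above.

```python
import wave


def count_channels(path):
    """Return the number of channels stored in a PCM WAV file."""
    # wave.open only parses the header here, so this is cheap even for long files.
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels()
```

Going by the ffprobe output above, this should report 1 for mix.wav and 2 for elevenlabs16k.wav. Any value other than 1 means the file needs to be downmixed before it is passed to `torch.stft`.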

yujinqiu commented 3 months ago

Thanks, converting the audio to mono fixed my issue.
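
One possible way to do the conversion in Python, sketched under a few assumptions: the audio has already been loaded as a floating-point NumPy array, a stereo file has shape (frames, channels) (which matches the [1, 230400, 2] input in the traceback), and averaging the channels is an acceptable downmix. The helper name `to_mono` is hypothetical and not part of the repo.

```python
import numpy as np


def to_mono(audio):
    """Downmix (frames, channels) audio to 1-D by averaging the channels.

    Audio that is already 1-D (mono) is returned unchanged.
    Assumes floating-point samples; integer PCM would be truncated by the cast.
    """
    audio = np.asarray(audio)
    if audio.ndim == 1:
        return audio
    if audio.ndim == 2:
        # Average across the channel axis and keep the original dtype.
        return audio.mean(axis=1).astype(audio.dtype)
    raise ValueError(f"expected 1-D or 2-D audio, got shape {audio.shape}")
```

Calling `mix = to_mono(mix)` just before the `torch.stft` line in infer.py would then give the 1-D signal that call expects. Alternatively, the file can be converted up front with ffmpeg, for example `ffmpeg -i elevenlabs16k.wav -ac 1 mono.wav` (the output name is only an example).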