Audio-WestlakeU / FullSubNet

PyTorch implementation of "FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement."
https://fullsubnet.readthedocs.io/en/latest/
MIT License

Training and Validation cRM Mismatch #46

Closed jhkonan closed 2 years ago

jhkonan commented 2 years ago

During training, with batch size 10, we observe the following shapes:

cRM torch.Size([10, 128, 193, 2])
noisy_real torch.Size([10, 257, 193])
noisy_imag torch.Size([10, 257, 193])

However, during validation, we see:

cRM torch.Size([1, 257, 626, 2])
noisy_real torch.Size([1, 257, 626])
noisy_imag torch.Size([1, 257, 626])

Why do dimensions 1 and 2 of the cRM differ from the noisy tensors during training but not during validation?

Without matching shapes, I am unable to compute the enhanced waveform during training, since this calculation fails:

cRM = decompress_cIRM(cRM)

enhanced_real = cRM[..., 0] * noisy_real - cRM[..., 1] * noisy_imag
enhanced_imag = cRM[..., 1] * noisy_real + cRM[..., 0] * noisy_imag
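For reference, the mismatch reproduces with plain NumPy arrays standing in for the tensors (shapes only, random values):

```python
import numpy as np

# Training shapes reported above: cRM has 128 frequency bins,
# while the noisy spectrogram has 257.
cRM = np.random.randn(10, 128, 193, 2)
noisy_real = np.random.randn(10, 257, 193)

failed = False
try:
    enhanced_real = cRM[..., 0] * noisy_real  # (10, 128, 193) * (10, 257, 193)
except ValueError:
    # 128 vs 257 on axis 1, neither is 1, so broadcasting fails
    failed = True
print("training-shape multiply failed:", failed)

# Validation shapes: both tensors have 257 bins, so the product works.
cRM_val = np.random.randn(1, 257, 626, 2)
noisy_val = np.random.randn(1, 257, 626)
enhanced_val = cRM_val[..., 0] * noisy_val
print(enhanced_val.shape)  # (1, 257, 626)
```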
haoxiangsnr commented 2 years ago

During training, an additional drop_band function is applied. Since adjacent sub-band features are similar, FullSubNet drops half of the sub-band features while training the sub-band model. This speeds up training without significant performance degradation. Check here for more details.
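A minimal sketch of the idea (a hypothetical helper, not the repo's exact drop_band implementation): pick a random offset and keep every num_groups-th frequency band, truncating so the output always has num_freqs // num_groups bands, which is how 257 bins become 128 with two groups.

```python
import numpy as np

def drop_band_sketch(spec, num_groups=2, offset=None):
    """Keep every `num_groups`-th frequency band of a
    [batch, freq, time] spectrogram (illustrative only)."""
    if num_groups <= 1:
        return spec
    num_freqs = spec.shape[1]
    if offset is None:
        # Random start so different bands are seen across batches.
        offset = np.random.randint(num_groups)
    kept = num_freqs // num_groups            # e.g. 257 // 2 = 128
    idx = offset + np.arange(kept) * num_groups
    return spec[:, idx, :]

spec = np.random.randn(10, 257, 193)
out = drop_band_sketch(spec, num_groups=2)
print(out.shape)  # (10, 128, 193)
```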

If you want to hack around this during training, the simplest way to get the dimensions you want is to change num_groups_in_drop_band (here) to 1.

jhkonan commented 2 years ago

Thank you for the swift response. This does the trick, but you are right about the speed -- I need to halve my batch size, and training takes twice as long. Could we instead duplicate the sub-band features during training? Or maybe compute the enhancement using a lower-resolution noisy spectrogram?
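For the lower-resolution idea, a sketch of what I mean, assuming the kept band indices were exposed by the band-dropping step (the indices, the stride-2 selection, and decompress_cIRM_sketch below are all assumptions for illustration, not the repo's actual API):

```python
import numpy as np

def decompress_cIRM_sketch(mask, K=10.0, limit=9.9):
    """Hypothetical stand-in for decompress_cIRM: inverts the
    compressed complex-ratio-mask mapping after clipping."""
    mask = np.clip(mask, -limit, limit)
    return -K * np.log((K - mask) / (K + mask))

# Assumption: `idx` are the 128 frequency indices kept during training.
idx = np.arange(0, 256, 2)                 # stride-2 selection, 128 bands
noisy_real = np.random.randn(10, 257, 193)
noisy_imag = np.random.randn(10, 257, 193)
cRM = np.random.randn(10, 128, 193, 2)     # model output on dropped bands

cRM = decompress_cIRM_sketch(cRM)
# Slice the noisy spectrogram with the same indices, then enhance
# only those bands with the usual complex multiplication.
nr, ni = noisy_real[:, idx, :], noisy_imag[:, idx, :]
enhanced_real = cRM[..., 0] * nr - cRM[..., 1] * ni
enhanced_imag = cRM[..., 1] * nr + cRM[..., 0] * ni
print(enhanced_real.shape)  # (10, 128, 193)
```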

I will try to understand this better while training using your suggestion. I appreciate your help.