Audio-WestlakeU / FullSubNet

PyTorch implementation of "FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement."
https://fullsubnet.readthedocs.io/en/latest/
MIT License
535 stars, 153 forks

Does the pretrained model available in the releases folder work with a 48k sampling rate? #7

Open yugeshav opened 3 years ago

yugeshav commented 3 years ago

@haoxiangsnr

Hello,

Does the FullSubNet model work with a 48k sampling rate at inference time?

Regards, Yugesh

haoxiangsnr commented 3 years ago

Hi, you need to downsample to 16K first.

yugeshav commented 3 years ago

Does your model have any option to resample the audio data?

haoxiangsnr commented 3 years ago

Maybe you could use sox for resampling. Here is an example of how to do it:

sox filename.wav -r 16000 filename_16000.wav

Check this link for more info: https://stackoverflow.com/questions/23980283/sox-resample-and-convert

haoxiangsnr commented 3 years ago

Sorry, I think you can use the FullSubNet model directly to enhance a 48K wav file at inference time.

Check this line of the project. When loading, Librosa will resample the wav file to 16K, regardless of the original sampling rate.

However, you should note that after enhancement, the saved wav file is 16K.

yugeshav commented 3 years ago

Thanks for the details. I tried running inference on a 48k audio file and saved the output at 16k, but the speech quality is completely lost, and sometimes there is no speech at all. Is this the expected behavior of your model?

haoxiangsnr commented 3 years ago

Could you please send me the wav file and the inference config?

yugeshav commented 3 years ago

The input file is uploaded at this link: https://drive.google.com/file/d/1UVejws8QuAtDWuA3cyCU6nMNp1Gv2E-L/view?usp=sharing

Code changes are in config/inference/fullsubnet.toml

inherit = "config/common/fullsubnet_inference.toml"

[dataset]
path = "dataset.DNS_INTERSPEECH_inference.Dataset"

[dataset.args]
noisy_dataset = "/root/data_3tb_2/Experiments_Yugesh/Yugesh_FSN/FullSubNet-main/rc14_48k"
limit = false
offset = 0
sr = 48000

In src/inferencer/DNS_INTERSPEECH.py Line 162

op_dir = "/root/data_3tb_2/Experiments_Yugesh/Yugesh_FSN/FullSubNet-main/outputs"
op_dir = op_dir + '/'+name+'.wav'
sf.write(op_dir, enhanced, samplerate=16000)

haoxiangsnr commented 3 years ago

I presume you will get the correct result by changing sr = 48000 to sr = 16000 in inference/fullsubnet.toml.

With sr = 48000, Librosa loads wav files by resampling from the original sampling rate (in your case, 48K) to 48K, which means no change. However, the pre-trained model expects 16K wav files.

If you set sr = 16000, Librosa will load wav files by resampling the original sampling rate (in this case, 48K) to 16K.
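
To illustrate why the mismatch matters, here is a rough sketch of the arithmetic. It assumes a 512-point FFT, an illustrative value chosen because it yields 257 one-sided frequency bins; the real setting lives in the project's config. The helper names are hypothetical and not part of FullSubNet.

```python
# Sketch: how the meaning of each STFT bin changes with the loading sample rate.
# N_FFT = 512 is an assumption for illustration, not read from the project config.
N_FFT = 512
NUM_BINS = N_FFT // 2 + 1  # 257 one-sided frequency bins


def bin_width_hz(sample_rate, n_fft=N_FFT):
    """Frequency spacing between adjacent STFT bins, in Hz."""
    return sample_rate / n_fft


def bin_index_for(freq_hz, sample_rate, n_fft=N_FFT):
    """Index of the STFT bin closest to freq_hz."""
    return round(freq_hz / bin_width_hz(sample_rate, n_fft))


for sr in (16000, 48000):
    print(f"sr={sr}: bin width {bin_width_hz(sr):.2f} Hz, "
          f"1 kHz tone lands in bin {bin_index_for(1000, sr)}")
```

Under these assumptions, a 1 kHz component falls in a much lower bin when the file is loaded at 48 kHz than when it is loaded at 16 kHz, so a model trained on 16 kHz spectrograms would see speech energy in bins where it does not expect it. That is at least consistent with the degraded output described earlier in the thread.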

yugeshav commented 3 years ago

Okay, so the FullSubNet model can only process 16k input, and if we give it 48k audio, Librosa takes care of the resampling?

Thanks a lot for the detailed info @haoxiangsnr
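
A quick way to check which sampling rate a file actually has before running inference is to read its header. The sketch below uses only Python's standard `wave` module, so it handles plain PCM WAV files only. The function names and the `MODEL_SR` constant are illustrative and not part of FullSubNet.

```python
import wave

MODEL_SR = 16000  # rate the pre-trained model expects, per this thread


def wav_sample_rate(path):
    """Return the sampling rate stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, model_sr=MODEL_SR):
    """True if the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != model_sr
```

If `needs_resampling` returns True, keeping sr = 16000 in the inference config lets Librosa resample on load, as described above; resampling beforehand with sox is the other option mentioned in the thread.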

ahmedbahaaeldin commented 3 years ago

@yugeshav, can you share the pretrained model?

yugeshav commented 3 years ago

The pre-trained model is available here: https://github.com/haoxiangsnr/FullSubNet/releases

ahmedbahaaeldin commented 3 years ago

@yugeshav, which one from the archive/data file should I pick for the best performance?

yugeshav commented 3 years ago

As per the author, it is fullsubnet.

ahmedbahaaeldin commented 3 years ago

@yugeshav I changed the input to a 16k sample rate, reshaped it to (1,1,257,-1), and passed it forward through the network. The output shape is (1,2,257,-1). Is this the correct way to use it? The output sounds like noise. Or should there be some preprocessing? @haoxiangsnr
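
One possible explanation, offered as an assumption and not confirmed anywhere in this thread: the FullSubNet paper describes its training target as a complex ideal ratio mask (cIRM), so the two output channels may be the real and imaginary parts of a mask, not audio. Under that assumption, the mask would be multiplied with the noisy complex spectrogram and the result inverted with an iSTFT. The sketch below shows only the per-bin complex multiplication in plain Python. `apply_complex_mask` is a hypothetical helper, and any mask decompression the project applies is left out.

```python
def apply_complex_mask(noisy_real, noisy_imag, mask_real, mask_imag):
    """Multiply one noisy STFT bin by one complex mask value.

    Uses (a + bi) * (c + di) = (ac - bd) + (ad + bc)i, where the noisy bin
    is a + bi and the mask is c + di.
    """
    enhanced_real = mask_real * noisy_real - mask_imag * noisy_imag
    enhanced_imag = mask_real * noisy_imag + mask_imag * noisy_real
    return enhanced_real, enhanced_imag


# Example: a noisy bin of 3 + 4j with a purely real mask of 0.5 is simply halved.
print(apply_complex_mask(3.0, 4.0, 0.5, 0.0))  # (1.5, 2.0)
```

If this reading is right, treating the raw (1,2,257,-1) output as a waveform or as a magnitude spectrogram would produce noise, which matches the symptom described.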