Open yugeshav opened 3 years ago
Hi, you need to downsample to 16K first.
Does your model have any option to resample the audio data?
Maybe you could use sox
for resampling. Here is an example of how to do it:
sox filename.wav -r 16000 filename_16000.wav
Check this link for more info: https://stackoverflow.com/questions/23980283/sox-resample-and-convert
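If you would rather script this step, here is a minimal Python sketch. It uses only the standard library to read the WAV header and to build the same sox command shown above. The helper names (`wav_sample_rate`, `sox_resample_command`, `resample_if_needed`) are made up for illustration and are not part of FullSubNet, and the sketch assumes a PCM WAV file that the `wave` module can read, plus sox available on your PATH.

```python
import subprocess
import wave

TARGET_SR = 16000  # the rate this thread says the model expects


def wav_sample_rate(path):
    """Return the sample rate stored in a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def sox_resample_command(src, dst, target_sr=TARGET_SR):
    """Build the sox command from the comment above as an argument list."""
    return ["sox", src, "-r", str(target_sr), dst]


def resample_if_needed(src, dst, target_sr=TARGET_SR):
    """Run sox only when the file is not already at the target rate.

    Returns True if sox was invoked, False if the file was left alone.
    """
    if wav_sample_rate(src) == target_sr:
        return False
    subprocess.run(sox_resample_command(src, dst, target_sr), check=True)
    return True
```

For example, `resample_if_needed("filename.wav", "filename_16000.wav")` would call sox for a 48K file and do nothing for a file that is already 16K.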
Sorry, I think you can use the FullSubNet model directly to enhance the 48K wav file at inference time.
Check this line of the project. When loading, Librosa will resample the wav file to 16K, regardless of the original sampling rate.
However, you should note that after enhancement, the saved wav file is 16K.
Thanks for the details. I tried running inference on a 48k audio file and saved the output at 16k, but the speech quality is completely lost, and sometimes there is no speech at all. Is this the expected behavior of your model?
Could you please send me the wav file and the inference config?
The input file is uploaded here: https://drive.google.com/file/d/1UVejws8QuAtDWuA3cyCU6nMNp1Gv2E-L/view?usp=sharing
Code changes are in config/inference/fullsubnet.toml
inherit = "config/common/fullsubnet_inference.toml"

[dataset]
path = "dataset.DNS_INTERSPEECH_inference.Dataset"

[dataset.args]
noisy_dataset = "/root/data_3tb_2/Experiments_Yugesh/Yugesh_FSN/FullSubNet-main/rc14_48k"
limit = false
offset = 0
sr = 48000
In src/inferencer/DNS_INTERSPEECH.py Line 162
op_dir = "/root/data_3tb_2/Experiments_Yugesh/Yugesh_FSN/FullSubNet-main/outputs"
op_dir = op_dir + '/'+name+'.wav'
sf.write(op_dir, enhanced, samplerate=16000)
You will get the correct result by changing sr = 48000 to sr = 16000 in inference/fullsubnet.toml, I presume?
With sr = 48000, Librosa loads wav files by resampling from the original sampling rate (in your case, 48K) to 48K, which means no change. However, the pre-trained model is for 16K wav files.
If you set sr = 16000, Librosa will resample from the original sampling rate (in this case, 48K) to 16K when loading.
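To illustrate why the sr = 48000 setting goes wrong when combined with samplerate=16000 in the write call, here is a small arithmetic sketch (plain Python, not taken from the project). It shows two effects: samples kept at 48K but written with a 16K header play back three times longer, and each STFT bin covers a three times wider frequency range than the model saw in training. The FFT size of 512 is an assumption, inferred only from the 257 frequency bins mentioned later in this thread; the function names are hypothetical.

```python
N_FFT = 512  # assumption: 257 bins = N_FFT // 2 + 1, not confirmed by the project


def clip_duration_seconds(num_samples, sample_rate):
    """Playback duration of a clip when interpreted at the given sample rate."""
    return num_samples / sample_rate


def bin_width_hz(sample_rate, n_fft=N_FFT):
    """Frequency range covered by one STFT bin at the given sample rate."""
    return sample_rate / n_fft


num_samples = 2 * 48000  # a 2-second clip recorded at 48K

print(clip_duration_seconds(num_samples, 48000))  # 2.0 (correct header)
print(clip_duration_seconds(num_samples, 16000))  # 6.0 (48K samples, 16K header)
print(bin_width_hz(16000))  # 31.25 (what a 16K-trained model expects)
print(bin_width_hz(48000))  # 93.75 (what it gets with sr = 48000)
```

With sr = 16000 both mismatches go away, because the samples, the model, and the saved header all agree on 16K.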
Okay, so the FullSubNet model can only process 16k inputs, and if we give it a 48k file, Librosa will take care of the resampling?
Thanks a lot for the detailed info @haoxiangsnr
@yugeshav can you share the pretrained model?
The pre-trained model is here: https://github.com/haoxiangsnr/FullSubNet/releases
@yugeshav which one from the archive/data file should I pick for the best performance?
As per the author, it is fullsubnet.
@yugeshav I changed the input to a 16k sample rate, reshaped it to (1,1,257,-1), and ran a forward pass through the network. The output shape is (1,2,257,-1). Is this the correct way to use it? The output sounds like noise. Or should there be some preprocessing? @haoxiangsnr
@haoxiangsnr
Hello,
Does the FullSubNet model work with a 48k sampling rate at inference time?
Regards Yugesh