YongyuG / rnnoise_16k

implementation of rnnoise_16k
BSD 3-Clause "New" or "Revised" License

Training with 8khz audio #12

Closed wuwenshan closed 2 years ago

wuwenshan commented 3 years ago

Hi guys,

I would like to know what changes I need to make to train a model with 8 kHz audio. I tried changing some parameters in denoise.c like this:

#define BLOCK_SIZE 8000

#define FRAME_SIZE (20<<FRAME_SIZE_SHIFT)
#define SAMPLE_RATE 8000

#define PITCH_MIN_PERIOD 10
#define PITCH_MAX_PERIOD 128
#define PITCH_FRAME_SIZE 160 

#if SMOOTH_BANDS
//#define NB_BANDS 22
#define NB_BANDS 14
#else
//#define NB_BANDS 21
#define NB_BANDS 13
#endif
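As a sanity check on the NB_BANDS change, here is a minimal sketch that counts how many band edges fall at or below the Nyquist frequency for a given sample rate. It assumes the band-edge table is the `eband5ms` table from upstream rnnoise (edges in units of 200 Hz); the table in this fork's denoise.c may differ, and `bands_for_sample_rate` is a hypothetical helper, not a function from the repository.

```c
#include <assert.h>
#include <stddef.h>

/* Band edges in units of 200 Hz. ASSUMPTION: copied from the eband5ms
   table in upstream rnnoise; check the table in your own denoise.c. */
static const int eband5ms[] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 34, 40, 48, 60, 78, 100
};

/* Hypothetical helper: count the band edges that lie at or below the
   Nyquist frequency (half the sample rate). */
int bands_for_sample_rate(int sample_rate_hz)
{
    size_t n = sizeof(eband5ms) / sizeof(eband5ms[0]);
    int nyquist_hz = sample_rate_hz / 2;
    int count = 0;

    assert(sample_rate_hz > 0);
    for (size_t i = 0; i < n; i++) {
        if (eband5ms[i] * 200 <= nyquist_hz)
            count++;
    }
    return count;
}
```

Under that assumption, `bands_for_sample_rate(8000)` gives 14, which agrees with the NB_BANDS value chosen above for the SMOOTH_BANDS case. Changing NB_BANDS alone may not be enough, though: the band table itself would also have to stop at the 4 kHz edge.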

After compiling, I got a matrix with shape 500000 x 63. The number of features (63) seems normal to me because we have fewer samples, and training goes well: the loss is not high and the model seems to learn something. But in the end the output audio sizzles and there doesn't seem to be any noise reduction.
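The 63 columns can be cross-checked with a little arithmetic. The sketch below assumes the upstream rnnoise layout, where each training row holds the input features, then one gain and one noise value per band, then a single VAD flag, and where `NB_FEATURES` is `NB_BANDS + 3*NB_DELTA_CEPS + 2` with `NB_DELTA_CEPS` equal to 6. The function names are hypothetical and the layout in this fork may differ.

```c
#include <assert.h>

/* ASSUMPTION: value used in upstream rnnoise, taken to be unchanged here. */
#define NB_DELTA_CEPS 6

/* Input features per frame, following the upstream NB_FEATURES formula. */
int nb_features(int nb_bands)
{
    assert(nb_bands > 0);
    return nb_bands + 3 * NB_DELTA_CEPS + 2;
}

/* Columns per training row: features, per-band gains, per-band noise
   values, and one VAD flag (assumed upstream layout). */
int training_columns(int nb_bands)
{
    return nb_features(nb_bands) + 2 * nb_bands + 1;
}
```

With 14 bands this gives 34 input features and 34 + 14 + 14 + 1 = 63 columns, so only the first 34 columns are model inputs. Any script that slices the matrix by fixed column offsets would need the same numbers.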

Sorry for bothering you again @YongyuG. Can you give me some information about the code: how did you manage to convert it to 16 kHz, and do you think it is possible to convert it to 8 kHz?

YongyuG commented 3 years ago

Hi, sorry for the late reply.

  1. It looks like there are not enough training samples. 500000 frames is only about 1.39 hours of audio, assuming 10 ms per frame. You can do some data augmentation, such as adding reverberation or, for each clean sample, mixing it with 10 or 20 randomly selected samples to enlarge your dataset.
  2. An 8 kHz signal has fewer features than a 16 kHz one, so you need to change the model structure to make sure the model can capture features in a higher-dimensional space. The original rnnoise model is far too simple.
  3. I recommend using real-time features generated by an encoder-decoder architecture, with U-Net structures to get a higher dimension. Some papers you can refer to:
     - "DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement"
     - "Attention Wave-U-Net for Speech Enhancement"
     - "A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech"
     - "A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement"

     This is the kind of work we are doing right now, and the performance is better than rnnoise.
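To make the first point concrete, here is a minimal sketch of the dataset-size arithmetic and of the "mix each clean sample with several random samples" augmentation. It assumes each feature row covers one 10 ms frame (true only if FRAME_SIZE works out to 80 samples at 8 kHz) and that the randomly selected samples are noise clips. All function names are hypothetical; none of them come from the repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hours of audio covered by n_frames frames of frame_ms milliseconds each. */
double frames_to_hours(long n_frames, double frame_ms)
{
    return (double)n_frames * frame_ms / 1000.0 / 3600.0;
}

/* Add a scaled noise clip to a clean clip, sample by sample. */
void mix_clips(const float *clean, const float *noise, float noise_gain,
               float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = clean[i] + noise_gain * noise[i];
}

/* For each clean clip, pick per_clean noise clip indices at random.
   picks must hold n_clean * per_clean entries; the indices for clean
   clip c are picks[c * per_clean] .. picks[c * per_clean + per_clean - 1].
   rand() with a modulo is slightly biased, which is fine for a sketch. */
void pick_noise_indices(size_t n_clean, size_t n_noise, size_t per_clean,
                        unsigned seed, size_t *picks)
{
    assert(n_noise > 0);
    srand(seed);
    for (size_t i = 0; i < n_clean * per_clean; i++)
        picks[i] = (size_t)rand() % n_noise;
}
```

Under the 10 ms assumption, `frames_to_hours(500000, 10.0)` comes to roughly 1.39 hours. Calling `pick_noise_indices` with `per_clean` set to 10 or 20 and then `mix_clips` for each pair multiplies the number of training clips by that factor. A fuller pipeline would normally also vary `noise_gain` per mix.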