====================================================================================
Two-speaker speech separation with BLSTM and PIT
Author: aishoot, EECS, Peking University
Github: https://github.com/aishoot/LSTM_PIT_Speech_Separation
Created in: June 2018
====================================================================================
Progress in multitalker mixed speech separation and recognition, often referred to as the "cocktail-party problem", has been less impressive. Although human listeners can easily perceive separate sources in an acoustic mixture, the same task is extremely difficult for computers, especially when only a single microphone records the mixed speech.
Note: The training set and the validation set contain two-speaker mixtures generated by randomly selecting speakers and utterances from the WSJ0 set and mixing them at signal-to-noise ratios (SNRs) chosen uniformly between -2.5 dB and 2.5 dB.
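The mixing step described above can be sketched as follows. This is a minimal illustration, not the `create-speaker-mixtures` code: the function names `rms` and `mix_pair` are hypothetical, the RMS normalization is an assumption, and the symmetric +g/-g dB scaling is inferred from the opposite-signed values in the mixture file names listed in this README. Random noise stands in for WSJ0 utterances.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_pair(s1, s2, gain_db):
    """Scale s1 by +gain_db and s2 by -gain_db (relative to unit RMS), then sum.

    Assumption: both sources are truncated to the shorter length and
    normalized to unit RMS before the gains are applied.
    """
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    g1 = 10 ** (gain_db / 20) / rms(s1)
    g2 = 10 ** (-gain_db / 20) / rms(s2)
    t1 = [g1 * v for v in s1]
    t2 = [g2 * v for v in s2]
    mix = [a + b for a, b in zip(t1, t2)]
    return mix, t1, t2


rng = random.Random(0)
s1 = [rng.gauss(0.0, 1.0) for _ in range(8000)]  # stand-in for utterance 1
s2 = [rng.gauss(0.0, 0.3) for _ in range(6000)]  # stand-in for utterance 2
gain_db = rng.uniform(-2.5, 2.5)                 # range stated in this README
mix, t1, t2 = mix_pair(s1, s2, gain_db)

# Level difference between the two scaled sources, in dB.
ratio_db = 20 * math.log10(rms(t1) / rms(t2))
print(f"gain = {gain_db:+.4f} dB, level difference = {ratio_db:+.4f} dB")
```

Under these assumptions the level difference between the two scaled sources is twice the drawn gain, and the scaled sources `t1` and `t2` are what a separation model would be trained to recover from `mix`.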
The separation performance of LSTM is as follows:
Gender Combination | SDR | SAR | SIR | STOI | ESTOI | PESQ |
---|---|---|---|---|---|---|
Overall | 6.453328 | 9.372059 | 11.570311 | 0.473229 | 0.377204 | 1.5812 |
Male & Female | 8.238905 | 9.939668 | 14.531649 | 0.488542 | 0.393999 | 1.663442 |
Female & Female | 3.538810 | 8.134054 | 7.230494 | 0.459762 | 0.363213 | 1.478075 |
Male & Male | 5.011563 | 9.026763 | 9.000010 | 0.456667 | 0.358757 | 1.602058 |
The separation performance of BLSTM is as follows:
Gender Combination | SDR | SAR | SIR | STOI | ESTOI | PESQ |
---|---|---|---|---|---|---|
Overall | 9.177447 | 10.629142 | 16.116564 | 0.536987 | 0.429255 | 1.65339 |
Male & Female | 10.647645 | 11.691969 | 18.203052 | 0.521656 | 0.421868 | 1.731112 |
Female & Female | 7.309365 | 9.393608 | 13.355384 | 0.560099 | 0.441704 | 1.553452 |
Male & Male | 7.797448 | 9.589827 | 14.198003 | 0.550071 | 0.435083 | 1.675609 |
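From the results above we can see that BLSTM outperforms LSTM on every metric and for every gender combination. Mixed-gender mixtures are separated better than same-gender mixtures in terms of SDR, SAR, SIR and PESQ, although for BLSTM the STOI and ESTOI scores are slightly higher on same-gender mixtures.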
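To make the comparison concrete, the SDR gap between the two models can be computed directly from the two tables. The snippet below only restates the reported SDR values and subtracts them; the variable names are illustrative, and the unit is assumed to be dB, as is conventional for SDR.

```python
# SDR values copied from the LSTM and BLSTM tables in this README.
lstm_sdr = {
    "Overall": 6.453328,
    "Male & Female": 8.238905,
    "Female & Female": 3.538810,
    "Male & Male": 5.011563,
}
blstm_sdr = {
    "Overall": 9.177447,
    "Male & Female": 10.647645,
    "Female & Female": 7.309365,
    "Male & Male": 7.797448,
}

# Improvement of BLSTM over LSTM per gender combination.
sdr_gain = {k: round(blstm_sdr[k] - lstm_sdr[k], 6) for k in lstm_sdr}

for name, gain in sdr_gain.items():
    print(f"{name:16s} +{gain:.6f}")
```

The largest improvement is on Female & Female mixtures, which is also the hardest condition for the LSTM model.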
Once you have the WSJ0 data, you can use the code in "create-speaker-mixtures-V1/V2" to create the mixed speech. We mixed the two-speaker audio at a sample rate of 8000 Hz.
bash run.sh
which contains three steps:
storage/
├── lists
│   ├── cv_tf.lst
│   ├── cv_wav.lst
│   ├── tr_tf.lst
│   ├── tr_wav.lst
│   ├── tt_tf.lst
│   └── tt_wav.lst
├── separated
├── TFCheckpoint
└── tfrecords
    ├── cv_tfrecord
    │   ├── 01aa010k_1.3053_01po0310_-1.3053.tfrecords
    │   ├── 01aa010p_0.93798_02bo0311_-0.93798.tfrecords
    │   ├── ...
    │   └── 409o0317_1.2437_025c0217_-1.2437.tfrecords
    ├── tr_tfrecord
    │   ├── 01aa010b_0.97482_209a010p_-0.97482.tfrecords
    │   ├── 01aa010b_1.4476_20aa010p_-1.4476.tfrecords
    │   ├── ...
    │   └── 409o0316_1.3942_20oo010p_-1.3942.tfrecords
    └── tt_tfrecord
        ├── 050a050a_0.032494_446o030v_-0.032494.tfrecords
        ├── 050a050a_1.7521_422c020j_-1.7521.tfrecords
        ├── ...
        └── 447o0312_2.0302_440c0206_-2.0302.tfrecords
Note: {tr,cv,tt}_wav.lst looks like this:
447o030v_0.1232_050c0109_-0.1232.wav
447o030v_1.7882_444o0310_-1.7882.wav
...
447o030x_0.98832_441o0308_-0.98832.wav
447o030x_1.4783_422o030p_-1.4783.wav
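Each entry encodes both source utterances and a pair of opposite-signed numbers. A minimal sketch for splitting such a name into its fields is shown below. The function `parse_mixture_name` and the field names are hypothetical; treating the numbers as gains in dB and the first three characters of a WSJ0 utterance ID as the speaker ID are assumptions.

```python
import posixpath


def parse_mixture_name(name):
    """Split '<utt1>_<gain1>_<utt2>_<gain2>.<ext>' into its four fields."""
    # splitext cuts at the last dot, so decimal points in the gains survive.
    stem, _ext = posixpath.splitext(posixpath.basename(name))
    utt1, gain1, utt2, gain2 = stem.split("_")
    return {
        "utt1": utt1,
        "gain1_db": float(gain1),  # assumed to be a level in dB
        "utt2": utt2,
        "gain2_db": float(gain2),  # assumed to be a level in dB
        "spk1": utt1[:3],          # assumed WSJ0 speaker-ID prefix
        "spk2": utt2[:3],
    }


info = parse_mixture_name("447o030v_0.1232_050c0109_-0.1232.wav")
print(info)
```

The same function also accepts the `.tfrecords` paths, since only the base name without its extension is parsed.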
And {tr,cv,tt}_tf.lst looks like this:
storage/tfrecords/cv_tfrecord/011o031b_1.8_206a010u_-1.8.tfrecords
storage/tfrecords/cv_tfrecord/20ec0109_0.47371_020c020q_-0.47371.tfrecords
...
storage/tfrecords/cv_tfrecord/01zo030l_0.6242_40ho030s_-0.6242.tfrecords
storage/tfrecords/cv_tfrecord/20fo0109_1.1429_017o030p_-1.1429.tfrecords
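The two list formats differ only in directory and extension, so a `*_tf.lst` entry can be derived from a `*_wav.lst` entry. The helper below is a hypothetical sketch of that mapping, not the code that `run.sh` executes; the default root `storage/tfrecords` is taken from the directory tree shown in this README.

```python
import posixpath


def tfrecord_path(wav_name, split, root="storage/tfrecords"):
    """Map a mixture wav name to the tfrecords path for split 'tr', 'cv' or 'tt'."""
    if split not in ("tr", "cv", "tt"):
        raise ValueError(f"unknown split: {split!r}")
    # Drop any directory part and the '.wav' extension, keep the mixture name.
    base = posixpath.splitext(posixpath.basename(wav_name))[0]
    return posixpath.join(root, f"{split}_tfrecord", base + ".tfrecords")


wav_entries = [
    "011o031b_1.8_206a010u_-1.8.wav",
    "20ec0109_0.47371_020c020q_-0.47371.wav",
]
tf_entries = [tfrecord_path(w, "cv") for w in wav_entries]
print("\n".join(tf_entries))
```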
Figures (images not reproduced here): mixed speech, masks, recovered speech 1, recovered speech 2.
Thanks to Dong Yu et al. for the paper and to Sining Sun (Northwestern Polytechnical University, China) et al. for sharing their code.
Thanks for your attention!