This is the code and dataset repo for the Interspeech 2024 paper "Target conversation extraction: Source separation using turn-taking dynamics". The project page and audio samples can be found on our webpage.
The task is to extract a conversation from a noisy environment, given the embedding or enrollment audio of one speaker in that conversation. In other words, this paper tackles the new problem of "who is talking with me?".
For example, in the illustration above, given a clean enrollment audio clip or an embedding for speaker B, we want to extract the conversation between speakers A, B, and C, amidst interference from speaker D.
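Conceptually, the trained model is a function from a noisy mixture plus one speaker's enrollment to the audio of that speaker's whole conversation. A minimal sketch of this interface, assuming a PyTorch model; the function name and tensor shapes are illustrative, not the repo's actual API:

```python
# Minimal sketch of the target conversation extraction interface;
# names and shapes are illustrative only.
import torch

def extract_conversation(model: torch.nn.Module,
                         mixture: torch.Tensor,     # (batch, samples): noisy recording
                         enrollment: torch.Tensor   # (batch, emb_dim): embedding of one speaker
                         ) -> torch.Tensor:
    """Return estimated audio containing every speaker in the target
    conversation (A, B, C), with the interfering speaker (D) suppressed."""
    model.eval()
    with torch.no_grad():
        return model(mixture, enrollment)
```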
We first pre-train the model on synthetic conversational data constructed from non-conversational corpora such as LibriTTS (English) and AISHELL-3 (Mandarin), using the WHAM! dataset for background noise.
The real conversational datasets we use are the AMI Corpus (English) and ASR-RAMC (Mandarin).
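At a high level, a synthetic conversation is built by stitching utterances from a few speakers into alternating turns and mixing in background noise. A simplified sketch of the idea (the actual generation script models turn-taking dynamics far more carefully; the function below is illustrative, not the repo's implementation):

```python
# Simplified sketch: alternate utterances from two speakers to mimic
# turn-taking, then add background noise (e.g., from WHAM!).
import random
import numpy as np

def synthesize_conversation(spk_a_utts, spk_b_utts, noise, n_turns=6, sr=16000):
    turns = []
    for i in range(n_turns):
        utt = random.choice(spk_a_utts if i % 2 == 0 else spk_b_utts)
        turns.append(utt)
        turns.append(np.zeros(int(random.uniform(0.1, 0.5) * sr)))  # inter-turn gap
    conv = np.concatenate(turns)
    noise = np.resize(noise, conv.shape)  # loop/crop the noise to match length
    return conv + 0.1 * noise             # mix at a fixed gain, for illustration
```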
```bash
# generate the AMI-based dataset
python datasets/generate_dataset.py ./datasets/AMI.json $SAVE_FOLDER --n_outputs_train 8000 --n_outputs_val 1000

# generate the ASR-RAMC-based dataset
python datasets/generate_dataset.py ./datasets/ASR.json $SAVE_FOLDER --n_outputs_train 8000 --n_outputs_val 1000

# generate a synthetic dataset using LibriTTS
python convert_AMI2Libri.py --data_dir /scr/Noreverb_ASR/ --save_dir /scr/ASR_Libri --replace_prob 1

# generate a cross-language synthetic dataset
python convert_ASR2aishell1.py --data_dir /scr/Noreverb_ASR --save_dir /scr/ASR2AISHELL --replace_prob 0.5
```
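Our reading of `--replace_prob` (inferred from the commands above, not from the scripts themselves) is that each utterance in a source conversation is replaced by an utterance from the target corpus with this probability, so `1` replaces every utterance and `0.5` yields a mixed, cross-language conversation. As a sketch:

```python
# Illustration of the inferred --replace_prob semantics; this is a
# hypothetical helper, not the repo's implementation.
import random

def replace_utterances(conversation, target_corpus, replace_prob=0.5):
    """Swap each utterance for one from target_corpus with probability replace_prob."""
    return [random.choice(target_corpus) if random.random() < replace_prob else utt
            for utt in conversation]
```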
Synthesized datasets:
- synthetic dataset using LibriTTS: syn_libri.tar
- cross-language synthetic dataset: syn_cross.tar.gz

Augmented datasets:
- augmented dataset for AMI: aug_eng.tar.gz
- augmented dataset for ASR-RAMC: aug_zh.tar.gz

Real-recorded datasets:
- real dataset for AMI: real_eng.tar.gz
- real dataset for ASR-RAMC: real_zh.tar.gz
- Model after the pre-training stage: Pretrain
- Model fine-tuned for English (AMI): En-AMI
- Model fine-tuned for Mandarin (ASR-RAMC): Mn-ASR-RAMC
- Model fine-tuned on the Candor dataset for better real-world demo performance: En-Candor
Our model is based on TF-GridNet. To handle long sequences, we improve efficiency with a chunked LSTM and pooling attention.
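The two efficiency ideas can be sketched as follows, assuming a PyTorch implementation; the module names and shapes are illustrative rather than the exact TF-GridNet modification used in the paper:

```python
# Illustrative sketches of a chunked LSTM and pooling attention;
# not the repo's actual modules.
import torch
import torch.nn as nn

class ChunkedLSTM(nn.Module):
    """Run the LSTM over fixed-size chunks instead of the full sequence,
    so cost grows with the chunk size, not the sequence length."""
    def __init__(self, dim, hidden, chunk_size=100):
        super().__init__()
        self.chunk_size = chunk_size
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)

    def forward(self, x):                          # x: (batch, time, dim)
        b, t, d = x.shape
        pad = (-t) % self.chunk_size               # pad time to a chunk multiple
        x = nn.functional.pad(x, (0, 0, 0, pad))
        chunks = x.view(b * (x.shape[1] // self.chunk_size), self.chunk_size, d)
        out, _ = self.lstm(chunks)                 # each chunk processed independently
        return self.proj(out).view(b, -1, d)[:, :t]

class PoolingAttention(nn.Module):
    """Cross-attend from the full-resolution sequence to a pooled (shorter)
    copy of itself, cutting attention cost from O(T^2) toward O(T * T/p)."""
    def __init__(self, dim, n_heads=4, pool=4):
        super().__init__()
        self.pool = nn.AvgPool1d(pool)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):                          # x: (batch, time, dim)
        kv = self.pool(x.transpose(1, 2)).transpose(1, 2)  # downsampled keys/values
        out, _ = self.attn(x, kv, kv)
        return out
```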
Pre-train the model:

```bash
python src/train.py --config ./config/pretrain.json --run_dir $CHECKPOINT_FOLDER_PRE
```

Fine-tune the conversation model for English:

```bash
python src/train.py --config ./config/finetune_English.json --run_dir $CHECKPOINT_FOLDER_ENG
```

Fine-tune the conversation model for Mandarin:

```bash
python src/train.py --config ./experiment/finetune_Mandarain.json --run_dir $CHECKPOINT_FOLDER_MND
```

Evaluate on the English conversations:

```bash
python eval_conversation.py ./Noreverb_AMI/test/ $CHECKPOINT_FOLDER_ENG --use_cuda
```

Evaluate on the Mandarin conversations:

```bash
python eval_conversation.py ./Noreverb_ASR/test/ $CHECKPOINT_FOLDER_MND --use_cuda
```
The evaluation script outputs per-sample results as a CSV file saved to the "./output" folder. To analyze the output CSV files, run:

```bash
python plot_result.py $CSV_FILE_PATH
```
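For reference, the per-sample analysis amounts to aggregating the CSV columns; a quick pandas sketch (the column names `si_snr_in` and `si_snr_out` are assumptions about the CSV schema, not guaranteed to match the script's output):

```python
# Sketch of the kind of aggregation plot_result.py performs;
# column names are assumed, check the actual CSV header.
import sys
import pandas as pd

df = pd.read_csv(sys.argv[1])
df["si_snr_i"] = df["si_snr_out"] - df["si_snr_in"]  # per-sample improvement
print(df["si_snr_i"].describe())                     # count/mean/std of SI-SNRi
```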