A Python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer”
Contributions
Self-supervised learning of spatial acoustic representation (SSL-SAR)
first self-supervised learning method in spatial acoustic representation learning and multi-channel audio signal processing
designs cross-channel signal reconstruction pretext task to learn the spatial acoustic and the spectral pattern information
learns useful knowledge that can be transferred to the spatial acoustics-related tasks
Multi-channel audio Conformer (MC-Conformer)
unified architecture for both the pretext and downstream tasks
learns the local and global properties of spatial acoustics present in the time-frequency domain
boosts the performance of both pretext and downstream tasks
Real-world RIRs or microphone signals are taken from the MIR, MeshRIR, DCASE, dEchorate, BUTReverb, ACE, LOCATA, MC-WSJ-AV, LibriCSS, AMIMeeting, AISHELL-4, AliMeeting and RealMAN databases.

Datasets | #Room | Microphone Array | #Mic. Pair | #Room x #Source position x #Array position | Noise Type |
---|---|---|---|---|---|
MIR | 3 | Three 8-channel linear arrays | 60 | 3 x 26 x 1 | W/o |
MeshRIR | 1 | 441 microphones | 8874 | 1 x 32 x 1 | W/o |
DCASE | 9 | A 4-channel tetrahedral array (EM32) | 3 | 38530 | Ambience |
dEchorate | 11 | Six 5-channel linear arrays | 48 | 11 x 3 x 1 | Ambience, babble, white |
BUTReverb | 9 | An 8-channel spherical array | 28 | 51 | Ambience |
ACE | 7 | A 2-channel array (Chromebook), a 3-channel right-angled triangle array (Mobile), an 8-channel linear array (Lin8Ch), a 32-channel spherical array (EM32) | 433 | 7 x 1 x 2 | Ambience, babble, fan |
LOCATA | 1 | A 15-channel linear array (DICIT), a 12-channel robot array (Robot head), a 32-channel spherical array (Eigenmike) | 492 | Moving/static | Ambience |
MC-WSJ-AV | 3 | Two 8-channel linear arrays | | | |
LibriCSS | 1 | A 7-channel circular array | | | |
AMIMeeting | 3 | An 8-channel circular array | | | |
AISHELL-4 | 10 | An 8-channel circular array | | | |
AliMeeting | 21 | An 8-channel circular array | | | |
RealMAN | 32 | A 32-channel high-precision array | | | |
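The #Mic. Pair column counts two-microphone combinations. As a minimal sketch, assuming every unordered pair of channels in an array is used, the count for an 8-channel array is 28, which agrees with the BUTReverb row. Most other rows list fewer pairs than this upper bound, so presumably only a subset of pairs is selected there. `all_mic_pairs` is a hypothetical helper and not part of this repository.

```python
from itertools import combinations


def all_mic_pairs(num_channels):
    """Return every unordered pair of channel indices (hypothetical helper)."""
    return list(combinations(range(num_channels), 2))


pairs = all_mic_pairs(8)
print(len(pairs))   # 28
print(pairs[:3])    # [(0, 1), (0, 2), (0, 3)]
```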
1. Download datasets to folders according to the following directory structure
.-SAR-SSL
|  .-code
|  .-data
|  .-exp
.-data
   .-SrcSig
   |  .-wsj0
   |     .-dt
   |     .-et
   |     .-tr
   .-RIR
   |  .-Mesh
   |  |  .-S32-M441_npy
   |  .-MIRDB
   |  |  .-Impulse_response_Acoustic_Lab_Bar-Ilan_University
   |  .-DCASE
   |  |  .-TAU-SRIR_DB
   |  |  .-TAU-SNoise_DB
   |  .-dEchorate
   |  |  .-dEchorate_database.csv
   |  |  .-dEchorate_rir.h5
   |  |  .-dEchorate_annotations.h5
   |  |  .-dEchorate_noise_gzip7.hdf5
   |  |  .-dEchorate_babble_gzip7.hdf5
   |  |  .-dEchorate_silence_gzip7.hdf5
   |  .-BUTReverb
   |  |  .-RIRs
   |  .-ACE
   |     .-RIRN
   |     .-Data
   .-MicSig
      .-LOCATA
      |  .-dev
      |  .-eval
      .-MC_WSJ_AV
      .-LibriCSS
      .-AMIMeeting
      .-AISHELL-4
      .-AliMeeting
      .-RealMAN
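Before generating data, it can help to confirm that the downloaded folders are where the scripts expect them. The following is a minimal sketch, not part of this repository: `find_missing_dirs` and `EXPECTED_DIRS` are hypothetical names, the list covers only a subset of the layout above, and the nesting under the data folder reflects one reading of that layout.

```python
from pathlib import Path

# Relative paths read from the layout above (assumed nesting, subset only).
EXPECTED_DIRS = [
    "SrcSig/wsj0",
    "RIR/Mesh",
    "RIR/MIRDB",
    "RIR/DCASE",
    "RIR/dEchorate",
    "RIR/BUTReverb",
    "RIR/ACE",
    "MicSig/LOCATA",
]


def find_missing_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # "data" is an assumed location; point it at your own data folder.
    for rel in find_missing_dirs("data"):
        print(f"missing: {rel}")
```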
2. Generate room impulse responses or microphone signals
Data for simulated experiments
python gen_simu.py --mode sig --stage pretrain --data_num 512000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0,1]
python gen_simu.py --mode sig --stage preval --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
python gen_simu.py --mode sig --stage pretest --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
python gen_simu.py --mode sig --stage pretest_ins_T1000 --data_num 10 --room_sz_range [[5,10],[3,6],[2.5,3]] --T60_range [1.0,1.0] --snr_range [20,20] --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
python gen_simu_certain_room.py --mode sig --stage train --room_num 1000 --sig_num_each_rir 2 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds
python gen_simu_certain_room.py --mode sig --stage val --room_num 20 --sig_num_each_rir 1 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds
python gen_simu_certain_room.py --mode sig --stage test --room_num 20 --sig_num_each_rir 4 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds
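The first three `gen_simu.py` calls above differ only in stage, number of signals and GPU list, so they can be generated from a small table. This is a sketch under that assumption: `gen_simu_command` and `STAGES` are hypothetical names, the flag values are copied from the commands above, and the code only prints the commands rather than running them.

```python
import shlex

# Stage name -> number of signals, copied from the commands above.
STAGES = {"pretrain": 512000, "preval": 4000, "pretest": 4000}


def gen_simu_command(stage, data_num, gpus="[0]"):
    """Build the argument list for one gen_simu.py call (does not run it)."""
    return [
        "python", "gen_simu.py",
        "--mode", "sig",
        "--stage", stage,
        "--data_num", str(data_num),
        "--src_dir", "../../../data/SrcSig/wsj0",
        "--save_to", "../../data/MicSig/simu",
        "--gpus", gpus,
    ]


for stage, data_num in STAGES.items():
    gpus = "[0,1]" if stage == "pretrain" else "[0]"
    # shlex.join quotes arguments such as [0,1] so the shell passes them unchanged.
    print(shlex.join(gen_simu_command(stage, data_num, gpus)))
```

To execute a command instead of printing it, the argument list can be passed to `subprocess.run(..., check=True)`.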
Data for real-world experiments
python gen_real_rir.py --dataset DCASE dEchorate BUTReverb ACE --data_type rir noise --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
python gen_real_rir.py --dataset Mesh MIR --data_type rir --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
python gen_sig_from_real_rir.py --stage pretrain --dataset Mesh MIR DCASE dEchorate BUTReverb ACE --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real
python gen_sig_from_real_rir.py --stage preval --dataset DCASE BUTReverb --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real
python gen_simu_certain_room.py --mode rir --stage train --room_num 1000 --save_to ../../data/RIR/simu
1. Preparation
2. Training
Simulated experiments
Pretext task: pre-training
python run_pretrain.py --pretrain --simu-exp --gpu-id 0,
Pretext task: evaluation
# * is a placeholder for the time version of the pre-trained model
python run_pretrain.py --test --simu-exp --time * --gpu-id 0,
Downstream task: training
# --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
# --ds-task: TDOA, DRR, T60, C50, or ABS
python run_downstream.py --ds-train --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
python run_downstream.py --ds-train --ds-trainmode scratchLOW --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
python run_downstream.py --ds-train --ds-trainmode lineareval --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
Stage | Trials | nRooms | nRIRs/Room | nSrcSig/RIR | nMicSig |
---|---|---|---|---|---|
train | x16 | 2 | 50 | 2 | 200 |
train | x8 | 4 | 50 | 2 | 400 |
train | x4 | 8 | 50 | 2 | 800 |
train | x2 | 16 | 50 | 2 | 1600 |
train | x1 | 32 | 50 | 2 | 3200 |
train | x1 | 64 | 50 | 2 | 6400 |
train | x1 | 128 | 50 | 2 | 12800 |
train | x1 | 256 | 50 | 2 | 25600 |
val | - | 20 | 50 | 1 | 1000 |
test | - | 20 | 50 | 4 | 4000 |
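The nMicSig column appears to be the product of the three columns before it (rooms, RIRs per room, source signals per RIR). A short check against a few rows of the table, with `n_mic_sig` as a hypothetical helper name:

```python
def n_mic_sig(n_rooms, n_rirs_per_room, n_src_sig_per_rir):
    """Microphone signals = rooms x RIRs per room x source signals per RIR."""
    return n_rooms * n_rirs_per_room * n_src_sig_per_rir


# (stage, nRooms, nRIRs/Room, nSrcSig/RIR, nMicSig) rows copied from the table
rows = [
    ("train", 2, 50, 2, 200),
    ("train", 256, 50, 2, 25600),
    ("val", 20, 50, 1, 1000),
    ("test", 20, 50, 4, 4000),
]
for stage, rooms, rirs, sigs, expected in rows:
    assert n_mic_sig(rooms, rirs, sigs) == expected
    print(stage, rooms, n_mic_sig(rooms, rirs, sigs))
```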
Downstream task: evaluation
# --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
# --ds-task: TDOA, DRR, T60, C50, or ABS
python run_downstream.py --ds-test --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
python run_downstream.py --ds-test --ds-trainmode scratchLOW --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
python run_downstream.py --ds-test --ds-trainmode lineareval --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
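The downstream commands above vary over training mode, task and number of simulated rooms. As a sketch, the full grid can be printed with a small helper. `downstream_command` is a hypothetical name, the flags are copied from the commands above, and `TIME_VERSION` stands in for the value written as `*`.

```python
from itertools import product

TRAIN_MODES = ["finetune", "scratchLOW", "lineareval"]
TASKS = ["TDOA", "DRR", "T60", "C50", "ABS"]


def downstream_command(phase, mode, task, n_sim_room, time_version):
    """Build one run_downstream.py call; phase is 'train' or 'test'."""
    return (
        f"python run_downstream.py --ds-{phase} --ds-trainmode {mode} "
        f"--simu-exp --ds-nsimroom {n_sim_room} --ds-task {task} "
        f"--time {time_version} --gpu-id 0,"
    )


# "TIME_VERSION" is a placeholder for the time version of the pre-trained model.
for mode, task in product(TRAIN_MODES, TASKS):
    print(downstream_command("train", mode, task, 8, "TIME_VERSION"))
```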
Real-world experiments
Pretext task: pre-training
When using real-world data, first train on simulated data with the default cosine-decay learning rate (initialized to 0.001), and then fine-tune on real-world data with a learning rate of 0.0001.
python run_pretrain.py --pretrain --gpu-id 0,
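For reference, a cosine-decay schedule starting at 0.001 can be sketched as below. This only illustrates the general technique: the number of steps, the final learning rate `lr_min`, and whether the fine-tuning rate of 0.0001 is held constant are assumptions, since the text above does not state them, and `cosine_decay_lr` is a hypothetical helper rather than the repository's scheduler.

```python
import math


def cosine_decay_lr(step, total_steps, lr_init=0.001, lr_min=0.0):
    """Cosine-decayed learning rate: lr_init at step 0, lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


total = 100  # assumed number of schedule steps, for illustration only
for step in (0, 50, 100):
    print(step, cosine_decay_lr(step, total))

FINETUNE_LR = 0.0001  # learning rate stated above for the real-world fine-tuning stage
```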
Pretext task: evaluation
# * is a placeholder for the time version of the pre-trained model
python run_pretrain.py --test --time * --gpu-id 0,
Downstream task: training
# --ds-real-sim-ratio: 1 1, 1 0 or 0 1
python run_downstream.py --ds-train --ds-trainmode finetune --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0,
python run_downstream.py --ds-train --ds-trainmode scratchLOW --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0,
Downstream task: evaluation
# --ds-real-sim-ratio: 1 1, 1 0 or 0 1
python run_downstream.py --ds-test --ds-trainmode finetune --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0,
python run_downstream.py --ds-test --ds-trainmode scratchLOW --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0,
Read downstream results (MAEs of TDOA, DRR, T60, C50, SNR and ABS estimation) from the saved .mat files
python read_dsmat_bslr.py --time *
python read_lossmetric_simdata.py
python read_lossmetric_realdata.py
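The reported metric is the mean absolute error (MAE). As a reminder of what that measures, here is a minimal sketch. `mean_absolute_error` is a hypothetical helper, not the code in the scripts above, and the numbers are made up for illustration rather than taken from any experiment.

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over paired values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one value is required")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)


# Made-up T60 values in seconds, purely illustrative (not results from this work).
estimated = [0.50, 0.75, 1.00]
reference = [0.25, 0.75, 1.50]
print(mean_absolute_error(estimated, reference))  # 0.25
```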
Trained models
If "OSError: [Errno 24] Too many open files" occurs, enter the following at the command line:
ulimit -n 2048
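On Unix-like systems the same limit can also be raised from inside a Python process with the standard `resource` module. This is an alternative sketch, not something the repository does: `raise_open_file_limit` is a hypothetical helper, it affects only the current process and its children, and it would need to run before any data loading starts.

```python
import resource


def raise_open_file_limit(wanted=2048):
    """Raise the soft limit on open files towards `wanted`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough
    # The soft limit may not exceed the hard limit.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass  # the OS refused the change; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print(raise_open_file_limit(2048))
```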
If you find our work useful in your research, please consider citing:
@InProceedings{yang2023sarssl,
author = "Bing Yang and Xiaofei Li",
title = "Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer",
booktitle = "arXiv preprint arXiv:2312.00476",
year = "2023",
pages = ""}
License: MIT