SAR-SSL

A Python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer”

Contributions
- Self-supervised learning of spatial acoustic representation (SSL-SAR)
  - first self-supervised learning method in spatial acoustic representation learning and multi-channel audio signal processing
  - designs a cross-channel signal reconstruction pretext task to learn spatial acoustic and spectral pattern information
  - learns useful knowledge that can be transferred to spatial acoustics-related tasks
- Multi-channel audio Conformer (MC-Conformer)
  - unified architecture for both the pretext and downstream tasks
  - learns the local and global properties of spatial acoustics present in the time-frequency domain
  - boosts the performance of both pretext and downstream tasks

Datasets

Source signals: from WSJ0 database
Simulated RIRs: generated by gpuRIR toolbox
Simulated noise: generated by arbitrary noise field generator

Real-world RIRs or microphone signals: from MIR, MeshRIR, DCASE, dEchorate, BUTReverb, ACE, LOCATA, MC-WSJ-AV, LibriCSS, AMIMeeting, AISHELL-4, AliMeeting, RealMAN databases

Dataset	#Room	Microphone Array	#Mic. Pair	#Room x #Source position x #Array position	Noise
MIR	3	Three 8-channel linear arrays	60	3 x 26 x 1	W/o
MeshRIR	1	441 microphones	8874	1 x 32 x 1	W/o
DCASE	9	A 4-channel tetrahedral array (EM32)	3	38530	Ambience
dEchorate	11	Six 5-channel linear arrays	48	11 x 3 x 1	Ambience, babble, white
BUTReverb	9	An 8-channel spherical array	28	51	Ambience
ACE	7	A 2-channel array (Chromebook),	433	7 x 1 x 2	Ambience, babble, fan
		a 3-channel right-angled triangle array (Mobile),
		an 8-channel linear array (Lin8Ch),
		a 32-channel spherical array (EM32)
LOCATA	1	A 15-channel linear array (DICIT),	492	Moving/static	Ambience
		a 12-channel robot array (Robot head),
		a 32-channel spherical array (Eigenmike)
MC-WSJ-AV	3	Two 8-channel linear arrays
LibriCSS	1	A 7-channel circular array
AMIMeeting	3	An 8-channel circular array
AISHELL-4	10	An 8-channel circular array
AliMeeting	21	An 8-channel circular array
RealMAN	32	A 32-channel high-precision array

Quick start

Data generation

1. Download the datasets into folders following the directory structure below

  .-SAR-SSL
  | .-code
  | .-data
  | .-exp
  .-data
    .-SrcSig
    | .-wsj0
    |   .-dt
    |   .-et
    |   .-tr
    .-RIR
    | .-Mesh
    | | .-S32-M441_npy
    | .-MIRDB
    | | .-Impulse_response_Acoustic_Lab_Bar-Ilan_University
    | .-DCASE
    | | .-TAU-SRIR_DB
    | | .-TAU-SNoise_DB
    | .-dEchorate
    | | .-dEchorate_database.csv
    | | .-dEchorate_rir.h5
    | | .-dEchorate_annotations.h5
    | | .-dEchorate_noise_gzip7.hdf5
    | | .-dEchorate_babble_gzip7.hdf5
    | | .-dEchorate_silence_gzip7.hdf5
    | .-BUTReverb
    | | .-RIRs
    | .-ACE
    |   .-RIRN
    |   .-Data
    .-MicSig
      .-LOCATA
        .-dev
        .-eval
      .-MC_WSJ_AV
      .-LibriCSS
      .-AMIMeeting
      .-AISHELL-4
      .-AliMeeting
      .-RealMAN
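
Before running the generation scripts, it can help to confirm that the folders above are in place. The following is a minimal sketch and not part of the repository: `EXPECTED_DIRS` lists a subset of the paths from the tree above, and `missing_dirs` is a hypothetical helper name.

```python
from pathlib import Path

# Subset of the layout shown above, relative to the top-level data folder.
EXPECTED_DIRS = [
    "SrcSig/wsj0/dt",
    "SrcSig/wsj0/et",
    "SrcSig/wsj0/tr",
    "RIR/Mesh/S32-M441_npy",
    "RIR/MIRDB/Impulse_response_Acoustic_Lab_Bar-Ilan_University",
    "RIR/DCASE/TAU-SRIR_DB",
    "RIR/DCASE/TAU-SNoise_DB",
    "RIR/dEchorate",
    "RIR/BUTReverb/RIRs",
    "RIR/ACE/RIRN",
    "RIR/ACE/Data",
    "MicSig/LOCATA/dev",
    "MicSig/LOCATA/eval",
]


def missing_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # "data" is a placeholder; point it at your own top-level data folder.
    for rel in missing_dirs("data"):
        print("missing:", rel)
```

An empty result means every listed folder was found; the dEchorate files and the remaining MicSig datasets are not checked in this sketch.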

2. Generate room impulse responses or microphone signals

Data for simulated experiments

pre-training

python gen_simu.py --mode sig --stage pretrain --data_num 512000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0,1]
python gen_simu.py --mode sig --stage preval --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
python gen_simu.py --mode sig --stage pretest --data_num 4000 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]

some test instances

python gen_simu.py --mode sig --stage pretest_ins_T1000 --data_num 10 --room_sz_range [[5,10],[3,6],[2.5,3]] --T60_range [1.0,1.0] --snr_range [20,20] --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu --gpus [0]
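
The `--snr_range [20,20]` option above fixes the signal-to-noise ratio at 20 dB. As an illustration of what such a target means, and not as a description of what `gen_simu.py` does internally, the sketch below rescales a noise sequence so that the signal-to-noise power ratio equals a requested value in dB. The function names are hypothetical, and plain Python lists are used only to keep the example free of dependencies.

```python
import math


def power(x):
    """Mean squared value of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def scale_noise_to_snr(signal, noise, snr_db):
    """Return noise rescaled so that 10*log10(P_signal / P_noise) equals snr_db."""
    target_noise_power = power(signal) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    return [gain * v for v in noise]


def measure_snr_db(signal, noise):
    """Signal-to-noise ratio in dB."""
    return 10.0 * math.log10(power(signal) / power(noise))


# Toy example: a sine wave and a deterministic pseudo-noise sequence.
sig = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
noi = [((n * 37) % 11 - 5) / 5.0 for n in range(100)]
scaled = scale_noise_to_snr(sig, noi, 20.0)
mixture = [s + v for s, v in zip(sig, scaled)]  # noisy signal at 20 dB SNR
```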

downstream training

python gen_simu_certain_room.py --mode sig --stage train --room_num 1000 --sig_num_each_rir 2 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
python gen_simu_certain_room.py --mode sig --stage val --room_num 20 --sig_num_each_rir 1 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds 
python gen_simu_certain_room.py --mode sig --stage test --room_num 20 --sig_num_each_rir 4 --src_dir ../../../data/SrcSig/wsj0 --save_to ../../data/MicSig/simu_ds

Data for real-world experiments

real-world RIR and noise signals

python gen_real_rir.py --dataset DCASE dEchorate BUTReverb ACE --data_type rir noise --read_dir ../../../data/RIR --save_dir ../../data/RIR/real
python gen_real_rir.py --dataset Mesh MIR --data_type rir --read_dir ../../../data/RIR --save_dir ../../data/RIR/real

microphone signals for pre-training with selected RIRs and noise signals

python gen_sig_from_real_rir.py --stage pretrain --dataset Mesh MIR DCASE dEchorate BUTReverb ACE --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real 
python gen_sig_from_real_rir.py --stage preval --dataset DCASE BUTReverb --src_dir ../../../data/SrcSig/wsj0 --rir_dir ../../../data/RIR/real --save_dir ../../data/MicSig/real

additional RIRs for downstream training

python gen_simu_certain_room.py --mode rir --stage train --room_num 1000 --save_to ../../data/RIR/simu

Pretext Task

1. Preparation

Install: numpy, scipy, soundfile, gpuRIR, etc.

2. Training

Simulated experiments

Pretext task: pre-training

python run_pretrain.py --pretrain --simu-exp --gpu-id 0,

Pretext task: evaluation

# * denotes the time version of the pre-trained model
python run_pretrain.py --test --simu-exp --time * --gpu-id 0,

Downstream task: training

# --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
# --ds-task: TDOA, DRR, T60, C50, or ABS
python run_downstream.py --ds-train --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
python run_downstream.py --ds-train --ds-trainmode scratchLOW --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
python run_downstream.py --ds-train --ds-trainmode lineareval --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,
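
To cover several settings, the commands above can be generated in a loop. This is a sketch only: it builds argument lists from the flags and values listed above and leaves execution to the caller. `build_downstream_cmd` and `sweep` are hypothetical helpers, and `time_version` stands in for the `*` placeholder.

```python
import itertools

NSIMROOMS = [2, 4, 8, 16, 32, 64, 128, 256]
TASKS = ["TDOA", "DRR", "T60", "C50", "ABS"]
TRAINMODES = ["finetune", "scratchLOW", "lineareval"]


def build_downstream_cmd(trainmode, nsimroom, task, time_version, gpu_id="0,"):
    """Assemble one downstream-training command as an argument list."""
    return [
        "python", "run_downstream.py",
        "--ds-train",
        "--ds-trainmode", trainmode,
        "--simu-exp",
        "--ds-nsimroom", str(nsimroom),
        "--ds-task", task,
        "--time", time_version,
        "--gpu-id", gpu_id,
    ]


def sweep(time_version):
    """Yield one command per (train mode, number of rooms, task) combination."""
    for mode, n, task in itertools.product(TRAINMODES, NSIMROOMS, TASKS):
        yield build_downstream_cmd(mode, n, task, time_version)


if __name__ == "__main__":
    # "<time>" is a placeholder for the time version of the pre-trained model.
    for cmd in sweep("<time>"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the folder that contains `run_downstream.py`.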

Stage	Trials	nRooms	nRIRs/Room	nSrcSig/RIR	nMicSig
train	x16	2	50	2	200
	x8	4	50	2	400
	x4	8	50	2	800
	x2	16	50	2	1600
	x1	32	50	2	3200
	x1	64	50	2	6400
	x1	128	50	2	12800
	x1	256	50	2	25600
val	-	20	50	1	1000
test	-	20	50	4	4000
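
The last column of the table follows from the others: nMicSig = nRooms x nRIRs/Room x nSrcSig/RIR. The short check below reproduces the table values; `n_mic_sig` is just an illustrative helper name.

```python
def n_mic_sig(n_rooms, n_rirs_per_room, n_src_per_rir):
    """Number of microphone signals for one row of the table."""
    return n_rooms * n_rirs_per_room * n_src_per_rir


# Training rows: 50 RIRs per room and 2 source signals per RIR.
train = {n: n_mic_sig(n, 50, 2) for n in (2, 4, 8, 16, 32, 64, 128, 256)}
val = n_mic_sig(20, 50, 1)
test = n_mic_sig(20, 50, 4)

for n_rooms, total in train.items():
    print(n_rooms, total)
print(val, test)  # 1000 4000
```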

Downstream task: evaluation

# --ds-nsimroom: 2, 4, 8, 16, 32, 64, 128 or 256
# --ds-task: TDOA, DRR, T60, C50, or ABS
python run_downstream.py --ds-test --ds-trainmode finetune --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
python run_downstream.py --ds-test --ds-trainmode scratchLOW --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0, 
python run_downstream.py --ds-test --ds-trainmode lineareval --simu-exp --ds-nsimroom 8 --ds-task TDOA --time * --gpu-id 0,

Real-world experiments

Pretext task: pre-training

When using real-world data, first train on simulated data with the default cosine-decay learning rate (initialized at 0.001), and then finetune on real-world data with a learning rate of 0.0001.
```
python run_pretrain.py --pretrain --gpu-id 0, 
```
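
For reference, one common form of cosine decay is sketched below. It is an assumption for illustration: only the initial rate (0.001) and the finetuning rate (0.0001) come from the description above. The final rate `lr_min`, the step count, and the function name are hypothetical, and the scheduler used by `run_pretrain.py` may differ.

```python
import math


def cosine_decay_lr(step, total_steps, lr_init=0.001, lr_min=0.0):
    """Learning rate at `step` for a single cosine decay from lr_init to lr_min."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# Stage 1: pre-train on simulated data with cosine decay starting at 0.001.
stage1 = [cosine_decay_lr(s, total_steps=100) for s in range(101)]

# Stage 2: finetune on real-world data with a learning rate of 0.0001.
FINETUNE_LR = 0.0001
```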

Pretext task: evaluation

# * denotes the time version of the pre-trained model
python run_pretrain.py --test --time * --gpu-id 0,

Downstream task: training

# --ds-real-sim-ratio: 1 1, 1 0 or 0 1
python run_downstream.py --ds-train --ds-trainmode finetune --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0, 
python run_downstream.py --ds-train --ds-trainmode scratchLOW --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0,

Downstream task: evaluation

# --ds-real-sim-ratio: 1 1, 1 0 or 0 1
python run_downstream.py --ds-test --ds-trainmode finetune --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0, 
python run_downstream.py --ds-test --ds-trainmode scratchLOW --ds-real-sim-ratio 1 0 --ds-task TDOA --time * --gpu-id 0,

Read downstream results (MAEs of TDOA, DRR, T60, C50, SNR, ABS estimation) from the saved MAT files

python read_dsmat_bslr.py --time *
python read_lossmetric_simdata.py
python read_lossmetric_realdata.py

Trained models
- best_model.tar
- ensemble_model.tar

Others

If "OSError: [Errno 24] Too many open files" occurs, raise the open-file limit by running the following at the command line

  ulimit -n 2048
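
The same limit can also be raised from inside a Python process with the standard `resource` module (Unix only). This is an optional alternative sketched here, not something the repository's scripts are known to do; `raise_open_file_limit` is a hypothetical helper.

```python
import resource


def raise_open_file_limit(wanted=2048):
    """Raise the soft limit on open files to `wanted`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    # Only raise the limit; never lower an already larger or unlimited value.
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit on open files:", raise_open_file_limit())
```

The change applies only to the current process and its children, in the same way that `ulimit -n` applies only to the current shell session.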

Citation

If you find our work useful in your research, please consider citing:

@InProceedings{yang2023sarssl,
    author = "Bing Yang and Xiaofei Li",
    title = "Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer",
    booktitle = "arXiv preprint arXiv:2312.00476",
    year = "2023",
    pages = ""}

Licence

MIT
