Confusion about Prepare data for training

hust-cxl commented 3 years ago

Hi, I'm interested in your paper and would like to re-implement your method. I have some questions about preparing the training data with the provided code.

What is speech_meta/farfield_meta? In plan_audio.py, the code suggests that args.farfield and args.speech are each a single file containing many examples, because we want to choose k samples randomly. In my view, speech_meta and farfield_meta are files generated by the previous steps.

However, in plan_farfield.py there is a loop that first lists all *.wav files under the root path, and then looks for a JSON file whose file name (prefix) matches the variable "path". Where is that JSON file? Is it generated by plan_speech.py? If so, plan_speech.py should generate many JSON files, but that would contradict speech_jsons = rnd.choices(speech_elements, k=len(meta['farfield']['srcs'])). Besides, there is no meta['farfield']['srcs'] in plan_farfield.py.

FrancoisGrondin commented 3 years ago

Thank you for your interest in this work.

The speech meta file is generated using this line:

python plan_speech.py --root <root_speech> --json json/speech.json > <speech_meta>

For instance, in my case the train-clean-100 folder of LibriSpeech is located at /media/fgrondin/Data/librispeech/train-clean-100. If I run

python plan_speech.py --root /media/fgrondin/Data/librispeech/train-clean-100/ --json json/speech.json > speech.meta

the file speech.meta contains lines like the following, one per audio file, each with a random time offset and a fixed duration:

...
{"offset": 8.11, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0020.flac"}
{"offset": 1.7, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0037.flac"}
{"offset": 2.1, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0013.flac"}
{"offset": 3.24, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0030.flac"}
{"offset": 5.59, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0016.flac"}
{"offset": 5.02, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0033.flac"}
...
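Since speech.meta holds one JSON object per line, it can be read back as a list of dictionaries. The following is a minimal sketch, not code from the repository: the helper names parse_meta_lines and load_meta are hypothetical, and reading offset and duration as seconds is an assumption based on the values shown above.

```python
import json


def parse_meta_lines(lines):
    """Parse an iterable of text lines holding one JSON object each."""
    # Blank lines are skipped so a trailing newline does not raise an error.
    return [json.loads(line) for line in lines if line.strip()]


def load_meta(path):
    """Read a whole meta file (e.g. speech.meta) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return parse_meta_lines(f)


# Two entries copied from the speech.meta excerpt above.
sample = [
    '{"offset": 8.11, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0020.flac"}',
    '{"offset": 1.7, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0037.flac"}',
]

entries = parse_meta_lines(sample)
for entry in entries:
    # Assumption: each entry selects a segment that starts at `offset`
    # and lasts `duration`, both presumably expressed in seconds.
    print(entry["path"], entry["offset"], entry["duration"])
```

So a single meta file is enough: every line is one candidate example, and a random subset can be drawn from the resulting list.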

The farfield meta file is generated using this line:

python plan_farfield.py --root <root_farfield> --json json/farfield.json > <farfield_meta>

For instance, in my case the path /media/fgrondin/Scratch/steernet/rirs/train/ contains all the RIRs generated with Octave. If I run

python plan_farfield.py --root /media/fgrondin/Scratch/steernet/rirs/train/ --json json/farfield.json > farfield.meta

the file farfield.meta contains lines like the following, each describing one audio scene (room dimensions, reverberation level, microphone positions, source positions, path to the room impulse response signals, etc.):

...
{"room": [5.53, 7.6, 2.82], "beta": 0.48, "speed": 349.3, "fs": 16000, "mics": [[3.769, 5.055, 1.303], [3.772, 5.062, 1.344]], "srcs": [[2.664, 0.509, 1.118], [1.049, 1.592, 0.996]], "noise": 0.00132, "snr": [1.6, -3.3], "gain": [1.21, 0.59], "volume": 0.13, "path": "/media/fgrondin/Scratch/steernet/rirs/train/v/g/vgtenpwqrw.wav"}
{"room": [8.31, 7.66, 3.21], "beta": 0.26, "speed": 340.7, "fs": 16000, "mics": [[5.493, 1.995, 0.703], [5.448, 2.011, 0.743]], "srcs": [[2.463, 2.252, 0.962], [2.696, 4.537, 1.298]], "noise": 0.00131, "snr": [-2.0, 0.5], "gain": [1.48, 1.61], "volume": 0.41, "path": "/media/fgrondin/Scratch/steernet/rirs/train/v/g/vgeurymfut.wav"}
{"room": [6.98, 6.01, 2.25], "beta": 0.36, "speed": 345.2, "fs": 16000, "mics": [[2.819, 2.94, 1.212], [2.716, 2.908, 1.165]], "srcs": [[3.369, 4.165, 1.716], [0.572, 4.356, 1.704]], "noise": 0.00148, "snr": [-2.4, 4.8], "gain": [0.79, 1.91], "volume": 0.97, "path": "/media/fgrondin/Scratch/steernet/rirs/train/v/g/vgjuobuigu.wav"}
{"room": [6.79, 7.51, 2.2], "beta": 0.56, "speed": 343.2, "fs": 16000, "mics": [[3.785, 2.958, 0.728], [3.8, 2.898, 0.818]], "srcs": [[2.633, 0.932, 1.572], [6.02, 5.84, 1.669]], "noise": 0.00061, "snr": [-0.4, 4.4], "gain": [0.81, 1.6], "volume": 0.46, "path": "/media/fgrondin/Scratch/steernet/rirs/train/v/g/vgiddmpyhy.wav"}
...
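Each farfield.meta line carries a "srcs" list, which is presumably where len(meta['farfield']['srcs']) in the question comes from: one speech entry is drawn for each source in the scene. The following is a minimal sketch of that pairing under stated assumptions, not the repository's plan_audio.py. The function name plan_scene and the nested {"farfield": ..., "speech": ...} layout are hypothetical, inferred only from the expression quoted in the question, and the farfield line is shortened to two keys for readability.

```python
import json
import random


def plan_scene(farfield_line, speech_elements, rnd):
    """Pair one farfield scene with one speech entry per source position."""
    farfield = json.loads(farfield_line)
    # random.choices samples with replacement, so the same speech entry
    # may be assigned to more than one source.
    speech = rnd.choices(speech_elements, k=len(farfield["srcs"]))
    return {"farfield": farfield, "speech": speech}


# Shortened version of the first farfield.meta line above (two sources).
farfield_line = (
    '{"srcs": [[2.664, 0.509, 1.118], [1.049, 1.592, 0.996]], '
    '"path": "/media/fgrondin/Scratch/steernet/rirs/train/v/g/vgtenpwqrw.wav"}'
)

# Entries copied from the speech.meta excerpt.
speech_elements = [
    {"offset": 8.11, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0020.flac"},
    {"offset": 1.7, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0037.flac"},
    {"offset": 2.1, "duration": 5.0, "path": "/media/fgrondin/Data/librispeech/train-clean-100/6081/41997/6081-41997-0013.flac"},
]

rnd = random.Random(0)  # seeded generator so the draw is repeatable
meta = plan_scene(farfield_line, speech_elements, rnd)
print(json.dumps(meta))
```

With this layout, meta['farfield']['srcs'] only exists after the two meta files have been combined, which would explain why it does not appear in plan_farfield.py itself.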

Does this help clarify what is going on with the speech.meta and farfield.meta files?