boeddeker closed this 3 years ago
One example now looks as follows (I removed the transcription and the path prefix):
```json
"0_4k6c0303_4k4c0319": {
    "room_dimensions": [[8.169], [5.905], [3.073]],
    "sound_decay_time": 0.387,
    "source_position": [[3.312, 3.0], [1.921, 2.379], [1.557, 1.557]],
    "sensor_position": [[4.015, 3.973, 4.03, 4.129, 4.172, 4.115],
                        [3.265, 3.175, 3.093, 3.102, 3.192, 3.274],
                        [1.55, 1.556, 1.563, 1.563, 1.558, 1.551]],
    "example_id": "0",
    "num_speakers": 2,
    "speaker_id": ["4k6", "4k4"],
    "source_id": ["4k6c0303", "4k4c0319"],
    "gender": ["male", "female"],
    "kaldi_transcription": [
        "...",
        "..."
    ],
    "log_weights": [0.9885484337248203, -0.9885484337248203],
    "num_samples": {
        "original_source": [31633, 93389],
        "observation": 93389
    },
    "offset": [52476, 0],
    "audio_path": {
        "original_source": [
            ".../sms_wsj/cache/wsj_8k_zeromean/13-16.1/wsj1/si_dt_20/4k6/4k6c0303.wav",
            ".../sms_wsj/cache/wsj_8k_zeromean/13-16.1/wsj1/si_dt_20/4k4/4k4c0319.wav"
        ],
        "rir": [
            ".../sms_wsj/cache/rirs/cv_dev93/0/h_0.wav",
            ".../sms_wsj/cache/rirs/cv_dev93/0/h_1.wav"
        ],
        "speech_reverberation_early": [
            ".../sms_wsj/cache/early/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
            ".../sms_wsj/cache/early/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
        ],
        "speech_reverberation_tail": [
            ".../sms_wsj/cache/tail/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
            ".../sms_wsj/cache/tail/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
        ],
        "noise_image": ".../sms_wsj/cache/noise/cv_dev93/0_4k6c0303_4k4c0319.wav",
        "observation": ".../sms_wsj/cache/observation/cv_dev93/0_4k6c0303_4k4c0319.wav",
        "speech_source": [
            ".../sms_wsj/cache/speech_source/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
            ".../sms_wsj/cache/speech_source/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
        ]
    },
    "snr": 23.287502642941252
},
```
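For illustration, the keys under `audio_path` suggest how the observation decomposes. The following is a sketch with synthetic signals; the additive relation (each speaker's early plus tail reverberation image, plus the noise image) is my reading of the key names, not code taken from SMS-WSJ:

```python
import numpy as np

# Synthetic stand-in signals; in practice these would be loaded from the
# wav files listed under "audio_path" (e.g. with soundfile).
num_samples = 93389  # "num_samples" -> "observation" in the example above
rng = np.random.default_rng(0)
early = [rng.standard_normal(num_samples) for _ in range(2)]  # per speaker
tail = [rng.standard_normal(num_samples) for _ in range(2)]   # per speaker
noise = rng.standard_normal(num_samples)

# Assumed relation between the stored signals: the observation is the sum
# of each speaker's early and tail reverberation images plus the noise.
observation = sum(early) + sum(tail) + noise
```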
In my tests it now works.

Note: `speech_source` is in the JSON, but the `AudioReader` will ignore this key, because `original_source` is cheaper to read and computing `speech_source` from it is cheap.
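The cheap on-the-fly computation of `speech_source` from `original_source` can be sketched as follows; `pad_to_observation` is a hypothetical helper name, not the actual SMS-WSJ API:

```python
import numpy as np

def pad_to_observation(original_source, offset, observation_length):
    """Zero-pad an unpadded source signal so it aligns with the mixture.

    Sketch only: prepend `offset` zeros and append zeros up to the
    observation length, matching the "offset" and "num_samples" entries
    in the example JSON.
    """
    padded = np.zeros(observation_length, dtype=original_source.dtype)
    padded[offset:offset + len(original_source)] = original_source
    return padded

# Values from the example above: speaker 0 has 31633 samples, starts at
# offset 52476, and the observation has 93389 samples.
source = np.ones(31633)
speech_source = pad_to_observation(source, offset=52476, observation_length=93389)
```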
First, thanks to @Emrys365 for reporting that we don't write the padded source signal to disk.

The Python code has an option to pad the `source_signal` when an example is computed on demand. The padding is a pretty cheap operation, and skipping it saves disk space. A problem occurs when users just want to use the written files and not our proposed Python code.

This PR addresses this problem and writes the padded source signal to disk. Now we also want to distinguish between the unpadded and the padded source signal. After an internal discussion, we think the following names for them are reasonable:

- `original_source`: The unchanged WSJ signal, except for the sample rate, which should match.
- `speech_source`: The signal that is reverberated and used to create the observation, i.e. the padded `original_source`.

Since most people are interested in `speech_source`, this won't be a breaking change for them. Furthermore, I tried to make the Python code backward compatible.