fgnt / sms_wsj

SMS-WSJ: Spatialized Multi-Speaker Wall Street Journal database for multi-channel source separation and recognition
MIT License
110 stars · 25 forks

Write padded source signal to the disk and rename unpadded signal to original_source #9

Closed · boeddeker closed this 3 years ago

boeddeker commented 3 years ago

First, thanks to @Emrys365 for reporting that we don't write the padded source signal to disk.

The Python code has an option to pad the source signal on the fly when an example is computed on demand. The padding is a pretty cheap operation, and skipping the padded files saves disk space.
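The on-demand padding amounts to placing the short utterance inside a zero array of the observation length. A minimal sketch (`pad_source` is a hypothetical helper, not the project's actual function):

```python
import numpy as np

def pad_source(original_source, offset, observation_num_samples):
    """Zero-pad an unpadded source signal to the observation length:
    `offset` zeros in front, the remainder appended at the end."""
    padded = np.zeros(observation_num_samples, dtype=original_source.dtype)
    padded[offset:offset + len(original_source)] = original_source
    return padded
```

With the values from the example below, this would place a 31633-sample utterance at offset 52476 inside a 93389-sample observation.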

A problem occurs when users just want to use the written files without our proposed Python code.

This PR addresses the problem by also writing the padded source signal to disk. With both on disk, we now want to distinguish between the unpadded and the padded source signal.

After an internal discussion, we think the following names for them are reasonable:

- original_source: the unpadded source signal (the raw WSJ utterance)
- speech_source: the padded source signal, aligned with the observation

Since most people are interested in the speech_source, this won't be a breaking change for them. Furthermore, I tried to make the Python code backward compatible.

boeddeker commented 3 years ago

One example now looks as follows (I removed the transcription and the path prefix):


            "0_4k6c0303_4k4c0319": {
                "room_dimensions": [[8.169], [5.905], [3.073]],
                "sound_decay_time": 0.387,
                "source_position": [[3.312, 3.0], [1.921, 2.379], [1.557, 1.557]],
                "sensor_position": [[4.015, 3.973, 4.03, 4.129, 4.172, 4.115],
                                    [3.265, 3.175, 3.093, 3.102, 3.192, 3.274],
                                    [1.55, 1.556, 1.563, 1.563, 1.558, 1.551]],
                "example_id": "0",
                "num_speakers": 2,
                "speaker_id": ["4k6", "4k4"],
                "source_id": ["4k6c0303", "4k4c0319"],
                "gender": ["male", "female"],
                "kaldi_transcription": [
                    "...",
                    "..."
                ],
                "log_weights": [0.9885484337248203, -0.9885484337248203],
                "num_samples": {
                    "original_source": [31633, 93389],
                    "observation": 93389
                },
                "offset": [52476, 0],
                "audio_path": {
                    "original_source": [
                        ".../sms_wsj/cache/wsj_8k_zeromean/13-16.1/wsj1/si_dt_20/4k6/4k6c0303.wav",
                        ".../sms_wsj/cache/wsj_8k_zeromean/13-16.1/wsj1/si_dt_20/4k4/4k4c0319.wav"
                    ],
                    "rir": [
                        ".../sms_wsj/cache/rirs/cv_dev93/0/h_0.wav",
                        ".../sms_wsj/cache/rirs/cv_dev93/0/h_1.wav"
                    ],
                    "speech_reverberation_early": [
                        ".../sms_wsj/cache/early/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
                        ".../sms_wsj/cache/early/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
                    ],
                    "speech_reverberation_tail": [
                        ".../sms_wsj/cache/tail/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
                        ".../sms_wsj/cache/tail/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
                    ],
                    "noise_image": ".../sms_wsj/cache/noise/cv_dev93/0_4k6c0303_4k4c0319.wav",
                    "observation": ".../sms_wsj/cache/observation/cv_dev93/0_4k6c0303_4k4c0319.wav",
                    "speech_source": [
                        ".../sms_wsj/cache/speech_source/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
                        ".../sms_wsj/cache/speech_source/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
                    ]
                },
                "snr": 23.287502642941252
            },
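As a side note on the listed audio files: in SMS-WSJ the observation is the sum of the early and tail reverberation images of both speakers plus the noise image. A minimal sketch with synthetic stand-in arrays (real code would read the wav files from the paths above, which are multi-channel):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 93389  # observation length from the example above

# Synthetic stand-ins for the per-speaker signals listed above.
early = [rng.normal(size=T) for _ in range(2)]
tail = [rng.normal(size=T) for _ in range(2)]
noise = rng.normal(size=T)

# observation = sum of early + tail images over speakers + noise image
observation = np.sum(early, axis=0) + np.sum(tail, axis=0) + noise
```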

In my tests it now works. Note: speech_source is in the json, but the AudioReader will ignore this key, because original_source is cheaper to read and computing speech_source from it is cheap.
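To make that concrete, here is a sketch of what the AudioReader effectively does with the metadata from the example above: read original_source and recreate speech_source from offset and the observation length (the arrays stand in for the wav files; real code would read them from disk):

```python
import numpy as np

# Metadata copied from the example above.
num_samples = {"original_source": [31633, 93389], "observation": 93389}
offset = [52476, 0]

# Stand-ins for the signals read from the original_source wav files.
originals = [np.ones(n) for n in num_samples["original_source"]]

# Recreate speech_source: zero-pad each utterance to the
# observation length, starting at its offset.
speech_source = []
for sig, off in zip(originals, offset):
    padded = np.zeros(num_samples["observation"])
    padded[off:off + len(sig)] = sig
    speech_source.append(padded)
```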