fgnt / sms_wsj

SMS-WSJ: Spatialized Multi-Speaker Wall Street Journal database for multi-channel source separation and recognition
MIT License
110 stars · 25 forks

Write padded source signal to the disk and rename unpadded signal to original_source #9

Closed · boeddeker closed this 3 years ago

boeddeker commented 3 years ago

First, thanks to @Emrys365 for reporting that we don't write the padded source signal to disk.

The Python code has an option to pad the source signal on the fly when an example is computed on demand. The padding is a pretty cheap operation, and skipping the padded files saves disk space.
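The on-demand padding amounts to placing the short utterance inside a zero array of the observation length. A minimal sketch (`pad_source` is a hypothetical helper, not the project's actual function):

```python
import numpy as np

def pad_source(original_source, offset, observation_num_samples):
    """Zero-pad an unpadded source signal to the observation length:
    `offset` zeros in front, the remainder appended at the end."""
    padded = np.zeros(observation_num_samples, dtype=original_source.dtype)
    padded[offset:offset + len(original_source)] = original_source
    return padded
```

With the values from the example below, this would place a 31633-sample utterance at offset 52476 inside a 93389-sample observation.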

A problem occurs when users just want to use the written files without our proposed Python code.

This PR addresses the problem by also writing the padded source signal to disk. With both on disk, we now want to distinguish between the unpadded and the padded source signal.

After an internal discussion, we think the following names for them are reasonable:

- original_source: the unpadded source signal (the raw WSJ utterance)
- speech_source: the padded source signal, aligned with the observation

Since most people are interested in the speech_source, this won't be a breaking change for them. Furthermore, I tried to make the Python code backward compatible.

boeddeker commented 3 years ago

One example now looks as follows (I removed the transcription and the path prefix):


            "0_4k6c0303_4k4c0319": {
                "room_dimensions": [[8.169], [5.905], [3.073]],
                "sound_decay_time": 0.387,
                "source_position": [[3.312, 3.0], [1.921, 2.379], [1.557, 1.557]],
                "sensor_position": [[4.015, 3.973, 4.03, 4.129, 4.172, 4.115],
                                    [3.265, 3.175, 3.093, 3.102, 3.192, 3.274],
                                    [1.55, 1.556, 1.563, 1.563, 1.558, 1.551]],
                "example_id": "0",
                "num_speakers": 2,
                "speaker_id": ["4k6", "4k4"],
                "source_id": ["4k6c0303", "4k4c0319"],
                "gender": ["male", "female"],
                "kaldi_transcription": [
                    "...",
                    "..."
                ],
                "log_weights": [0.9885484337248203, -0.9885484337248203],
                "num_samples": {
                    "original_source": [31633, 93389],
                    "observation": 93389
                },
                "offset": [52476, 0],
                "audio_path": {
                    "original_source": [
                        ".../sms_wsj/cache/wsj_8k_zeromean/13-16.1/wsj1/si_dt_20/4k6/4k6c0303.wav",
                        ".../sms_wsj/cache/wsj_8k_zeromean/13-16.1/wsj1/si_dt_20/4k4/4k4c0319.wav"
                    ],
                    "rir": [
                        ".../sms_wsj/cache/rirs/cv_dev93/0/h_0.wav",
                        ".../sms_wsj/cache/rirs/cv_dev93/0/h_1.wav"
                    ],
                    "speech_reverberation_early": [
                        ".../sms_wsj/cache/early/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
                        ".../sms_wsj/cache/early/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
                    ],
                    "speech_reverberation_tail": [
                        ".../sms_wsj/cache/tail/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
                        ".../sms_wsj/cache/tail/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
                    ],
                    "noise_image": ".../sms_wsj/cache/noise/cv_dev93/0_4k6c0303_4k4c0319.wav",
                    "observation": ".../sms_wsj/cache/observation/cv_dev93/0_4k6c0303_4k4c0319.wav",
                    "speech_source": [
                        ".../sms_wsj/cache/speech_source/cv_dev93/0_4k6c0303_4k4c0319_0.wav",
                        ".../sms_wsj/cache/speech_source/cv_dev93/0_4k6c0303_4k4c0319_1.wav"
                    ]
                },
                "snr": 23.287502642941252
            },
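As a side note on the listed audio files: in SMS-WSJ the observation is the sum of the early and tail reverberation images of both speakers plus the noise image. A minimal sketch with synthetic stand-in arrays (real code would read the wav files from the paths above, which are multi-channel):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 93389  # observation length from the example above

# Synthetic stand-ins for the per-speaker signals listed above.
early = [rng.normal(size=T) for _ in range(2)]
tail = [rng.normal(size=T) for _ in range(2)]
noise = rng.normal(size=T)

# observation = sum of early + tail images over speakers + noise image
observation = np.sum(early, axis=0) + np.sum(tail, axis=0) + noise
```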

In my tests it now works. Note: speech_source is in the json, but the AudioReader will ignore this key, because original_source is cheaper to read and computing speech_source from it is cheap.
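To make that concrete, here is a sketch of what the AudioReader effectively does with the metadata from the example above: read original_source and recreate speech_source from offset and the observation length (the arrays stand in for the wav files; real code would read them from disk):

```python
import numpy as np

# Metadata copied from the example above.
num_samples = {"original_source": [31633, 93389], "observation": 93389}
offset = [52476, 0]

# Stand-ins for the signals read from the original_source wav files.
originals = [np.ones(n) for n in num_samples["original_source"]]

# Recreate speech_source: zero-pad each utterance to the
# observation length, starting at its offset.
speech_source = []
for sig, off in zip(originals, offset):
    padded = np.zeros(num_samples["observation"])
    padded[off:off + len(sig)] = sig
    speech_source.append(padded)
```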