Open · dirk61 opened this issue 4 years ago

Hey! I really love your work and I'm wondering whether you can provide the training data you synthesized from the LibriSpeech clean-360 dataset? That would help a lot!
Hi @dirk61, thanks for your interest. Unfortunately, the training data is not available, but I am happy to provide further information if needed.
Thanks @faroit! For the data accessed from the clean-360 dataset, did you simply add these audio files together according to the value of k? Before the transformation to the time-frequency matrix, what else did you do to form the wav files for training? I noticed you mentioned peak normalization in the article. I'm new to the audio-processing field and I wonder how that works. What value is set as the maximum, so that peaks above it are normalized?
> For the data accessed from the clean-360 dataset, did you simply add these audio files together according to the value of k?
No, for the time-domain signals there were a few important steps involved in sampling the data (a code sketch follows the example below):

1. We chose a random set of k speakers from the dataset. Each speaker has 5 utterances.
2. The utterances were trimmed so that silence at the beginning and the end is removed, using a voice activity detection method (I used this one).
3. The trimmed utterances of each speaker were appended in random order.
4. The concatenated utterances were padded with zeros at the end so that all speakers have the same recording length.
5. Finally, the tracks were mixed to get the final output.
Example:

```
A..C = Speaker Id
1..3 = Utterance Id

Before padding:
track1: |---A3---||--A2--||-----A1-----|
track2: |---B2---||-B1-||--B3--|
track3: |-------C1------||-C3-||C2|

After padding:
track1: |---A3---||--A2--||--A1|
track2: |---B2---||-B1-||--B3--|
track3: |-------C1------||-C3-||

frame count: |333333333333333333333333
k: 3
```
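As referenced above, here is a rough sketch of these sampling steps. The function name, the assumption that utterances arrive as already VAD-trimmed numpy arrays, and the padding to the longest track are my own placeholders, not the code used for the paper (which, as the example suggests, may pad or trim to a fixed target length instead):

```python
import numpy as np

def make_tracks(utterances_per_speaker, seed=None):
    """utterances_per_speaker: list of k lists, each holding the VAD-trimmed
    utterances (1-D numpy arrays) of one randomly chosen speaker."""
    rng = np.random.default_rng(seed)
    tracks = []
    for utts in utterances_per_speaker:
        order = rng.permutation(len(utts))                   # random utterance order
        tracks.append(np.concatenate([utts[i] for i in order]))

    # zero-pad every track at the end so all tracks share the same length
    target_len = max(len(t) for t in tracks)
    return [np.pad(t, (0, target_len - len(t))) for t in tracks]
```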
> Before the transformation to the time-frequency matrix, what else did you do to form the wav files for training?
The mixing was applied by normalizing each track so that all tracks have the same SNR relative to each other, and the final mix was peak normalized.
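A minimal sketch of how such a mixing step could look, assuming that "same SNR" means scaling every track to the same RMS level before summing; the exact level matching used for the paper may differ:

```python
import numpy as np

def mix_tracks(tracks, eps=1e-8):
    """Scale each padded track to equal RMS level, sum, and peak-normalize the mix."""
    tracks = np.stack(tracks)                          # shape: (k, n_samples)
    rms = np.sqrt(np.mean(tracks ** 2, axis=1, keepdims=True))
    tracks = tracks / (rms + eps)                      # equal level for every speaker
    mix = tracks.sum(axis=0)                           # sum the k speakers
    return mix / (np.max(np.abs(mix)) + eps)           # peak normalize the result
```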
> I noticed you mentioned peak normalization in the article. I'm new to the audio-processing field and I wonder how that works. What value is set as the maximum, so that peaks above it are normalized?
Yes, `out /= np.max(out, axis=0)` does it ;-)
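To unpack that: peak normalization rescales the whole signal so that its largest sample value becomes 1.0, rather than clipping peaks above some threshold. A slightly more defensive variant (my own sketch, not the paper's code) divides by the maximum absolute value so that negative peaks are handled too:

```python
import numpy as np

def peak_normalize(out, eps=1e-8):
    """Rescale so the largest absolute sample becomes 1.0; nothing is clipped."""
    return out / (np.max(np.abs(out)) + eps)
```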
Thanks for the detailed explanation! Awesome :)
> Each speaker has 5 utterances.

Are these utterances randomly chosen from the clean-360 flac audio files? I noticed in the article you mentioned that the mixtures for training all last 10 seconds. The concatenated utterances after padding may last longer than 10 s, so are they just chopped to 10 s?
There's another thing in the article I can't quite understand, which is:
> In fact, our method to generate synthetic samples results in an average overlap for k = 2 of 85% and for k = 10 of 55% (based on 5 s segments).
If the utterances are randomly chosen, why do these different overlap values occur?
@dirk61 sorry for the late reply (feel free to close):
> Are these utterances randomly chosen from the clean-360 flac audio files? I noticed in the article you mentioned that the mixtures for training all last 10 seconds. The concatenated utterances after padding may last longer than 10 s, so are they just chopped to 10 s?
Yes, they were chopped.
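In code, the chopping could be as simple as the following, assuming a 16 kHz sample rate (the actual rate used for the paper is an assumption on my part):

```python
SR = 16000            # assumed sample rate
TARGET_LEN = 10 * SR  # 10-second training excerpts

def chop(mix):
    """Keep only the first 10 seconds of the mixture."""
    return mix[:TARGET_LEN]
```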
> If the utterances are randomly chosen, why do these different overlap values occur?
Not sure if I understand correctly: ideally the overlap should be 100% for all k, but since speakers still make pauses between words, the actual overlap is less than that.
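To make the overlap figure concrete, here is a hypothetical way such a percentage could be estimated from per-track voice-activity masks; the exact definition used in the paper may differ:

```python
import numpy as np

def overlap_ratio(vad_masks):
    """vad_masks: boolean array of shape (k, n_frames), True where a speaker is active.
    Returns the fraction of active frames in which all k speakers speak simultaneously."""
    vad_masks = np.asarray(vad_masks, dtype=bool)
    all_active = vad_masks.all(axis=0).sum()
    any_active = vad_masks.any(axis=0).sum()
    return all_active / max(any_active, 1)
```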