faroit / CountNet

Deep Neural Network for Speaker Count Estimation
https://www.audiolabs-erlangen.de/resources/2017-CountNet
MIT License

Is the training data available? #6

Open dirk61 opened 4 years ago

dirk61 commented 4 years ago

Hey! I really love your work and I'm wondering whether you could provide the training data you synthesized from the LibriSpeech clean-360 dataset? That would help a lot!

faroit commented 4 years ago

Hi @dirk61, thanks for your interest. Unfortunately the training data is not available, but I am happy to provide further information if needed.

dirk61 commented 4 years ago

Thanks @faroit! For the data derived from the clean-360 dataset, did you simply add these audio files together according to the value of k? Before the transformation to the time-frequency matrix, what else did you do to form the wav files for training? I noticed you mentioned peak normalization in the article. I'm new to the audio-processing field and I wonder how that works. What value is set as the maximum so that peaks above it are normalized?

faroit commented 4 years ago

For the data derived from the clean-360 dataset, did you simply add these audio files together according to the value of k?

No, for the time-domain signals there were a few important steps involved in sampling the data (see the sketch after the example below):

- Randomly choose k speakers from the dataset. Each speaker has 5 utterances.
- Trim each utterance so that silence at the beginning and the end is removed, using a voice activity detection method (I used this one).
- Concatenate each speaker's utterances in random order.
- Pad the concatenated utterances with zeros at the end so that all speakers have the same recording length.
- Finally, mix the tracks to get the final output.

Example:
    A..C = Speaker Id
    1..3 = Utterance Id

    Before padding:
        track1: |---A3---||--A2--||-----A1-----|
        track2: |---B2---||-B1-||--B3--|
        track3: |-------C1------||-C3-||C2|

    After padding:
        track1: |---A3---||--A2--||--A1|
        track2: |---B2---||-B1-||--B3--|
        track3: |-------C1------||-C3-||

  frame count:  |333333333333333333333333
        k: 3
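A minimal Python sketch of the steps listed above (not the exact code used for the paper; the `soundfile` I/O and the crude energy-based trim standing in for the actual VAD are assumptions):

    import random
    import numpy as np
    import soundfile as sf  # assumed I/O library

    def trim_silence(x, threshold_db=-40.0, frame=1024):
        """Crude energy-based trim standing in for the actual VAD."""
        n_frames = int(np.ceil(len(x) / frame))
        rms = np.array([np.sqrt(np.mean(x[i * frame:(i + 1) * frame] ** 2))
                        for i in range(n_frames)])
        active = np.where(rms > 10 ** (threshold_db / 20))[0]
        if len(active) == 0:
            return x
        return x[active[0] * frame:(active[-1] + 1) * frame]

    def make_mixture(utterance_paths_per_speaker):
        """utterance_paths_per_speaker: list of length k with one list of
        utterance file paths (e.g. 5 per speaker) per chosen speaker."""
        tracks = []
        for paths in utterance_paths_per_speaker:
            # trim leading/trailing silence from every utterance
            utterances = [trim_silence(sf.read(p)[0]) for p in paths]
            # append the trimmed utterances in random order
            random.shuffle(utterances)
            tracks.append(np.concatenate(utterances))
        # zero-pad all tracks at the end to the length of the longest one
        max_len = max(len(t) for t in tracks)
        tracks = [np.pad(t, (0, max_len - len(t))) for t in tracks]
        # sum the tracks to obtain the final mixture
        return np.sum(tracks, axis=0)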

Before the transformation to the time-frequency matrix, what else did you do to form the wav files for training?

The mixing was applied by normalizing each track so that the tracks have the same SNR relative to each other, and the final mix was peak normalized.
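Something along these lines, as a rough sketch (the RMS-based gain matching and the `target_rms` value are assumptions about what equal level means here):

    import numpy as np

    def mix_at_equal_level(tracks, target_rms=0.1):
        """Scale each zero-padded track to a common RMS level, then sum."""
        scaled = [t * (target_rms / (np.sqrt(np.mean(t ** 2)) + 1e-12))
                  for t in tracks]
        return np.sum(scaled, axis=0)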

I noticed you mentioned peak normalization in the article. I'm new to the audio-processing field and I wonder how that works. What value is set as the maximum so that peaks above it are normalized?

yes, `out /= np.max(out, axis=0)` does it ;-)
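In other words, the signal is divided by its own peak, so the maximum sample becomes 1. As a small sketch (for a signed signal one would typically divide by the maximum absolute sample so the loudest peak lands at ±1):

    import numpy as np

    def peak_normalize(out):
        """Divide by the largest absolute sample so the peak sits at 1.0."""
        return out / (np.max(np.abs(out)) + 1e-12)  # epsilon guards against silence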

dirk61 commented 4 years ago

Thanks for the detailed explanation! Awesome :)

Each speaker has 5 utterances.

Are these utterances randomly chosen from the clean-360 flac audio files? I noticed in the article you mentioned that the training mixtures all last 10 seconds. The concatenated utterances after padding may last longer than 10 s, so did you just chop them to 10 s?

There's another thing in the article I can't quite understand, which is:

In fact, our method to generate synthetic samples results in an average overlap for k = 2 of 85% and for k = 10 of 55% (based on 5s segments).

If the utterances are randomly chosen, why do these overlap values occur?

faroit commented 3 years ago

@dirk61 sorry for the late reply (feel free to close):

Are these utterances randomly chosen from the clean-360 flac audio files? I noticed in the article you mentioned that the training mixtures all last 10 seconds. The concatenated utterances after padding may last longer than 10 s, so did you just chop them to 10 s?

yes, they were chopped.
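i.e. something as simple as (assuming LibriSpeech's native 16 kHz sample rate):

    def chop(mix, seconds=10, sr=16000):
        """Keep only the first `seconds` seconds of the mixture."""
        return mix[:seconds * sr]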

If the utterances are randomly chosen, why do these overlap values occur?

not sure if I understand correctly: ideally the overlap should be 100% for all k, but since speakers still make pauses between words, the actual overlap is less than that.
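A small sketch of one way such an overlap figure could be computed from per-speaker voice-activity masks of a segment (just an illustrative definition; the exact metric used in the paper may differ):

    import numpy as np

    def overlap_fraction(activity_masks):
        """activity_masks: boolean array of shape (k, n_frames), True where
        a speaker is active. Returns the fraction of non-silent frames in
        which all k speakers are active at the same time."""
        masks = np.asarray(activity_masks, dtype=bool)
        all_active = np.all(masks, axis=0)   # frames where everyone speaks
        any_active = np.any(masks, axis=0)   # frames that are not pure silence
        return all_active.sum() / max(any_active.sum(), 1)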