BUTSpeechFIT / EEND_dataprep


Example input of processing segment part in generate_data.sh #1

kli017 closed 2 years ago

kli017 commented 2 years ago

Hello, interesting work. I want to apply this code to a custom dataset to get the statistics and generate simulated conversation data, but I have trouble understanding some operations in the code. Because I do not have the SRE and Switchboard datasets, I cannot follow the code in lines 87 to 108 (segment processing). I noticed that the awk code is different for each dataset, and I don't know which one suits my data. Could you provide some example input files? My segment file has the format speaker_id wav_id start_time duration, for example:

aaaa wav1 1.0 2.3

fnlandini commented 2 years ago

Hi, thank you for the interest. That is a very good question indeed, and sorry it was not clear from the code. Your format is almost the same as the expected one. As an example, $SEG_LIST_FILE contains lines like

100304-f-sre2006 100304-f-sre2006-kacg-A 0.00 2.20
100304-f-sre2006 100304-f-sre2006-kacg-A 2.67 6.09
100304-f-sre2006 100304-f-sre2006-kacg-A 6.57 10.05
100304-f-sre2006 100304-f-sre2006-kacg-A 10.05 10.72
100304-f-sre2006 100304-f-sre2006-kacg-A 10.80 16.27
100304-f-sre2006 100304-f-sre2006-kacg-A 16.52 22.12
100304-f-sre2006 100304-f-sre2006-kacg-A 22.66 25.15
100304-f-sre2006 100304-f-sre2006-kacg-A 25.34 28.86
100304-f-sre2006 100304-f-sre2006-kacg-A 29.05 29.79
100304-f-sre2006 100304-f-sre2006-kacg-A 30.19 33.55

where the columns are speaker_id, wav_id, start time and end time. So you only need to modify your last column, replacing the duration with the end time (start time plus duration).

The logic in that block of code (lines 87 to 108) gathers the segments and generates train and validation lists so you should only adapt lines 87 to 98.

I hope this helps.
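The conversion described above (duration in the last column replaced by an end time) could be sketched as follows. This is only a minimal illustration, not code from the repository: the function name duration_to_end is hypothetical, and the two-decimal output format is an assumption based on the example lines shown.

```python
def duration_to_end(line: str) -> str:
    """Turn 'speaker_id wav_id start duration' into 'speaker_id wav_id start end'."""
    spk, wav, start, dur = line.split()
    start_s = float(start)
    end_s = start_s + float(dur)  # end time = start time + duration
    # Two decimals is an assumption taken from the example $SEG_LIST_FILE lines.
    return f"{spk} {wav} {start_s:.2f} {end_s:.2f}"


# The sample line from the question above.
print(duration_to_end("aaaa wav1 1.0 2.3"))
```

Applying this to every line of a custom segment file should give a file in the four-column layout described for $SEG_LIST_FILE.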

kli017 commented 2 years ago

Thank you for the help. It's clear now; I will try it with my data.

kli017 commented 2 years ago

@fnlandini Hello, I ran into an error while preparing my custom simulated conversations. Line 125 of conv_generator.py is:

selected_speakers = np.random.choice(speakers, nspks, replace=False)

Here speakers is a list, and I got this error:

File "./conv_generator.py", line 127, in
    speakers, nspks, replace=False)
File "mtrand.pyx", line 904, in numpy.random.mtrand.RandomState.choice
ValueError: a must be 1-dimensional

Someone said it might be caused by the NumPy version. My version is 1.19.2.

kli017 commented 2 years ago

Solved by replacing line 125 with:

index = np.random.choice(len(speakers), nspks, replace=False)
selected_speakers = [speakers[idx] for idx in index]
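A small reproduction may help show why sampling indices works. The thread does not show the actual contents of speakers, so the sketch below assumes each entry is a tuple of equal length; in that case NumPy converts the list to a 2-D array and np.random.choice rejects it. The sample data and the helper name pick_speakers are hypothetical.

```python
import numpy as np

# Hypothetical data: each speaker entry is a tuple, so np.array(speakers) is 2-D.
speakers = [("spk1", "wav1"), ("spk2", "wav2"), ("spk3", "wav3")]
nspks = 2


def pick_speakers(speakers, nspks):
    """Sample nspks distinct entries by drawing indices instead of the entries."""
    index = np.random.choice(len(speakers), nspks, replace=False)
    return [speakers[idx] for idx in index]


try:
    # Sampling the entries directly fails when they form a 2-D array.
    np.random.choice(speakers, nspks, replace=False)
except ValueError as err:
    print("direct sampling failed:", err)

selected_speakers = pick_speakers(speakers, nspks)
print(selected_speakers)
```

Drawing from range(len(speakers)) keeps the sampled array 1-dimensional whatever the entries look like, which is why the workaround does not depend on the structure of each speaker entry.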

Jamiroquai88 commented 2 years ago

@fnlandini Does it make sense to replace this $SEG_LIST_FILE format with loading utt2spk/spk2utt? The format is strangely similar to segments, and it looks like people are confused by this (I was too, as you know). I may be missing some issues with this, so let me know.

fnlandini commented 2 years ago

@Jamiroquai88 $SEG_LIST_FILE has the Kaldi segments format if I'm not mistaken. I'm not sure if I understood the question

Jamiroquai88 commented 2 years ago

In the comment above you said that $SEG_LIST_FILE has the columns speaker_id, wav_id, start and end times, while a segments file has the columns segment_id, wav_id, start, end.

I am just saying that we don't need to create a new file; we could instead use utt2spk/spk2utt to map from segment_id to speaker_id.
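As a rough sketch of this suggestion, lines in the $SEG_LIST_FILE layout could be derived from a Kaldi segments file and utt2spk by looking up each segment_id. This is not code from the repository: the function name build_seg_list and the segment IDs in the example are hypothetical. It assumes the standard Kaldi layouts (segments: segment_id wav_id start end; utt2spk: utterance_id speaker_id).

```python
def build_seg_list(segments_lines, utt2spk_lines):
    """Map each segments line to 'speaker_id wav_id start end' using utt2spk."""
    utt2spk = {}
    for line in utt2spk_lines:
        if line.strip():
            utt_id, spk_id = line.split()
            utt2spk[utt_id] = spk_id

    seg_list = []
    for line in segments_lines:
        if not line.strip():
            continue
        seg_id, wav_id, start, end = line.split()
        # Replace the segment_id column with the speaker it belongs to.
        seg_list.append(f"{utt2spk[seg_id]} {wav_id} {start} {end}")
    return seg_list


# Hypothetical segment IDs; times taken from the example lines in this thread.
segments = [
    "100304-f-sre2006-kacg-A-0000 100304-f-sre2006-kacg-A 0.00 2.20",
    "100304-f-sre2006-kacg-A-0001 100304-f-sre2006-kacg-A 2.67 6.09",
]
utt2spk = [
    "100304-f-sre2006-kacg-A-0000 100304-f-sre2006",
    "100304-f-sre2006-kacg-A-0001 100304-f-sre2006",
]
for row in build_seg_list(segments, utt2spk):
    print(row)
```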

fnlandini commented 2 years ago

@Jamiroquai88 You are right, it would be possible to handle that in the code rather than creating an extra file. For the time being I'll keep it as is, but thanks for the suggestion. I'll try to fix it in the future, as others might find it strange as well.

kli017 commented 2 years ago

@fnlandini Hello, in the $SEG_LIST_FILE example you gave, does each wav_id contain only one speaker?

fnlandini commented 2 years ago

@kli017 Yes, it is expected to have one speaker per wav

kli017 commented 2 years ago

@fnlandini ok thank you for the quick reply 👍