BUTSpeechFIT / EEND


Format of input data #5

Closed ooobsidian closed 1 year ago

ooobsidian commented 1 year ago

Excellent code! I see that the input data argument needs to be filled in with "<Kaldi data directory for train set>", but I don't know how to prepare data in the Kaldi format.

I look forward to hearing from you.

fnlandini commented 1 year ago

Hi, thank you for the feedback. What I meant is that you need to pass a directory containing the files reco2dur, rttm, segments, spk2utt, utt2spk, and wav.scp, following the usual format used in the Kaldi toolkit. For telephone data, you can look at this script. However, we recently prepared other examples on public data such as LibriSpeech, and you can find the script here, which I believe is easier to follow because there is a single corpus. Let me know if this helps.

ooobsidian commented 1 year ago

Thank you for your reply. By the way, can EEND be trained in a supervised way directly on the original speech WAV files and RTTM, or does something need to be modified for adaptation? I ask because I noticed that this repository is implemented in PyTorch.

fnlandini commented 1 year ago

Hi, I am not sure I understand, so correct me if I misunderstood you. This model is trained in a fully supervised way, so there is no pre-training that requires later adaptation. However, the model is trained on synthetic data first, and then there is a fine-tuning step where one re-trains the model using a small learning rate and a small development set of real data. This is not related to the model being implemented in PyTorch; it was the same in the original implementation in Chainer.

ooobsidian commented 1 year ago

Thank you for your reply. I mean, can we train without converting the original data to the Kaldi format, or what changes would need to be made to the code?

fnlandini commented 1 year ago

You do not need to convert any data at all. By "Kaldi data directory" I mean that you need to create the files I mentioned before (reco2dur, rttm, segments, spk2utt, utt2spk, wav.scp), which basically list the files and segments of speech that one needs as training data. The audio data themselves are not changed. I will have to point you again to the scripts I mentioned before. You can study them or run them to get a better understanding of how to read your data to produce the needed files.
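
As a rough illustration of what those files contain (this is not the repository's own preparation script), a minimal Python sketch could build them from WAV files plus an RTTM. The function names, the `wavs` dictionary layout, and the speaker/utterance naming scheme below are all assumptions made for this example; only the general Kaldi line formats (`wav.scp`, `segments`, `utt2spk`, `spk2utt`, `reco2dur`) and the RTTM `SPEAKER` line layout are standard. It is worth checking the output against the scripts mentioned above before training.

```python
import os
import wave
from collections import defaultdict


def wav_duration(path):
    """Duration in seconds of a PCM WAV file, using the stdlib wave module."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def parse_rttm(path):
    """Yield (recording_id, start, duration, speaker) for each SPEAKER line."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 8 and fields[0] == "SPEAKER":
                yield fields[1], float(fields[3]), float(fields[4]), fields[7]


def write_lines(path, lines):
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")


def make_data_dir(wavs, rttm_path, out_dir):
    """Write a Kaldi-style data directory.

    wavs: dict mapping recording id -> WAV path (hypothetical input layout).
    The audio itself is never modified; the files only reference it.
    """
    os.makedirs(out_dir, exist_ok=True)
    segments, utt2spk, rttm = [], [], []
    spk2utt = defaultdict(list)
    for reco, start, dur, spk in parse_rttm(rttm_path):
        if reco not in wavs:
            continue  # skip RTTM entries with no matching audio
        end = start + dur
        # Assumed naming: speaker ids are prefixed with the recording id and
        # utterance ids with the speaker id, a common Kaldi convention.
        spk_id = f"{reco}_{spk}"
        utt_id = f"{spk_id}_{round(start * 100):07d}_{round(end * 100):07d}"
        segments.append(f"{utt_id} {reco} {start:.2f} {end:.2f}")
        utt2spk.append(f"{utt_id} {spk_id}")
        spk2utt[spk_id].append(utt_id)
        rttm.append(
            f"SPEAKER {reco} 1 {start:.2f} {dur:.2f} <NA> <NA> {spk} <NA> <NA>"
        )

    def out(name):
        return os.path.join(out_dir, name)

    write_lines(out("wav.scp"), sorted(f"{r} {p}" for r, p in wavs.items()))
    write_lines(
        out("reco2dur"),
        sorted(f"{r} {wav_duration(p):.2f}" for r, p in wavs.items()),
    )
    write_lines(out("segments"), sorted(segments))
    write_lines(out("utt2spk"), sorted(utt2spk))
    write_lines(
        out("spk2utt"),
        sorted(f"{s} {' '.join(sorted(u))}" for s, u in spk2utt.items()),
    )
    write_lines(out("rttm"), rttm)  # kept in input order
```

For instance, `make_data_dir({"rec1": "/path/to/rec1.wav"}, "all.rttm", "data/train")` would create `data/train` with the six files, and that directory is what would be passed as the "Kaldi data directory". Whether the speaker labels in `rttm` should match those in `utt2spk` depends on how the training code reads them, so treat the labeling here as a placeholder.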

ooobsidian commented 1 year ago

Thank you very much for your patience!

fnlandini commented 1 year ago

Closing. Feel free to reopen if you see fit.