mcernak opened 6 years ago
Hi,
We have two tasks: accent recognition and phoneme recognition. Accordingly, there are two sets of directories: the ones with "101-recog-min" or "101-recognition" in their names contain the data or alignments for the phoneme recognition task, and the ones with "102-class-min" in their names contain the data or alignments for the accent recognition task. The data directories for both tasks are the same, but the alignments differ: for the phoneme recognition task the alignments point to phoneme ids, while for the accent recognition task they point to accent ids.
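For example, one way to see the difference is to dump a few alignments from each task and look at the label ids they contain. This is only a rough sketch: the alignment directory names below are placeholders, not the actual paths in this repo, and it assumes the alignments are stored in the usual Kaldi `ali.*.gz` format.

```bash
#!/usr/bin/env bash
# Sketch only: inspect which label ids the two alignment sets contain.
phone_ali=exp/ali_101_recog   # phoneme-recognition alignments (placeholder path)
accent_ali=exp/ali_102_class  # accent-recognition alignments (placeholder path)

for d in "$phone_ali" "$accent_ali"; do
  echo "== $d =="
  # Dump the integer targets of the first alignment archive and count the
  # distinct ids: phoneme alignments span the whole phone inventory, while
  # accent alignments should only use a handful of accent ids.
  copy-int-vector "ark:gunzip -c $d/ali.1.gz |" ark,t:- 2>/dev/null \
    | cut -d' ' -f2- | tr ' ' '\n' | sort -n | uniq -c | tail -n 5
done
```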
Regarding the data splits:
- "cv_train_nz" is Train-7
- "cv_dev_nz" is Dev-4
- "cv_test_onlynz" is Test-NZ

Please ignore "cv_trainx_nz"; it is the same as "cv_train_nz", i.e. Train-7. It exists only because the data directories for both tasks are the same and the difference comes from the alignments, as stated above.
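If it helps, here is a rough sanity check of the mapping above. The paths are assumptions (they place the split directories under data/101-recog-min), so adjust them to wherever the splits actually live in the repo.

```bash
# List the size of each split (assumed location, adjust as needed).
for s in cv_train_nz cv_trainx_nz cv_dev_nz cv_test_onlynz; do
  printf '%-16s %s utterances\n' "$s" "$(wc -l < data/101-recog-min/$s/wav.scp)"
done

# cv_trainx_nz is a copy of cv_train_nz (Train-7), so the utterance
# lists should be identical:
diff <(sort data/101-recog-min/cv_train_nz/text) \
     <(sort data/101-recog-min/cv_trainx_nz/text) \
  && echo "cv_train_nz and cv_trainx_nz match"
```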
Thanks,
We have also updated the README to make this clearer; it should resolve similar queries in the future.
Hi, thanks.
However, although the splits are now clarified, the scripts still cannot be followed. You use other models, such as /home/abhinav/kaldi/accents/exp. How did you train it, and on what data?
Why didn't you commit the data preparation scripts, including how you created the alignments and so on?
How did you prepare the lang data?
Without data preparation, nothing can be replicated.
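To be concrete, I would expect the repository to contain something along the lines of the standard Kaldi preparation steps below. All paths and model names here are only guesses on my part, not your actual recipe.

```bash
# Generic Kaldi-style preparation of the kind that seems to be missing:
# build the lang directory from a pronunciation dictionary...
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang

# ...and produce alignments from some previously trained model
# (how was the model in /home/abhinav/kaldi/accents/exp trained?).
steps/align_si.sh --nj 8 --cmd run.pl \
  data/cv_train_nz data/lang exp/tri3 exp/tri3_ali_cv_train_nz
```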
Best, Milos
Hi Milos, The published work is actually divided into three repositories, one of which is this one. You can refer to the READMEs of all the repos in sequence, which will give you an overview of what needs to be done. With some prior knowledge, that should be sufficient to reproduce the results. Moreover, we wanted to make the modeling-related scripts available as soon as possible. A clean version of the entire pipeline is in the works and will take some time before being ready for public use. We are working on releasing this soon.
Hello,
Thank you very much for preparing this reproducible research.
You said in your IS2018 paper that you use this data split: https://sites.google.com/view/accentsunearthed-dhvani/ That is, the train, dev, test, testindian, and testnz subsets.
As the data preparation steps are missing, it is hard to guess which data sets you actually used. For example, following `multitask_run_2_base_2.sh`, it is not clear what data is in `data/101-recog-min` and `data/102-cla-min`, or what these subsets are: `cv_train_nz`, `cv_trainx_nz`, `cv_dev_nz`, `cv_test_onlynz`. Could you please make it clear?
Thanks, Milos