Hi @dohuuphu, I am also testing the "infer your own voice" example and I am getting similar results to this. Also, do you know how to use `w3` (i.e., the total word score) to get the scores of the individual words spoken? Whenever I print it, it gives me 50 rows of data, and I am confused about how to convert that to word level.
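To illustrate the word-level question above: a rough sketch of the aggregation, assuming each of the 50 output positions corresponds to one phone and carries the word id of that phone, with padding marked as -1. The names, shapes, and padding convention here are assumptions, not confirmed by the repo.

```python
import numpy as np

def word_level_scores(w3, word_ids):
    """Average per-position w3 outputs over positions that share a word id.

    w3:       (50,) per-position word-score outputs (assumed)
    word_ids: (50,) word id per position, -1 for padding (assumed)
    """
    scores = {}
    for wid in np.unique(word_ids):
        if wid == -1:  # skip padded positions
            continue
        scores[int(wid)] = float(w3[word_ids == wid].mean())
    return scores
```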
Please check: https://github.com/YuanGongND/gopt/issues/16#issuecomment-1478085343
I think there are some problems with Kaldi when generating GOP features. I generated features for the speechocean762 dataset and used them for training, but all scores in the validation phase are nan. If I use the public features (Dropbox link), training works normally. Strangely, I used the public Kaldi ASR model and there was no error when creating the features, so I don't understand why my features are different from theirs.
Hi @dohuuphu,
You use the same ASR model as us, so I assume the GOP features should be similar, at least in their statistics. Could you check the mean/variance/median of your features against ours? In an extreme case, if your features are all nan, then the output will be nan.
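A minimal sketch for such a check, assuming both feature sets have been exported as numpy arrays (the file names are illustrative):

```python
import numpy as np

for name, path in [('own', 'my_tr_feat.npy'), ('public', 'public_tr_feat.npy')]:
    feat = np.load(path)
    print(name,
          'nan count:', np.isnan(feat).sum(),
          'mean:', np.nanmean(feat),
          'variance:', np.nanvar(feat),
          'median:', np.nanmedian(feat))
```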
Also - this work was done around September 2021, and there may have been changes to the Kaldi recipe itself since then. I think we used this commit: https://github.com/kaldi-asr/kaldi/tree/d6198906fbb0e3cfa5fae313c7126a78d8321801/egs/gop_speechocean762.
-Yuan
Hi all, I tried to read gop.ark after I ran the gop_speechocean762 recipe on a single .wav file. The GOP output it gives is fine, i.e., it can clearly identify mispronounced words. The same feats are fed to GOPT, but it gives the same output for every wav. I can use the gop_speechocean762 GOP output for pronunciation, but to get other scores such as fluency or completeness I was taking a look at GOPT.
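For anyone else trying to read gop.ark, a minimal sketch, assuming the kaldiio package (kaldi_io works similarly):

```python
import kaldiio

# each entry is the per-phone GOP output for one utterance
for utt_id, gop in kaldiio.load_ark('gop.ark'):
    print(utt_id, gop.shape, gop.mean())
```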
@amandeepbaberwal Thanks for sharing this. You are correct that the GOP feature itself contains mispronunciation information. However, in our paper, Table I, columns 2 & 3, we compared the phone-level MSE and CORR between GOP with a basic classifier and GOPT; GOPT is noticeably better.
-Yuan
@amandeepbaberwal Also, please be aware that the completeness score gets almost 0 correlation (very bad) for a number of reasons. Please see the paper for details.
@YuanGongND Hi, can you tell me what #15 is about? I am very new to Kaldi and GOPT, and it is surely a complex tool. Can you explain how he solved this problem? I don't understand when he talks about:
In gen_seq_data_phn.py, tr_label_phn or te_label_phn is generated by the phn_dict that is specific to the dataset we want to use. However, the pretrained model is based on SpeechOcean762. When trying to run inference on any other dataset, the model will receive labels specific to the inference dataset, not the SpeechOcean dataset, causing inconsistent inference results.
and also what he meant when he said:
The correct method is to always use the phn_dict generated when training on SpeechOcean762. I will update the inference tutorial if you think it is necessary.
This means you need to use the same phn_dict as the pretrained model for single-wav inference.
For example, /a/ maps to 3 in our model training. But when you pass a single wav to Kaldi without specifying the phn_dict, it will generate a new phn_dict to cover the phones in your single-wav input, e.g., mapping /a/ to another value, say 5. The trained GOPT model will then get confused, and the result will be totally wrong.
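A minimal sketch of pinning the mapping, assuming phn_dict is a plain {phone: id} dict as built in gen_seq_data_phn.py (the file names are illustrative):

```python
import json

# toy mapping for illustration; the real one is built from speechocean762
phn_dict = {'AA': 0, 'AE': 1, 'AH': 2}

# at training time: save the mapping once
with open('phn_dict.json', 'w') as f:
    json.dump(phn_dict, f)

# at inference time: load it instead of rebuilding it from the new data
with open('phn_dict.json') as f:
    phn_dict = json.load(f)

# unseen phones must be handled explicitly rather than silently re-indexed
phn_id = phn_dict.get('AH', -1)
```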
I apologize, but the main purpose of this repo is research: we want to help other people reproduce our results. For engineering problems, especially those not mentioned in the paper, we cannot provide much support, so I cannot provide a further explanation of this.
-Yuan
Does phn_dict mean the text-phone file in the dataset?
I think he means the phn_dict variable in gen_seq_data_phn.py.
Have you fixed this? I am also testing the same wav file and the result is similar; it still returns a high score even though it's a bad one.
Hi, I am very impressed with the results you have achieved in this project, so I want to reproduce them, but I have some problems with the results.
I followed the instructions in "[Tutorial] Infer your own data" to get GOPT features from Kaldi and then ran inference with the pretrained weights. I tried many experiments with my own voice and with voices from speechocean762, but the results are almost the same (e.g., u1: accuracy is always in the range 1.7-1.8 (tensor), ...). I also tested with an incorrect item (a voice that does not match the transcript) and with a voice that has a bad score in speechocean762, but the results are the same as described above. I hope you can give me some advice.
One more thing: I multiply the tensor by 5 to get the score on a ten-point scale, since the tensor values are in the range [0, 2]. Is that the right way?
Example: speaker 9604 of speechocean762. This voice has "accuracy": 3 in scores.json, i.e., a bad voice, but the predicted result is high:
- text-phone: 096040025.0 DH_B EH0_I R_E 096040025.1 W_B AH0_I Z_E 096040025.2 N_B AH1_I TH_I IH0_I NG_E 096040025.3 T_B UW0_E 096040025.4 B_B IY0_E 096040025.5 G_B EY0_I N_I D_E 096040025.6 B_B AY0_E 096040025.7 IH0_B T_E
- spk2age 9604 21
- spk2gender 9604 f
- spk2utt 9604 096040025
- text 096040025 THERE WAS NOTHING TO BE GAINED BY IT
- utt2spk 096040025 9604
- wav 096040025 WAVE/SPEAKER9604/096040025.WAV
=> GOPT features (zip file): feature.zip
=> Result: u1 = tensor([1.7430]), u2 = tensor([1.5640]), ...
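(Under the ×5 scaling mentioned above, u1 = tensor([1.7430]) corresponds to 1.7430 × 5 ≈ 8.7 on the 0-10 scale, while the labeled accuracy is only 3, so the prediction is clearly far too high.)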
Why is it that when I run inference over the whole speechocean dataset, the u1, u2, u3, u4, u5 results are always sky-high, even though the dataset contains low-quality samples? I also used this group's output data, but the result is the same. Could there be a problem with the sample inference code? Have you managed to fix this bug yet?
I'm temporarily not working on this task at the moment, so I haven't fixed it yet. But as I remember, the problem comes from the GOPT feature generation process with Kaldi, especially label_phn. I calculated the mean of the two generated feature sets, one given by the repo (public) and the other generated by me with Kaldi (own). As you can see, there are some differences (highlighted), and that is why I can't reproduce the repo's results.
I think you can debug the Kaldi process to find out what makes them different.
But I tried to run inference on the speechocean dataset using this repo's own output features and the code from the inference instructions, and the result I got is weird. The values of u1, u2, u3, u4, u5 are always high, which is unexpected because there are multiple "bad" examples in the speechocean dataset.
I didn't see any problem with this code when I use the given data, so I have no idea what is wrong in your case. Sorry about that.
Thanks for responding. Which Python version did you use? And did you install the exact same module versions from requirements.txt?
I used Python 3.8.16 and the same library versions as in requirements.txt.
I found that the inference code is missing the normalization step for the features, so the results are totally wrong. You need to add the normalization, as in the validation code, for it to work correctly.
You need to add
def norm_valid(feat, norm_mean=3.203, norm_std=4.045):
as in https://github.com/YuanGongND/gopt/blob/bed909daf8eca035095871e51642525acc5b9b55/src/traintest.py#L351
and
t_input_norm_feat = norm_valid(t_input_feat)
before the features are passed to the model.
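A minimal sketch of the missing piece, assuming plain z-score normalization with the training-set statistics; check the linked traintest.py#L351 for the exact formula the repo uses.

```python
import torch

def norm_valid(feat, norm_mean=3.203, norm_std=4.045):
    # normalize raw Kaldi GOP features with training-set statistics
    # (assumed z-score form; verify against traintest.py#L351)
    return (feat - norm_mean) / norm_std

# t_input_feat: raw GOP feature tensor from the Kaldi recipe
# t_input_norm_feat = norm_valid(t_input_feat)
# ...then pass t_input_norm_feat (with the phone labels) to the GOPT model
```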