Hi there,
Please check the paper, Section 3.1: "We re-scale utterance and word-level scores to 0-2, making them on the same scale as the phoneme scores."
So all scores should be in the range 0-2 (though outliers are possible).
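For reference, a minimal sketch of that rescaling, assuming the raw speechocean762 utterance/word scores are on a 0-10 scale (an assumption; the constants actually used in training are in src/traintest.py, linked below):

```python
def to_phone_scale(score, raw_max=10.0):
    # Map a raw human score in [0, raw_max] onto the 0-2 phone scale.
    # raw_max=10.0 is an assumption for speechocean762 utterance/word scores.
    return score * 2.0 / raw_max

def from_phone_scale(pred, raw_max=10.0):
    # Inverse: map a 0-2 model prediction back to the raw scale,
    # useful when reading model outputs like the ones later in this thread.
    return pred * raw_max / 2.0
```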
- p is the phone-level score; there should be one value for each phone.
- u is the utterance score; the utterance aspects u1-u5 are defined here: https://github.com/YuanGongND/gopt/blob/bed909daf8eca035095871e51642525acc5b9b55/src/traintest.py#L39C5-L39C84
- w is the word score; the word aspects w1-w3 are defined here: https://github.com/YuanGongND/gopt/blob/bed909daf8eca035095871e51642525acc5b9b55/src/traintest.py#L41
-Yuan
Hi, thanks for the clarification. But I still didn't get why p and w1-w3 have a size of 50. Is this the size of the English phone dictionary? Where is it defined?
The word score is propagated to the phone level, i.e., the word scores you get are still at the phone level; they just receive word-level supervision. See https://github.com/YuanGongND/gopt/blob/bed909daf8eca035095871e51642525acc5b9b55/src/prep_data/gen_seq_data_word.py#L31-L55 for how we process that in training.
For inference it is easy: you just need to average the scores within each word. E.g., for word1 (phones 1, 2, 3, 4) and word2 (phones 5, 6), you will get 4 word scores for word 1 and 2 word scores for word 2; average the 4 scores for word 1 and the 2 scores for word 2, as in the sketch below.
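A minimal sketch of that averaging (word_ids is a hypothetical word-to-phone alignment that you must supply from your own forced alignment; it is not produced by the model):

```python
import torch

def average_word_scores(w, word_ids):
    # w        : phone-level word scores for one utterance, shape (num_phones, 1),
    #            e.g. w1[0, :num_phones] with padded positions already removed
    # word_ids : one word index per phone, e.g. [0, 0, 0, 0, 1, 1] for
    #            word1 (phones 1-4) and word2 (phones 5-6)
    ids = torch.tensor(word_ids)
    scores = w.squeeze(-1)
    return torch.stack([scores[ids == k].mean()
                        for k in range(ids.max().item() + 1)])

# Example: 6 phones over 2 words -> 2 word scores
w = torch.tensor([[1.1], [1.2], [1.0], [1.3], [0.8], [0.9]])
print(average_word_scores(w, [0, 0, 0, 0, 1, 1]))  # tensor([1.1500, 0.8500])
```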
50 is the sequence cutoff length; it is not related to the phone vocabulary, which is also not of size 50. You would expect the word-level and phone-level scores to have the same length.
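Concretely, with a cutoff length of 50 and a batch of one utterance, the outputs reported later in this thread have the following shapes (an illustrative sketch; the names follow this thread):

```python
import torch

# Utterance heads: one scalar per aspect (u1-u5), shape (batch, 1)
u1 = torch.zeros(1, 1)
# Phone head: one score per (possibly padded) phone position, shape (batch, 50, 1)
p = torch.zeros(1, 50, 1)
# Word heads (w1-w3): word scores propagated to the phone level, same shape as p
w1 = torch.zeros(1, 50, 1)
```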
Got it. How about the unused phone/word-level scores? Are they junk values?
I cannot recall if the code automatically trims the padded tokens, but you should ignore the scores on the padded tokens.
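For example, a short sketch of discarding the padding, assuming you know the utterance's true phone count from your alignment (num_phones here is hypothetical):

```python
import torch

p = torch.rand(1, 50, 1)   # stand-in for the model's phone-score output
w1 = torch.rand(1, 50, 1)  # stand-in for one of the word-score outputs

num_phones = 21  # hypothetical: the true phone count from your alignment

# Positions >= num_phones are padding; their scores should be ignored.
p_valid = p[0, :num_phones]    # shape (num_phones, 1)
w1_valid = w1[0, :num_phones]  # apply the same trim to w1, w2, w3
```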
Yes, I did that.
After following the inference steps, I got the values below for u1-u5, p, and w1-w3:
u1 = tensor([[1.7443]])
u2 = tensor([[1.5404]])
u3 = tensor([[1.7297]])
u4 = tensor([[1.7074]])
u5 = tensor([[1.7606]])
p = tensor([[[1.1559], [1.2266], [1.2165], [1.1115], [1.1052], [1.1074], [1.0690], [1.2223], [1.0949], [1.1671], [1.0795], [1.2557], [1.0595], [1.1116], [1.1818], [1.1300], [1.2001], [1.1101], [1.1616], [1.0864], [1.1390], [0.7162], [0.8037], [0.8568], [0.8601], [0.8054], [0.8418], [0.8683], [0.7827], [0.8825], [0.6441], [0.7901], [0.7464], [0.6433], [0.8020], [0.8223], [0.7503], [0.7563], [0.8885], [0.8561], [0.8105], [0.8625], [0.8481], [0.8317], [0.8435], [0.8590], [0.8139], [0.7567], [0.8845], [0.8129]]])
w1 = tensor([[[ 0.1104], [ 0.2297], [ 0.2281], [ 0.0758], [ 0.0577], [ 0.1400], [-0.0202], [ 0.1290], [ 0.0133], [ 0.2836], [ 0.0878], [ 0.3509], [ 0.0595], [ 0.0864], [ 0.1327], [ 0.0924], [ 0.1755], [ 0.0542], [ 0.1502], [ 0.0426], [ 0.1247], [ 0.9526], [ 1.0063], [ 1.0826], [ 1.0663], [ 0.9944], [ 1.0674], [ 1.1030], [ 1.0209], [ 1.0798], [ 0.8870], [ 1.0020], [ 0.9713], [ 0.8827], [ 1.0125], [ 1.0476], [ 0.9834], [ 0.9916], [ 1.1105], [ 1.0714], [ 1.0451], [ 1.0725], [ 1.0760], [ 1.0540], [ 1.0640], [ 1.0696], [ 1.0384], [ 0.9810], [ 1.0873], [ 1.0260]]])
w2 = tensor([[[0.6134], [0.7956], [0.9271], [0.6699], [0.5889], [0.6262], [0.4851], [0.6197], [0.5322], [0.9736], [0.7261], [1.0064], [0.5336], [0.6623], [0.6925], [0.6142], [0.7239], [0.5258], [0.6993], [0.5545], [0.7373], [0.9153], [0.9858], [1.0829], [1.0741], [1.0285], [1.0639], [1.0860], [0.9937], [1.1015], [0.8865], [1.0654], [0.9615], [0.9004], [0.9985], [1.0304], [0.9705], [0.9877], [1.0782], [1.0342], [1.0029], [1.0279], [1.0328], [1.0081], [1.0391], [1.0626], [1.0167], [0.9367], [1.0728], [1.0083]]])
w3 = tensor([[[0.9717], [1.0951], [1.1173], [0.9834], [0.9371], [0.9385], [0.8971], [1.0128], [0.9022], [1.1262], [0.9963], [1.1767], [0.9003], [0.9701], [0.9989], [0.9520], [1.0238], [0.9401], [1.0122], [0.9360], [1.0347], [1.0048], [1.0965], [1.1611], [1.1419], [1.1097], [1.1247], [1.1732], [1.0983], [1.1891], [0.9894], [1.1176], [1.0471], [0.9793], [1.0938], [1.1114], [1.0798], [1.0866], [1.2085], [1.1529], [1.0992], [1.1474], [1.1448], [1.1297], [1.1249], [1.1632], [1.1026], [1.0581], [1.1813], [1.1074]]])
Now, how do I interpret this result?