frank613 / CTC-based-GOP

This repo is related to the paper "A Framework for Phoneme-Level Pronunciation Assessment Using CTC", presented at INTERSPEECH 2024.

Questions about lpp and lpr Variables and Their Use for Phone Scores #4

Open teinhonglo opened 1 week ago

teinhonglo commented 1 week ago

Hi, thank you for sharing this project! I have a couple of questions about the code at line 194:

Could you clarify the difference between lpp and lpr in this part of the code?

If I want to assign a score to a phone based on the output of this function, which value (or combination of values) would be appropriate to use? Should it be based on lpp, lpr, or some other metric derived from these variables?

Any guidance or additional documentation would be greatly appreciated.

Thanks in advance for your help! Tien-Hong

frank613 commented 1 week ago

Hi Tien-Hong,

I noticed that the version of the script in the repo was wrong: it only computed the LPR for the deletion case. I have just pushed the corrected script.

Regarding your question: if you only need "a score", then you should probably use "gop-ctc-af-SD.py", which generates one score (a number in the range [-inf, 0]) for each phoneme. The script you pointed out, however, generates a high-dimensional "feature vector" for each phone, where each vector has a length of, in this case, |LPP| + |LPR| = 41. The feature vector needs to be used as a whole, together with a model that takes all the features as input, for example the "support vector regression" baseline model in the paper.
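To illustrate the feature-vector route, here is a hypothetical sketch (not code from this repo): it assumes you have already collected one 41-dimensional vector per phone plus a human score per phone, and fits scikit-learn's SVR as a stand-in for the paper's baseline regressor. The random arrays are placeholders only.

```python
# Hypothetical sketch: score phones by feeding the |LPP| + |LPR| = 41-dim
# feature vectors (one per phone) into a support vector regression model.
# Random data stands in for real features and human pronunciation scores.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 41))    # one 41-dim feature vector per phone
y = rng.uniform(0, 2, size=200)   # human phone-level scores, e.g. on a 0-2 scale

model = SVR(kernel="rbf", C=1.0)  # baseline regressor, as in the paper
model.fit(X, y)

scores = model.predict(X[:5])     # predicted phone-level scores
print(scores.shape)
```

The key point is that the 41 dimensions are consumed jointly by the model; no single dimension is "the score" on its own.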

Thanks, Xinwei

frank613 commented 1 week ago

Hi Tien-Hong,

Sorry, I spotted another error in the script. Please wait a while before looking into it.

Thanks, Xinwei

frank613 commented 1 week ago

Hi Tien-hong,

Thanks again for helping us notice the error. I have now fixed it and added more comments to the script "gop-af-feats.py" illustrating the steps for computing LPP and the LPRs. Feel free to ask if anything is unclear.

Regards, Xinwei

teinhonglo commented 1 week ago

Hi Xinwei,

Thank you for your prompt response.

I have three additional questions regarding the code:

  1. In generate-GOP-features/gop-af-feats.py, could you clarify which values correspond to LPP and LPR?

  2. In the generate-GOP/ directory, there are three files (gop-ctc-af-S.py, gop-ctc-af-SD.py, and gop-ctc-af-SDI.py). Could you explain the differences between them? (Substitution, Deletion, and Insertion?)

  3. What is the difference between the CTC loss implementation in gop-ctc-af-SD.py and the standard PyTorch CTC loss?

Thank you for your time and assistance. Tien-Hong

frank613 commented 1 week ago

Hi Tien Hong,

  1. Did you update to the new version? I added some comments (Line 112 and Line 123) in the new version to specify where LPP and LPR are computed.

  2. The differences are discussed in Sec. 3.2 of the paper. For now, it is suggested to use only gop-ctc-af-SD.py.

  3. The numerator is computed with the standard CTC loss (the non-normalised version).
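As a sketch of that numerator (toy shapes and phoneme IDs, not the repo's actual code): in PyTorch, the unnormalised CTC negative log-likelihood of the canonical phoneme sequence is what `torch.nn.functional.ctc_loss` returns with `reduction="sum"`, i.e. no division by the target length.

```python
# Sketch: the numerator term as the unnormalised CTC loss,
# -log p(canonical transcript | acoustics), via PyTorch's standard CTC loss.
import torch
import torch.nn.functional as F

T, C = 50, 42                        # frames, output tokens (toy: 41 phonemes + blank)
log_probs = F.log_softmax(torch.randn(T, 1, C), dim=-1)  # (T, batch, C)
target = torch.tensor([[5, 12, 7]])  # canonical phoneme-ID sequence (toy values)

nll = F.ctc_loss(
    log_probs, target,
    input_lengths=torch.tensor([T]),
    target_lengths=torch.tensor([3]),
    reduction="sum",                 # unnormalised: no length normalisation
)
log_p_numerator = -nll               # log-probability used in the numerator
```

The "non-normalised" part matters: with the default `reduction="mean"`, PyTorch divides by the target length, which would change the scale of the resulting score.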

Thanks, Xinwei

teinhonglo commented 1 week ago

Hi Xinwei,

Thank you for your previous response. I will review the updated details.

However, while using gop-ctc-af-SD.py to compute scores for some completely correct sentences, I observed that the scores for certain phones—particularly those at the end of the sentence—are unexpectedly low.

Do you have any suggestions on how to address this issue?

The following results are based on the sentence: "Something good just happened."

 'GOP': [['Something',
          [['s', -0.015057360255148922],
           ['ah', -0.0012753405789300842],
           ['m', -2.066839808811949e-05],
           ['th', -0.19092082013669298],
           ['ih', -0.0455496201328458],
           ['ng', -0.022070129649597092]]],
         ['good',
          [['g', -0.004839482367080095],
           ['uh', -0.015431943445063823],
           ['d', -0.058225354422186015]]],
         ['just',
          [['jh', -0.08167516556038379],
           ['ah', -0.009080422959787171],
           ['s', -0.0016102936532984558],
           ['t', -0.004643510046978605]]],
         ['happened',
          [['hh', -0.0011154179743506631],
           ['ae', -0.0064626032916041964],
           ['p', -0.007374408697888413],
           ['ah', -0.006880172592891753],
           ['n', -0.0033417013936851703],
           ['d', -2.9543883905569626]]]],
 'Transcript': 's ah m th ih ng g uh d jh ah s t hh ae p ah n d'
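For what it's worth, a small sketch (not part of the repo) that scans an output like the one above and flags phones below an arbitrary threshold singles out exactly that final 'd' of "happened":

```python
# Sketch: flag phones whose GOP score falls below an arbitrary threshold,
# using the same nested [word, [[phone, score], ...]] structure as above.
gop = [['Something',
        [['s', -0.0151], ['ah', -0.0013], ['m', -2.07e-05],
         ['th', -0.1909], ['ih', -0.0455], ['ng', -0.0221]]],
       ['happened',
        [['hh', -0.0011], ['ae', -0.0065], ['p', -0.0074],
         ['ah', -0.0069], ['n', -0.0033], ['d', -2.9544]]]]

THRESHOLD = -0.5   # arbitrary cut-off for "suspiciously low"
flagged = [(word, phone, score)
           for word, phones in gop
           for phone, score in phones
           if score < THRESHOLD]
print(flagged)     # [('happened', 'd', -2.9544)]
```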

Many thanks, Tien-Hong Lo

frank613 commented 1 week ago

Hi Tien-Hong,

Regarding this issue, here are some tips:

  1. Are you using our model? If so, note that our model is fine-tuned on Librispeech train-clean-100, which means that speech considered "good" should follow the same style (U.S. English, non-accented, non-dialect, etc.). This might mismatch your definition of "completely correct". If that is the case, you could try to fine-tune a model on data from your own domain (making sure it contains no pronunciation errors). We have provided the fine-tuning script in one of the folders.

  2. A low GOP means the numerator is smaller than the denominator to a certain degree. If you use gop-CTC-AF-SD, the denominator contains all the possible substitutions and deletion errors as well as the correct path (the same as the numerator). You can diagnose this yourself with the script "gop-af-feats.py" by figuring out which index of the feature vector is unexpectedly large; that is an indication of what kind of error was made. For example, if the second dimension of the vector is relatively high, it indicates a deletion error from the acoustic model's point of view, because the second dimension of the vector corresponds to the first LPR, which is a deletion, after the LPP.
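As a toy sketch of that diagnosis (the index layout follows the comment above: index 0 is the LPP and index 1 is the first LPR, a deletion; the values are made up):

```python
# Sketch: inspect a 41-dim feature vector from gop-af-feats.py to see
# which LPR dominates, as a hint at the likely error type.
# Index layout assumed here: feat[0] = LPP, feat[1] = first LPR (deletion).
import numpy as np

feat = np.full(41, -8.0)      # toy feature vector; values are log-ratios
feat[0] = -0.1                # LPP
feat[1] = -0.3                # first LPR (deletion) is relatively high here

lpr = feat[1:]                # the LPR part of the vector
worst = int(np.argmax(lpr))   # index of the largest (least negative) LPR
if worst == 0:
    print("deletion is the most likely error type")
```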

Thanks, Xinwei