Open yuan-cherish opened 1 year ago
Hi there,
Thanks for reaching out.
First, I used the following code on my machine to extract one of the wav files of Speechocean762. I used wav.scp to encode, but an encoding error was reported. Specifically, kaldio should use open to open the wav file. Do you know how to fix this?
How did you generate the scp file for a single audio file? Did you use the Kaldi SO762 recipe? If not, there's no guarantee that our code can read the scp file. I haven't tried to load an scp file for a single audio file. I uploaded a sample scp file at https://www.dropbox.com/s/53b0p8awapt5f22/feat.scp?dl=1 so you can compare it with yours.
A general recommendation is to first follow our instructions step by step and, once that succeeds, make your modifications on top of the working setup.
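One quick way to compare your scp file with the sample one is to check that each line has the same shape. The sketch below is only an illustration, not code from our repository: `ScpEntry` and `parse_scp_line` are hypothetical names, and it assumes the usual Kaldi convention of `<key> <rxfilename>` per line, where the rxfilename is a plain path, a `path:byte_offset` pointer into an ark file, or a command ending in `|`. Range specifiers such as `[0:10]` are not handled.

```python
from typing import NamedTuple, Optional


class ScpEntry(NamedTuple):
    """One parsed scp line (hypothetical helper type for this sketch)."""
    key: str
    target: str            # file path, or the full command for piped entries
    offset: Optional[int]  # byte offset into an ark file, if present
    is_pipe: bool          # True when the entry is a command ending in '|'


def parse_scp_line(line: str) -> ScpEntry:
    """Split '<key> <rxfilename>' into its parts."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"expected '<key> <rxfilename>', got: {line!r}")
    key, rx = parts[0], parts[1].strip()
    if rx.endswith("|"):
        # Piped input, e.g. a sox command that writes wav data to stdout.
        return ScpEntry(key, rx, None, True)
    path, sep, tail = rx.rpartition(":")
    if sep and tail.isdigit():
        # 'feats.ark:1234' style pointer into an archive.
        return ScpEntry(key, path, int(tail), False)
    return ScpEntry(key, rx, None, False)


if __name__ == "__main__":
    for line in ["utt1 /data/feats.ark:42", "utt2 /data/a.wav"]:
        print(parse_scp_line(line))
```

If your entries are plain wav paths while the sample entries point into an ark file with an offset, the two files hold different kinds of data, and a reader expecting one is unlikely to handle the other.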
I checked your extracted tr_feats.csv and found that each line has 85 dimensions. Why is the feature I extracted using torchaudio.compliance.kaldi not 85-dimensional? Also, the feature I extracted from a wav file in the speechocean762 dataset has a length much greater than 50, while the features of all files in your code have a length less than 50. Why is that?
By feature, we mean GOP features, not the mel fbank features you extracted; please see Equations 1-4 in the paper. The feature dimension can differ from 85 depending on the phone dictionary size of your ASR system, e.g., PAII-A has 88 or 86. For an English ASR system, though, the difference won't be large. Mel fbank features are entirely different, because GOP features are the output of a (trained) ASR model. The sequence length is the number of words, so it can be any number; we used 50 only because the max sequence length in the speechocean762 dataset is 50.
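To make the length difference concrete, here is a rough sketch (not our actual code). `num_frames` counts frame-level fbank vectors under Kaldi-style framing with `snip_edges=True`; the 16 kHz sample rate and the 25 ms / 10 ms window settings are assumptions for illustration. `pad_sequence` is a hypothetical helper that pads a short GOP sequence to a fixed length of 50, with a pad value of 0.0 chosen arbitrarily.

```python
def num_frames(num_samples, sample_rate=16000,
               frame_length_ms=25.0, frame_shift_ms=10.0):
    """Frame count for Kaldi-style framing with snip_edges=True."""
    win = int(sample_rate * frame_length_ms / 1000)
    shift = int(sample_rate * frame_shift_ms / 1000)
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def pad_sequence(feats, max_len=50, pad_value=0.0):
    """Pad a (seq_len x dim) list of rows to (max_len x dim)."""
    if len(feats) > max_len:
        raise ValueError(f"sequence length {len(feats)} exceeds {max_len}")
    dim = len(feats[0]) if feats else 0
    padding = [[pad_value] * dim for _ in range(max_len - len(feats))]
    return [list(row) for row in feats] + padding


if __name__ == "__main__":
    # A 3-second utterance yields hundreds of fbank frames ...
    print(num_frames(3 * 16000))          # 298
    # ... while a GOP sequence has one row per unit in the utterance.
    gop = [[1.0] * 85 for _ in range(12)]  # made-up 12 x 85 sequence
    padded = pad_sequence(gop)
    print(len(padded), len(padded[0]))    # 50 85
```

So a frame-level feature grows with the audio duration, whereas the GOP sequence only grows with the content of the utterance, which is why it stays within 50 for this dataset.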
3. If I just want the model to identify the speaker's fluency, can I make some simplifications, such as not using the ASR model?
This entire work depends on a trained ASR model. I don't know of a way to skip it.
Hope this helps.
-Yuan
Hello author, I have recently been studying the model you proposed and want to use your pre-trained model to run inference on my own data, but I have the following questions:
3. If I just want the model to identify the speaker's fluency, can I make some simplifications, such as not using the ASR model?