HawkAaron / E2E-ASR

PyTorch Implementations for End-to-End Automatic Speech Recognition

Question on feature transform #11

Closed · zhoudoufu closed this 4 years ago

zhoudoufu commented 5 years ago

@HawkAaron Hi, I have a question about the feature transform part of your code. According to Alex Graves' 2013 paper, the features used are described as:

The audio data was encoded using a Fourier-transform-based filter-bank with 40 coefficients (plus energy) distributed on a mel-scale, together with their first and second temporal derivatives. Each input vector was therefore size 123. The data were normalised so that every element of the input vectors had zero mean and unit variance over the training set.
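For concreteness, here is a rough sketch of that 123-dimensional feature, assuming librosa; the frame parameters (25 ms window, 10 ms hop) are common choices, not taken from this repo:

    import numpy as np
    import librosa

    def graves_features(wav_path, sr=16000):
        # 40 mel filter-bank coefficients + energy, each with first and
        # second derivatives: (40 + 1) * 3 = 123 dims per frame.
        y, _ = librosa.load(wav_path, sr=sr)
        n_fft, hop = int(0.025 * sr), int(0.010 * sr)  # assumed frame setup
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=40)
        log_mel = np.log(mel + 1e-10)                              # (40, T)
        power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
        energy = np.log(power.sum(axis=0, keepdims=True) + 1e-10)  # (1, T)
        static = np.vstack([log_mel, energy])                      # (41, T)
        d1 = librosa.feature.delta(static, order=1)
        d2 = librosa.feature.delta(static, order=2)
        return np.vstack([static, d1, d2]).T                       # (T, 123)

    # Normalisation as in the paper: zero mean and unit variance per
    # dimension, with statistics computed over the whole training set.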

In your code, DataLoader.py, the feature transform part is:

copy-feats scp:data_timit/{}/feats.scp ark:- | \
    apply-cmvn --utt2spk=ark:data_timit/{}/utt2spk scp:data_timit/{}/cmvn.scp ark:- ark:- | \
    add-deltas --delta-order=2 ark:- ark:- | \
    nnet-forward data_timit/final.feature_transform ark:- ark:-
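For reference, a pipe like this is typically consumed from Python roughly as in the sketch below (assuming the kaldi_io package; the actual reader in DataLoader.py may differ in details):

    import kaldi_io  # assumption: the vesis84/kaldi-io-for-python package

    split = "train"  # the '{}' placeholders are filled with a dataset split
    cmd = ("copy-feats scp:data_timit/{0}/feats.scp ark:- | "
           "apply-cmvn --utt2spk=ark:data_timit/{0}/utt2spk "
           "scp:data_timit/{0}/cmvn.scp ark:- ark:- | "
           "add-deltas --delta-order=2 ark:- ark:- | "
           "nnet-forward data_timit/final.feature_transform ark:- ark:- |"
           ).format(split)

    # The trailing '|' tells kaldi_io to run the command and stream its ark output.
    for utt_id, feats in kaldi_io.read_mat_ark(cmd):
        print(utt_id, feats.shape)  # feats is a (num_frames, feat_dim) numpy array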

Correct me if I'm wrong, but I think the feature transform is already accomplished before the nnet-forward command. So why do you use a neural net to make the feature embedding?

When I looked into feature_transform.sh, I got more confused: the nnet-forward part seems to be another feature normalization all over again. Can you explain this part a bit? Thanks.

HawkAaron commented 5 years ago

apply-cmvn --utt2spk performs per-speaker CMN; nnet-forward with the feature_transform performs global CMVN.
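In other words, the two steps compose rather than repeat: the first removes each speaker's offset, the second standardises every dimension over the whole training set (the deltas are added in between). A rough numpy sketch of the idea, with illustrative names not taken from the repo:

    import numpy as np

    def per_speaker_cmn(feats_by_utt, utt2spk):
        # Subtract each speaker's mean, as apply-cmvn --utt2spk does with
        # the per-speaker statistics stored in cmvn.scp.
        sums, counts = {}, {}
        for utt, f in feats_by_utt.items():
            spk = utt2spk[utt]
            sums[spk] = sums.get(spk, 0.0) + f.sum(axis=0)
            counts[spk] = counts.get(spk, 0) + f.shape[0]
        return {utt: f - sums[utt2spk[utt]] / counts[utt2spk[utt]]
                for utt, f in feats_by_utt.items()}

    def global_cmvn(feats_by_utt):
        # Zero mean / unit variance per dimension over all training frames;
        # the feature_transform applied by nnet-forward stores an equivalent
        # fixed shift and scale estimated from the training data.
        frames = np.concatenate(list(feats_by_utt.values()), axis=0)
        mean, std = frames.mean(axis=0), frames.std(axis=0) + 1e-10
        return {utt: (f - mean) / std for utt, f in feats_by_utt.items()}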