Open jinsple opened 3 years ago
Yes. 100 is the number of timesteps and 34 is the number of features. The acoustic features are extracted per frame, and for each utterance I limited the number of frames to 100. In other words, the input for each utterance has 100 rows and 34 columns, so each row holds 34 values.
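To make the shape concrete, here is a minimal sketch using NumPy. The helper `fix_length` is hypothetical (it is not taken from this repo), and zero-padding short utterances is an assumption about how the fixed length could be reached; check save_feature.py for what is actually done.

```python
import numpy as np

MAX_FRAMES = 100   # timesteps kept per utterance
N_FEATURES = 34    # acoustic features extracted per frame

def fix_length(features, max_frames=MAX_FRAMES):
    """Truncate or zero-pad a (n_frames, n_features) array to max_frames rows.

    Hypothetical helper for illustration only.
    """
    features = np.asarray(features, dtype=np.float32)
    out = np.zeros((max_frames, features.shape[1]), dtype=np.float32)
    n = min(max_frames, features.shape[0])
    out[:n] = features[:n]   # copy the frames that fit; the rest stay zero
    return out

# Dummy utterances: one shorter and one longer than 100 frames.
short_utt = fix_length(np.ones((60, N_FEATURES)))
long_utt = fix_length(np.ones((250, N_FEATURES)))

# Stacking utterances gives (n_utterances, 100, 34); a Keras recurrent layer
# with input_shape=(100, 34) consumes one (100, 34) slice per utterance.
batch = np.stack([short_utt, long_utt])
print(short_utt.shape, long_utt.shape, batch.shape)
```

Each row of an utterance array is one frame (one timestep) and each column is one feature, which is why input_shape omits the batch dimension.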
I am very glad to receive your reply. I got it. Thank you very much! May I also ask how long each frame is? Between 20 and 30 ms?
Typically the frame length is between 15 and 30 ms. But in this paper I used 200 ms. See the save_feature.py code in this repo (window_n = 0.2).
So it makes sense that there are only 100 timesteps.
In my other research I used a 25 ms frame, resulting in around 3000 timesteps, for instance here: https://github.com/bagustris/ravdess_song_speech/
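As a rough sketch of why the frame length changes the number of timesteps so much: the function `n_frames` below is hypothetical, and the utterance durations and hop sizes are illustrative assumptions, not values taken from either repo. Only the 200 ms and 25 ms window lengths come from the discussion above.

```python
def n_frames(duration_ms, win_ms, hop_ms):
    """Number of full frames that fit in a signal of duration_ms.

    Hypothetical helper; uses integer milliseconds to avoid float rounding.
    """
    if duration_ms < win_ms:
        return 0
    return 1 + (duration_ms - win_ms) // hop_ms

# Assumed: a 20 s utterance, 200 ms window, no overlap (hop equals window).
long_window = n_frames(20_000, win_ms=200, hop_ms=200)

# Assumed: a 30 s utterance, 25 ms window, 10 ms hop.
short_window = n_frames(30_000, win_ms=25, hop_ms=10)

print(long_window, short_window)  # 100 2998
```

Under these assumptions, a 200 ms window yields on the order of 100 frames per utterance, while a 25 ms window with overlap yields a few thousand.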
Why does input_shape equal (100, 34)? Does 100 mean time_steps? How should I understand it? Thank you very much!