bagustris / SER_ICSigSys2019

Repository of code for Speech emotion recognition using voiced speech and attention model, submitted to ICSigSys 2019

model.add(LSTM(512, return_sequences=True, input_shape=(100, 34))) #2

Open jinsple opened 3 years ago

jinsple commented 3 years ago

Why does input_shape equal (100, 34)? Does 100 mean time_steps? How should I understand it? Thank you very much!

bagustris commented 3 years ago

Yes, 100 is the number of timesteps and 34 is the number of features. The acoustic features are extracted per frame, and for each utterance I limited the number of frames to 100. In other words, the input feature size for each utterance is 100 rows by 34 columns, i.e., each row holds 34 values.
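
For illustration, here is a minimal sketch of how utterance features of shape (100, 34) can be padded/truncated and fed to the LSTM. This is not the exact code from this repo; the random feature matrices and padding choices are placeholders:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_FRAMES = 100   # timesteps per utterance (frames)
N_FEATURES = 34    # acoustic features per frame

# Hypothetical per-utterance feature matrices, each (n_frames, 34);
# shorter utterances are zero-padded, longer ones truncated to 100 frames.
utterances = [np.random.rand(n, N_FEATURES) for n in (80, 100, 130)]
X = pad_sequences(utterances, maxlen=MAX_FRAMES, dtype='float32',
                  padding='post', truncating='post')
print(X.shape)  # (3, 100, 34) -> (batch, timesteps, features)

model = Sequential()
model.add(LSTM(512, return_sequences=True,
               input_shape=(MAX_FRAMES, N_FEATURES)))  # the line in question
```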

jinsple commented 3 years ago

I am very glad to receive your reply; I got it. Thank you very much! May I also ask how long each frame is? Between 20 and 30 ms?

bagustris commented 3 years ago

Typically a frame is between 15 and 30 ms, but in this paper I used 200 ms. See the save_feature.py code in this repo (window_n = 0.2).

So it makes sense that the number of timesteps is only 100.

In my other research, I used 25 ms frames, resulting in around 3000 timesteps; see, for instance, https://github.com/bagustris/ravdess_song_speech/
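
As a rough sketch of what a 200 ms analysis window implies for the frame count, per-frame features could be extracted like this (using librosa MFCCs as a stand-in for the 34 features; the hop length, feature set, and file name here are assumptions, not necessarily what save_feature.py does):

```python
import librosa

# Hypothetical input file and sampling rate.
y, sr = librosa.load('utterance.wav', sr=16000)

window_n = 0.2               # 200 ms window, as in the paper
n_fft = int(window_n * sr)   # 3200 samples at 16 kHz
hop_length = n_fft // 2      # assumed 50% overlap (100 ms hop)

# 34 MFCCs per frame as a stand-in for the paper's 34 acoustic features.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=34,
                            n_fft=n_fft, hop_length=hop_length)
print(mfcc.T.shape)  # (n_frames, 34); a ~10 s clip gives ~100 frames at a 100 ms hop
```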