astorfi / lip-reading-deeplearning

:unlock: Lip Reading - Cross Audio-Visual Recognition using 3D Architectures
Apache License 2.0

Question about speech features #13

Closed bg193 closed 6 years ago

bg193 commented 6 years ago

As you mentioned before the "use of non-overlapping hamming windows for generating speech features", I'm not sure how to do that here. Could you describe the procedure in detail? Thanks a lot!

astorfi commented 6 years ago

Speech features are generated by applying hamming windows to the signal. The windows can be either overlapping or non-overlapping.
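As an illustration, here is a minimal numpy sketch of Hamming-windowed framing (not the repository's actual pipeline; the signal and sampling rate are placeholders). A hop (`frame_stride`) smaller than the window length gives overlapping frames; a hop equal to the window length gives non-overlapping frames:

```python
import numpy as np

def frame_signal(signal, fs, frame_length, frame_stride):
    """Split a 1-D signal into Hamming-windowed frames.
    frame_length / frame_stride are in seconds; stride == length
    gives non-overlapping frames, stride < length gives overlap."""
    frame_len = int(round(frame_length * fs))
    step = int(round(frame_stride * fs))
    num_frames = 1 + (len(signal) - frame_len) // step
    window = np.hamming(frame_len)
    return np.stack([signal[i * step: i * step + frame_len] * window
                     for i in range(num_frames)])

fs = 16000
signal = np.random.randn(fs)  # 1 second of dummy audio

overlapping = frame_signal(signal, fs, 0.025, 0.01)       # 25 ms windows, 10 ms hop
non_overlapping = frame_signal(signal, fs, 0.020, 0.020)  # 20 ms windows, no overlap

print(overlapping.shape)      # (98, 400)
print(non_overlapping.shape)  # (50, 320)
```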

bg193 commented 6 years ago

@astorfi As you mentioned in your paper, the training data is LRW and the visual network takes 30fps video as input. But LRW videos are 25fps. Was the training data converted to 30fps before being used to train the model?

astorfi commented 6 years ago

I am not quite sure whether LRW videos are 25fps; as I remember, they were around 29fps. Yes, all of them were converted to 30fps regardless of their initial frame rate.

astorfi commented 6 years ago

Please refer to this function to get an idea of how the input pipeline works for the speech network.

bg193 commented 6 years ago

@astorfi Your paper says "the temporal features are non-overlapping 20ms windows".

logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs, frame_length=0.025, frame_stride=0.01, num_filters=num_coefficient, fft_length=1024, low_frequency=0, high_frequency=None)

Here frame_length=0.025 and frame_stride=0.01 confuse me. Is frame_length the window size, and does frame_stride=0.01 mean the windows overlap? In addition, there is no hamming window or standardization applied to the audio data. I'm most puzzled about these points.

astorfi commented 6 years ago

@xuehui frame_stride=0.01 means overlapping frames; that was just an example in the function illustrating the input pipeline. Setting frame_stride=0.025 (equal to frame_length) means non-overlapping. I will change it to avoid confusing others. Thank you so much for pointing it out.
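To make the distinction concrete, here is a small arithmetic sketch (assuming a hypothetical 16 kHz, 1-second signal; the numbers are for illustration only):

```python
fs = 16000
num_samples = fs                     # 1 second of audio
frame_len = int(0.025 * fs)          # 400-sample (25 ms) window

# overlapping: 10 ms hop, so consecutive windows share 15 ms of signal
step_overlap = int(0.01 * fs)        # 160 samples
n_overlap = 1 + (num_samples - frame_len) // step_overlap

# non-overlapping: hop equal to the window length
step_nonoverlap = int(0.025 * fs)    # 400 samples
n_nonoverlap = 1 + (num_samples - frame_len) // step_nonoverlap

print(n_overlap, n_nonoverlap)  # 98 40
```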

ghost commented 6 years ago

@astorfi I am trying to change the parameters in the input_feature file so that I have the exact simulation environment of your valuable paper. In section A of your paper, you said

the temporal features are non-overlapping 20ms windows

, so I should change frame_length=0.025 and frame_stride=0.01 to the following lines in input_feature:

logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs,
frame_length=0.020, frame_stride=0.0,
num_filters=num_coefficient, fft_length=1024, low_frequency=0,
high_frequency=None)

Also, you mentioned:

The input speech feature map, which is represented as an image cube, corresponds to the spectrogram, as well as the first and second derivatives of the MFEC features,...

Assuming that the parameter cube_shape is the aforementioned image cube in the paper, why is it of size (20, 80, 40)? Is this just an example? Should I use (num_utterance=3, num_frames=15, num_coefficient=40) to have the same parameter setting as the paper? Please correct me if I am wrong.
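For context, the "image cube" described in the paper can be sketched with numpy as the static features stacked with their first and second derivatives. This is a hedged sketch: the delta function below is a generic regression-based derivative, not necessarily speechpy's exact implementation, and logenergy is random stand-in data:

```python
import numpy as np

def delta(feat, n=2):
    """Generic delta (time-derivative) features along the frame axis;
    a common regression formulation, not speechpy's exact code."""
    padded = np.pad(feat, ((n, n), (0, 0)), mode='edge')
    denom = 2 * sum(i * i for i in range(1, n + 1))
    return sum(i * (padded[n + i:len(feat) + n + i]
                    - padded[n - i:len(feat) + n - i])
               for i in range(1, n + 1)) / denom

num_frames, num_coefficient = 15, 40
logenergy = np.random.randn(num_frames, num_coefficient)  # stand-in for lmfe output

# stack static features with first and second derivatives into an image cube
cube = np.stack([logenergy, delta(logenergy), delta(delta(logenergy))])
print(cube.shape)  # (3, 15, 40)
```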

astorfi commented 6 years ago

@nooshin85 I reopened this issue so we can follow this.

astorfi commented 6 years ago

I wanted to mention some modifications:

The input_feature must be changed as follows:

logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs,
frame_length=0.020, frame_stride=0.020,
num_filters=num_coefficient, fft_length=1024, low_frequency=0,
high_frequency=None)

So frame_length == frame_stride for the non-overlapping scenario.
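As a quick sanity check of that setting (a toy numpy example with a dummy ramp signal, not real audio): when the hop equals the window length, the frames tile the signal exactly, sharing no samples:

```python
import numpy as np

fs = 16000
signal = np.arange(fs, dtype=float)   # 1 s ramp as dummy audio
frame_len = step = int(0.020 * fs)    # 320 samples: hop == window length

num_frames = len(signal) // step
frames = signal[:num_frames * step].reshape(num_frames, frame_len)

# stride == length -> frames tile the signal with no shared samples
assert np.array_equal(frames.reshape(-1), signal[:num_frames * step])
print(frames.shape)  # (50, 320)
```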

About the second part, you are absolutely correct: that was just an example. You should modify it as convenient.

astorfi commented 6 years ago

@nooshin85 I just changed the input_feature.py function to generate the exact shape for the cube.

ghost commented 6 years ago

Thanks a lot for the quick response.

astorfi commented 6 years ago

@nooshin85 My pleasure. Please close this issue once you have confirmed that the problem is resolved. Thanks.