Speech feature generation uses Hamming windows. The windows can be either overlapping or non-overlapping.
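For reference, here is a minimal sketch (not the repository's exact code) of framing a signal and applying a Hamming window, assuming 16 kHz audio and 20 ms frames; with a stride equal to the frame length the frames are non-overlapping, and with a smaller stride they overlap:

import numpy as np

def hamming_frames(signal, fs, frame_length=0.020, frame_stride=0.020):
    # Hypothetical helper: split the signal into frames and apply a Hamming window.
    frame_len = int(round(frame_length * fs))   # samples per frame
    hop = int(round(frame_stride * fs))         # samples between frame starts
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(num_frames)])

fs = 16000
signal = np.random.randn(fs)             # 1 second of dummy audio
print(hamming_frames(signal, fs).shape)  # (50, 320): 50 non-overlapping 20 ms frames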
@astorfi As you mentioned in your paper, the training data uses LRW and the visual network takes 30fps video as input. But LRW videos are 25fps. Was the training data converted to 30fps before being used to train the model?
I am not quite sure that LRW videos are 25fps; as I remember, they were around 29fps. Yes, all of them have been converted to 30fps regardless of their initial frame rate.
Please refer to this function to get an idea of how the input pipeline works for the speech network.
@astorfi Your paper says "the temporal features are non-overlapping 20ms windows".
logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs, frame_length=0.025, frame_stride=0.01, num_filters=num_coefficient, fft_length=1024, low_frequency=0, high_frequency=None)
Here frame_length=0.025 and frame_stride=0.01 confuse me. Is frame_length the window size, and does frame_stride=0.01 mean the windows overlap? In addition, there is no Hamming window or standardization applied to the audio data. I'm most puzzled about these points.
@xuehui frame_stride=0.01 means overlapping frames; that was just an example in the input pipeline function. Setting frame_stride=0.025, equal to the frame length, means non-overlapping. I will change it so it does not confuse others. Thank you so much for pointing it out.
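For concreteness, a sketch comparing the two settings (assuming 16 kHz dummy audio and num_coefficient = 40; this is not the repository's exact pipeline):

import numpy as np
import speechpy

fs = 16000
signal = np.random.randn(fs)  # 1 second of dummy audio
num_coefficient = 40

# Overlapping frames: 25 ms windows taken every 10 ms
overlapping = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                                    frame_length=0.025, frame_stride=0.01,
                                    num_filters=num_coefficient, fft_length=1024,
                                    low_frequency=0, high_frequency=None)

# Non-overlapping frames: the stride equals the frame length
non_overlapping = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                                        frame_length=0.025, frame_stride=0.025,
                                        num_filters=num_coefficient, fft_length=1024,
                                        low_frequency=0, high_frequency=None)

print(overlapping.shape)      # roughly (98, 40): about one frame every 10 ms
print(non_overlapping.shape)  # roughly (40, 40): about one frame every 25 ms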
@astorfi I am trying to change the parameters in the input_feature file so that I have exactly the simulation environment of your paper. In section A of your paper, you said:
the temporal features are non-overlapping 20ms windows
so I should change frame_length=0.025 and frame_stride=0.01 to the following lines in input_feature:
logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs,
frame_length=0.020, frame_stride=0.0,
num_filters=num_coefficient, fft_length=1024, low_frequency=0,
high_frequency=None)
Also, you mentioned:
The input speech feature map, which is represented as an image cube, corresponds to the spectrogram, as well as the first and second derivatives of the MFEC features, ...
Assuming that the parameter cube_shape is the aforementioned image cube from the paper, why is it of size (20, 80, 40)? Is this just an example? Should I use (num_utterance=3, num_frames=15, num_coefficient=40) to have the same parameter settings as the paper? Please correct me if I am wrong.
@nooshin85 I reopened this issue so we can follow this.
I wanted to mention some modifications:
The input_feature must be changed as follows:
logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs,
frame_length=0.020, frame_stride=0.020,
num_filters=num_coefficient, fft_length=1024, low_frequency=0,
high_frequency=None)
So frame_length == frame_stride for the non-overlapping scenario (see the quick sanity check sketched below).
About the second part, you are absolutely correct. That was just an example; you should modify it as convenient.
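As a quick sanity check (a sketch, assuming 16 kHz audio and num_coefficient = 40): with frame_length == frame_stride == 0.020, consecutive frames share no samples, so a 0.3 s chunk of audio yields about 15 feature frames, matching num_frames = 15 from the discussion above.

import numpy as np
import speechpy

fs = 16000
num_coefficient = 40
chunk = np.random.randn(int(0.3 * fs))  # 0.3 seconds of dummy audio

logenergy = speechpy.feature.lmfe(chunk, sampling_frequency=fs,
                                  frame_length=0.020, frame_stride=0.020,
                                  num_filters=num_coefficient, fft_length=1024,
                                  low_frequency=0, high_frequency=None)
print(logenergy.shape)  # roughly (15, 40)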
@nooshin85 I just changed the input_feature.py function to generate the exact shape for the cube.
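For reference, a hypothetical sketch of how such a cube could be assembled: static MFEC (log-energy) features plus their first and second temporal derivatives stacked as three channels (here via speechpy's extract_derivative_feature helper), then grouped into chunks of num_frames = 15 frames by num_coefficient = 40 filters. The exact shapes and ordering in input_feature.py may differ; this only illustrates the general idea.

import numpy as np
import speechpy

fs = 16000
num_coefficient = 40
num_frames = 15                    # 15 non-overlapping 20 ms frames = 0.3 s per cube
signal = np.random.randn(3 * fs)   # 3 seconds of dummy audio

# Static MFEC features over non-overlapping 20 ms windows
logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                                  frame_length=0.020, frame_stride=0.020,
                                  num_filters=num_coefficient, fft_length=1024,
                                  low_frequency=0, high_frequency=None)

# Append first and second temporal derivatives -> (total_frames, 40, 3)
feature_cube = speechpy.feature.extract_derivative_feature(logenergy)

# Group consecutive frames into cubes of (num_frames, num_coefficient, 3)
num_cubes = feature_cube.shape[0] // num_frames
cubes = feature_cube[:num_cubes * num_frames].reshape(
    num_cubes, num_frames, num_coefficient, 3)
print(cubes.shape)  # e.g. (10, 15, 40, 3) for about 3 seconds of audio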
Thanks a lot for the prompt response.
@nooshin85 My pleasure. Please close this issue once you have confirmed that the problem has been resolved. Thanks.
Regarding what you mentioned before about the "use of non-overlapping hamming windows for generating speech features", I'm not sure how to do that here. Could you describe the procedure in detail? Thanks a lot!