georgesterpu / avsr-tf1

Audio-Visual Speech Recognition using Sequence to Sequence Models
GNU General Public License v3.0

[feature] minimum data length for stack log mel feature #14

Closed: LeeYongHyeok closed this issue 5 years ago

LeeYongHyeok commented 5 years ago

Hi @georgesterpu

I am doing curriculum learning on the LRS3 dataset.

So I cut the pretrain wav files into segments of 1 to 3 words (about 500 ms each) and generated the feature TFRecords.

When I tried to create the stacked log mel features (stacked_w8s3), an error occurred.

I did not save the error log, but I suspect the wav files are too short.

Is there a minimum length of data needed to create a stacked log mel feature?

georgesterpu commented 5 years ago

We only tried sentences as short as the shortest sentence in LRS2, so I'm sorry that I don't have an exact value right now.

There are multiple possible error sources that I can think of (the STFT code in TensorFlow, the feature stacking code in NumPy, etc.), and it would be a lot easier to narrow them down if you provided an error log and/or a call stack.

However, I'd expect the code to run well on 500 ms audio segments.
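
As a rough sanity check of that expectation, here is a minimal sketch of the frame arithmetic. It is not the repository's code. It assumes 16 kHz audio, 25 ms analysis windows with a 10 ms hop, no padding, and that `stacked_w8s3` means stacking 8 consecutive frames with a stride of 3. All of these values, and the function names, are assumptions made for illustration.

```python
# Hypothetical frame-count arithmetic; every constant below is an assumption.
SAMPLE_RATE = 16000  # samples per second (assumed)
FRAME_LEN = 400      # 25 ms analysis window at 16 kHz (assumed)
HOP = 160            # 10 ms hop at 16 kHz (assumed)


def num_frames(num_samples, frame_len=FRAME_LEN, hop=HOP):
    """Number of full analysis frames when no padding is applied."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop


def num_stacked(frames, window=8, stride=3):
    """Number of stacked feature vectors for a given frame count."""
    if frames < window:
        return 0
    return 1 + (frames - window) // stride


def stacked_count(duration_ms):
    """Stacked feature vectors obtained from a clip of the given duration."""
    samples = SAMPLE_RATE * duration_ms // 1000
    return num_stacked(num_frames(samples))


def min_samples_for_one_stack(window=8):
    """Shortest signal that still yields one stacked feature vector."""
    return FRAME_LEN + (window - 1) * HOP


print(stacked_count(500))           # 14 under these assumptions
print(min_samples_for_one_stack())  # 1520 samples, i.e. 95 ms at 16 kHz
```

Under these assumptions a 500 ms segment still produces 14 stacked vectors, and only clips shorter than roughly 95 ms would produce none. A different window, hop, or padding scheme changes the numbers.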

Are you also using the latest code on the master branch? @saamc found a bug a couple of weeks ago, where the length of the noise segment could be one sample shorter than the length of the input signal if you are unlucky enough with the random number generator.
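
To illustrate the kind of length mismatch described here, the following is a minimal sketch of drawing a random noise segment whose length always matches the input signal. It is not the fix that was applied in the repository. The function name `random_noise_segment` and the choice to tile short noise recordings are assumptions made for this example.

```python
import numpy as np


def random_noise_segment(noise, num_samples, rng):
    """Return a random slice of `noise` with exactly `num_samples` samples."""
    noise = np.asarray(noise)
    if noise.size == 0:
        raise ValueError("noise recording is empty")
    if noise.size < num_samples:
        # Repeat the noise so that it covers the whole signal (assumed policy).
        repeats = -(-num_samples // noise.size)  # ceiling division
        noise = np.tile(noise, repeats)
    # The upper bound of `integers` is exclusive, so the last valid start
    # offset (noise.size - num_samples) is still reachable and the slice
    # can never run past the end of the array.
    start = rng.integers(0, noise.size - num_samples + 1)
    segment = noise[start:start + num_samples]
    assert segment.size == num_samples
    return segment


rng = np.random.default_rng(0)
signal = np.zeros(8000)                # 500 ms at 16 kHz (assumed rate)
noise = rng.standard_normal(16000)     # synthetic stand-in for a noise file
noisy = signal + random_noise_segment(noise, signal.size, rng)
print(noisy.shape)
```

Working with integer sample offsets throughout, instead of rounding a floating-point position, is one way to keep the two lengths equal for every random draw.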

LeeYongHyeok commented 5 years ago

@georgesterpu, thanks for your reply.

I will try the newest code.