dr-pato / audio_visual_speech_enhancement

Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments
https://dr-pato.github.io/audio_visual_speech_enhancement/
Apache License 2.0

Question about training vl2m with fixed TFRecord type #18

Closed: hmy410 closed this issue 4 years ago

hmy410 commented 4 years ago

Hello, thanks for your work. I'm training the VL2M model on the GRID dataset. I set TFRecord type='fixed', num_audio_samples=48000, batch_size=10, but I get an error when I start training:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Name: , Key: base_audio_wav, Index: 0. Number of float values != expected. values size: 19200 but output shape: [48000]

I tried changing num_audio_samples, but it didn't help. Does 19200 mean the length of the 0th wav, while 48000 means the number of audio wavs? Looking forward to your reply.

dr-pato commented 4 years ago

Hi, num_audio_samples is the length of the audio array. If the type is 'fixed', all your wavs have to be the same length. In your case, it seems you have at least one wav with 19200 samples instead of 48000. Maybe something went wrong during the audio preprocessing stage. Are you able to produce one or more TFRecords?
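For what it's worth, with type='fixed' each serialized waveform is parsed against a fixed output shape (here [48000]), so any clip with a different number of samples fails at parse time exactly as in the error above. A minimal sketch for spotting short clips before building the TFRecords; the directory layout is assumed, not the repository's actual one:

```python
import glob
import wave

# num_audio_samples used when building the 'fixed' TFRecords
NUM_AUDIO_SAMPLES = 48000

# Quick sanity check over the preprocessed wavs (directory name is hypothetical):
for path in sorted(glob.glob('grid_preprocessed/**/*.wav', recursive=True)):
    with wave.open(path, 'rb') as w:
        n_frames = w.getnframes()
    if n_frames != NUM_AUDIO_SAMPLES:
        print(f'{path}: {n_frames} samples (expected {NUM_AUDIO_SAMPLES})')
```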

hmy410 commented 4 years ago

Thanks for your advice. I changed the type to 'var' because the audio arrays have different lengths (I use the GRID dataset), and I also reset num_audio_samples. But I have another question now. Since AV concat-ref is retrained while freezing the parameters of the VL2M component, does that mean I should output the TBMs estimated by VL2M and use them to replace the original TBMs computed by LTASS?
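For context, here is roughly how the two TFRecord types differ at parse time. The feature key is taken from the error message above; the rest of the schema is assumed and may differ from the repository's actual parsing code:

```python
import tensorflow as tf

# With type='var', the waveform is stored as a variable-length feature and
# parsed into a sparse tensor, so clips of different lengths are accepted.
def parse_example(serialized):
    features = {
        'base_audio_wav': tf.io.VarLenFeature(tf.float32),
        # with type='fixed' this would instead be:
        # 'base_audio_wav': tf.io.FixedLenFeature([48000], tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, features)
    return tf.sparse.to_dense(parsed['base_audio_wav'])
```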

dr-pato commented 4 years ago

Yes, you are right. You must save the TBMs estimated by VL2M and regenerate the TFRecords, replacing the original TBM with the one estimated by VL2M. Then you can train the AV concat-ref model.
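Sketched out, the workflow would look roughly like this; the names below are placeholders, not the repository's actual functions:

```python
import numpy as np

# Hypothetical outline: (1) run the trained VL2M model over every sample,
# (2) save the TBMs it estimates, (3) regenerate the TFRecords so the TBM
# feature is loaded from these files instead of being computed from LTASS,
# then train AV concat-ref on the new TFRecords with the VL2M weights frozen.
def export_vl2m_tbms(run_vl2m_inference, samples, out_dir):
    """run_vl2m_inference: placeholder callable mapping one sample's inputs
    to its estimated TBM; samples: iterable of (sample_id, inputs) pairs."""
    for sample_id, inputs in samples:
        tbm = run_vl2m_inference(inputs)  # estimated binary mask from VL2M
        np.save(f'{out_dir}/{sample_id}_tbm.npy', tbm)
```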