dr-pato / audio_visual_speech_enhancement

Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments
https://dr-pato.github.io/audio_visual_speech_enhancement/
Apache License 2.0

Training: Values size X by output shape Y #11

Closed nanometer34688 closed 4 years ago

nanometer34688 commented 4 years ago

Running the training command:

python3 av_speech_enhancement.py training --data_dir data/tf_records/ --train_set TRAINING_SET --val_set VALIDATION_SET --exp 1 --mode fixed -ns 48000 --model vl2m --opt adam -lr 0.005 -nl 1 -nh 1 -d 1

Gives me the error below

 Traceback (most recent call last):
  File "av_speech_enhancement.py", line 227, in <module>
    main()
  File "av_speech_enhancement.py", line 217, in main
    train(args.model, args.data_dir, args.train_set, args.val_set, config, args.exp, args.mode)
  File "/home/git/audio_visual_speech_enhancement/training.py", line 130, in train
    val_mixed_audio, val_base_paths, val_other_paths, val_mixed_paths = sess.run(next_val_batch)
  File "/home/git/audio_visual_speech_enhancement/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/git/audio_visual_speech_enhancement/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/git/audio_visual_speech_enhancement/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/git/audio_visual_speech_enhancement/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Name: , Key: base_audio_wav, Index: 0.  Number of float values != expected.  values size: 29440 but output shape: [48000]
         [[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
         [[validation_batch/IteratorGetNext]]

Am I doing something wrong?

I have already created the TF records ready for training.

Am I right in assuming that num_audio_samples is the length of the output audio when training?

nanometer34688 commented 4 years ago

Hi @dr-pato,

I'm just wondering if you are able to help me understand this issue?

Thanks

dr-pato commented 4 years ago

Hi @nanometer34688, you got that error because there is a mismatch between the length you declared (48000) and the actual length of the WAV in your TFRecord (29440). You can check whether the lengths (in bytes) of your TFRecords are all equal. If not, I guess something went wrong during the generation of the TFRecords.
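
For reference, a minimal sketch of such a check over the source WAV files (assuming the soundfile package; WAV_DIR and the expected length are placeholders for your setup):

    import glob
    import os

    import soundfile as sf

    WAV_DIR = "data/audio"   # hypothetical path: point this at your WAV folder
    EXPECTED = 48000         # the value passed via -ns

    # Report every file whose sample count differs from the declared length.
    for path in sorted(glob.glob(os.path.join(WAV_DIR, "*.wav"))):
        with sf.SoundFile(path) as f:
            if len(f) != EXPECTED:
                print("{}: {} samples (expected {})".format(path, len(f), EXPECTED))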

nanometer34688 commented 4 years ago

I am slightly confused, as all of my WAV files have a length of 22050. I looked at the length of my mixed audio files and they are also the same. I am very confused, as I have no idea where the length of 29440 comes from.

Running the command again throws a different number, as seen below:

tensorflow.python.framework.errors_impl.InvalidArgumentError:  Name: , Key: base_audio_wav, Index: 0.  Number of float values != expected.  values size: 30560 but output shape: [48000]
         [[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
         [[validation_batch/IteratorGetNext]]

I have also had numbers such as 37120, 31040 and 34560.

When creating the TF records, the script saves normalised npy files. Is there supposed to be only one mean/std npy file for both audio and video for each speaker? Or should there be one for each video of the speaker?

E.g. I have: s2_audio_mean.npy s2_audio_std.npy s2_video_mean.npy s2_video_std.npy

Should it be like that? Or should it be more like: s2_l_bbam2p_audio_mean.npy s2_l_bbam2p_audio_std.npy s2_l_bbam2p_video_mean.npy s2_l_bbam2p_video_std.npy

And have that for each video and audio file?

dr-pato commented 4 years ago

I am slightly confused, as all of my WAV files have a length of 22050. I looked at the length of my mixed audio files and they are also the same. I am very confused, as I have no idea where the length of 29440 comes from.

That is a bit strange; check the sample rate of your original WAV files. The sample rate of the WAV files I used is 16 kHz. The WAV lengths all have to be 48000 samples, otherwise you will continue to get the error. So check your input data.
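
If the files turn out not to be 16 kHz, a minimal resampling sketch (assuming the librosa and soundfile packages; the file names are hypothetical):

    import librosa
    import soundfile as sf

    TARGET_SR = 16000

    # Load at the file's native rate, resample to 16 kHz, and write a new file.
    audio, sr = librosa.load("input.wav", sr=None)
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    sf.write("input_16k.wav", audio_16k, TARGET_SR)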

When creating the TF records, the script saves normalised npy files. Is there supposed to be only one mean/std npy file for both audio and video for each speaker? Or should there be one for each video of the speaker?

Yes, in the paper we say that speaker-wise normalization is applied, so the mean and standard deviation are computed using all the audio/video of the same speaker.
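
An illustrative sketch of that per-speaker computation (the directory layout, file names, and helper function are assumptions for illustration, not the repo's exact code):

    import glob
    import os

    import numpy as np

    def save_speaker_stats(speaker_dir, speaker_id, out_dir):
        # Concatenate the features of every clip of one speaker, then
        # compute a single mean/std pair over the whole speaker.
        clips = [np.load(p) for p in sorted(glob.glob(os.path.join(speaker_dir, "*.npy")))]
        features = np.concatenate(clips, axis=0)
        np.save(os.path.join(out_dir, speaker_id + "_audio_mean.npy"), features.mean(axis=0))
        np.save(os.path.join(out_dir, speaker_id + "_audio_std.npy"), features.std(axis=0))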

malineha commented 4 years ago

Hello nanometer, did you solve your issue?

nanometer34688 commented 4 years ago

Unfortunately not!

Using the code below, I have found that the original WAV files do not have a length of 48000 to begin with:

    import soundfile as sf

    f = sf.SoundFile('s2_l_bbim3a.wav')
    print('samples = {}'.format(len(f)))

They come back with lengths ranging between 39360 and 49000. These are the original WAV files downloaded from the dataset that was suggested. Is that right? Does the original data vary in length?

nanometer34688 commented 4 years ago

Yes, in the paper we say that speaker-wise normalization is applied, so the mean and standard deviation are computed using all the audio/video of the same speaker.

So am I correct in understanding that this is used as a mask that filters the mixed speech to output the target speaker?

nanometer34688 commented 4 years ago

Hi @dr-pato

I have managed to fix my issue. Both my single audio and mixed audio files were not consistently 48000 samples long.

I rewrote all the files to 48000 samples and it now seems to be working.
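
A sketch of that rewrite (assuming mono WAV files and the soundfile package; zero-padding short files and truncating long ones is one possible choice):

    import numpy as np
    import soundfile as sf

    TARGET = 48000

    def rewrite_fixed_length(in_path, out_path):
        # Zero-pad short files and truncate long ones to exactly TARGET samples.
        audio, sr = sf.read(in_path)
        if len(audio) < TARGET:
            audio = np.pad(audio, (0, TARGET - len(audio)))
        else:
            audio = audio[:TARGET]
        sf.write(out_path, audio, sr)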

Can I just confirm this:

Yes, in the paper we say that speaker-wise normalization is applied, so the mean and standard deviation are computed using all the audio/video of the same speaker.

So am I correct in understanding that this is used as a mask that filters the mixed speech to output the target speaker?

dr-pato commented 4 years ago

What do you mean? All the models output a time-frequency mask that is multiplied by the mixed-speech spectrogram to obtain the spectrogram of the target speaker.
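
In code terms, the masking step looks roughly like this (a numpy sketch; the shapes and the reuse of the mixture phase are assumptions for illustration, not the repo's exact configuration):

    import numpy as np

    def apply_mask(mixed_stft, mask):
        # mixed_stft: complex STFT of the mixture, shape (freq, time)
        # mask: real-valued time-frequency mask of the same shape
        target_mag = np.abs(mixed_stft) * mask   # element-wise masking
        phase = np.angle(mixed_stft)             # reuse the mixture phase
        return target_mag * np.exp(1j * phase)   # estimated target STFT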