Closed nanometer34688 closed 4 years ago
Hi @dr-pato,
I'm just wondering if you are able to help me understand this issue?
Thanks
Hi @nanometer34688, you got that error because there is a mismatch between the length you declared (48000) and the actual length of the wav in your tfrecord (29440). You can check whether the lengths (in bytes) of your tfrecords are all equal. If not, I guess something went wrong during generation of the tfrecords.
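As a quick check, you could parse each serialized record and count the float values stored under the `base_audio_wav` key from the error message; every record should come back as 48000. A sketch (the helper name is mine, and the toy record below just mimics the mismatch):

```python
import tensorflow as tf

def wav_length_in_record(serialized):
    """Count the float values stored under 'base_audio_wav' (key name
    taken from the error message) in a serialized SequenceExample."""
    ex = tf.train.SequenceExample()
    ex.ParseFromString(serialized)
    feats = ex.feature_lists.feature_list['base_audio_wav'].feature
    return sum(len(f.float_list.value) for f in feats)

# Toy record mimicking the mismatch: 29440 samples instead of 48000.
toy = tf.train.SequenceExample()
toy.feature_lists.feature_list['base_audio_wav'].feature.add() \
   .float_list.value.extend([0.0] * 29440)
print(wav_length_in_record(toy.SerializeToString()))  # 29440
```

In practice you would loop over `tf.data.TFRecordDataset(path)` and call this on each raw record.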
I am slightly confused, as all the WAV files I have are 22050 samples long. I looked at the length of my mixed audio files and they are the same length too. I am very confused, as I have no idea where the length of 29440 comes from.
Running the command again throws a different number as seen below:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Name: , Key: base_audio_wav, Index: 0. Number of float values != expected. values size: 30560 but output shape: [48000]
[[{{node ParseSingleSequenceExample/ParseSingleSequenceExample}}]]
[[validation_batch/IteratorGetNext]]
I have also had numbers such as 37120, 31040 and 34560.
When creating the TF records, it saves normalised npy files. Is there supposed to be only one mean/std npy file for both audio and video per speaker? Or should there be one for each video of the speaker?
E.g. I have: s2_audio_mean.npy, s2_audio_std.npy, s2_video_mean.npy, s2_video_std.npy
Should it be like that? Or should it be more like: s2_l_bbam2p_audio_mean.npy, s2_l_bbam2p_audio_std.npy, s2_l_bbam2p_video_mean.npy, s2_l_bbam2p_video_std.npy
And have that for each video and audio file?
I am slightly confused, as all the WAV files I have are 22050 samples long. I looked at the length of my mixed audio files and they are the same length too. I am very confused, as I have no idea where the length of 29440 comes from.
It is a bit strange; check the sample rate of your original wav files. The wav files I used have a sample rate of 16 kHz. The wav lengths all have to be 48000, otherwise you will keep getting the error. So check your input data.
When creating the TF records, it saves normalised npy files. Is there supposed to be only one mean/std npy file of both audio and video for each speaker? Or should there be one for each video of the speaker?
Yes, in the paper we say that speaker-wise normalization is applied. So mean and standard deviation is computed using all audios/videos of the same speaker.
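So, per speaker, something like the following sketch (assumed shapes: frames × features; this is illustrative, not the repo's exact code):

```python
import numpy as np

def speaker_stats(features_per_clip):
    """Compute one mean/std pair from ALL clips of a speaker, so a single
    pair of npy files (e.g. s2_audio_mean.npy / s2_audio_std.npy) is
    saved per speaker, not one per video."""
    stacked = np.concatenate(features_per_clip, axis=0)  # stack over time
    return stacked.mean(axis=0), stacked.std(axis=0)

# Two clips of one speaker, 257 spectral features each (illustrative sizes):
clips = [np.random.randn(100, 257), np.random.randn(120, 257)]
mean, std = speaker_stats(clips)
normalized_first = (clips[0] - mean) / std
```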
Hello nanometer, did you solve your issue?
Unfortunately not!
Using the code below, I have found that the original WAV files do not have a length of 48000 to begin with:
import soundfile as sf

f = sf.SoundFile('s2_l_bbim3a.wav')
print('samples = {}'.format(len(f)))
They come back in a range between 39360 and 49000. These are the original WAV files downloaded from the dataset that was suggested. Is that right? Does the original data vary in length?
Yes, in the paper we say that speaker-wise normalization is applied. So mean and standard deviation is computed using all audios/videos of the same speaker
So am I correct in my understanding that this is applied as a mask over the mixed speech to output the target speaker?
Hi @dr-pato
I have managed to fix my issue. Both my single audio and my mixed audio files were not consistently 48000 samples long.
I rewrote all the files to 48000 samples and it now seems to be working.
Can I just confirm this:
Yes, in the paper we say that speaker-wise normalization is applied. So mean and standard deviation is computed using all audios/videos of the same speaker
So am I correct in my understanding that this is applied as a mask over the mixed speech to output the target speaker?
What do you mean? All the models output a time-frequency mask that is multiplied by the mixed-speech spectrogram to obtain the spectrogram of the target speaker.
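In other words, something like this (illustrative shapes, magnitude spectrograms assumed):

```python
import numpy as np

# The model predicts a time-frequency mask; applied element-wise to the
# mixture's magnitude spectrogram it yields an estimate of the target
# speaker's spectrogram.
mixed_spec = np.abs(np.random.randn(257, 300))  # |STFT| of the mixed speech
mask = np.random.rand(257, 300)                 # model output, one weight per TF bin
target_spec = mask * mixed_spec                 # estimated target-speaker spectrogram
```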
Running the training command:
python3 av_speech_enhancement.py training --data_dir data/tf_records/ --train_set TRAINING_SET --val_set VALIDATION_SET --exp 1 --mode fixed -ns 48000 --model vl2m --opt adam -lr 0.005 -nl 1 -nh 1 -d 1
Gives me the error below
Am I doing something wrong?
I have already created the TF records ready for training.
Am I right in assuming that num_audio_samples is the length of the audio output during training?