GestureGeneration / Speech_driven_gesture_generation_with_autoencoder

This is the official implementation of the IVA '19 paper "Analyzing Input and Output Representations for Speech-Driven Gesture Generation".
https://svito-zar.github.io/audio2gestures/
Apache License 2.0

Motion is longer than audio on gesture viewer. #4

Closed · wubowen416 closed this issue 4 years ago

wubowen416 commented 4 years ago

Thank you for your patience with my previous question; I hope you can help me again!

After predicting from the encoded file DATA_DIR/325/Y_test_encoded.npy and removing velocity, I loaded the corresponding audio and motion files into the gesture viewer. It works, but the motion appears to be longer than the audio. Do you know why this happens? Or should I check whether I made a mistake with the file names?

Also, I noticed that during decoding with decode.py there is:

print(encoding.shape)

# Decode it
decoding = tr.decode(nn, encoding)

print(decoding.shape)

Since the shapes are printed, I expected the first dimension of the encoding and the decoding to be the same, since it represents time steps, but they were different. Is this the reason why the audio and motion are not the same length?

wubowen416 commented 4 years ago

Even the original one doesn't match...

I used convert_original.py to get the original motion files, changing these arguments:

parser.add_argument('--data', '-d', default='/home/wu/projects/motiong/Speech_driven_gesture_generation_with_autoencoder/data/test/labels',
                        help='Path to the original test motion data directory')

parser.add_argument('--out', '-o', default='/home/wu/projects/motiong/Speech_driven_gesture_generation_with_autoencoder/evaluation/data/original',
                        help='Directory to store the resultant position files')

which are the directories on my computer.

Then I paired the audio with the original motion, for example audio1094.wav and gesture1094.txt, but they are not the same length.

Svito-zar commented 4 years ago

For decode.py - I think the difference in length comes from the large batch size used during decoding. Did you use -batch_size=8 when calling decode.py? I suggest comparing the difference in length with the batch size.
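For example, something like this (my own sketch, not part of the repo, assuming the last incomplete batch is dropped during decoding) would show whether the length difference matches the batch size:

import numpy as np

# Sketch: if decode.py drops the last incomplete batch, the decoded length
# should be the encoded length rounded down to a multiple of the batch size.
encoding = np.load('DATA_DIR/325/Y_test_encoded.npy')  # path from the issue
batch_size = 8                                          # value passed to decode.py

expected = (len(encoding) // batch_size) * batch_size
print('encoded frames:          ', len(encoding))
print('expected decoded frames: ', expected)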

As for convert_original.py - I need more information to help you with it. Can you, please, provide the lengths of both sequences?

wubowen416 commented 4 years ago

Thank you for your reply.

I checked my command, and the batch size for decode.py was 8. But I haven't checked what changes when the batch size changes; I will check it.

As for the other point, do you mean the lengths of the converted motion file and the .wav file?

Svito-zar commented 4 years ago

Yes. Could you, please, provide the lengths of the motions and audios you are obtaining, in both cases: with the model and with the ground-truth conversion?

wubowen416 commented 4 years ago

With this code

import numpy as np

gen = '../data/generated_gesture/no_vel/X_test_audio1094_fps20.txt'
gro = "../evaluation/data/original/gesture1094.txt"
audio = '../data/test_inputs/X_test_audio1094.npy'

gen = np.loadtxt(gen)
gro = np.loadtxt(gro)
audio = np.load(audio)

print('shape of generated:(without velocity)', gen.shape)
print('shape of groundtruth:(without velocity)', gro.shape)
print('shape of audio:', audio.shape)

the output is this:

shape of generated:(without velocity) (240, 192)
shape of groundtruth:(without velocity) (245, 192)
shape of audio: (243, 61, 26)

Svito-zar commented 4 years ago

I think a difference of a few frames is negligible, and we simply ignored it in our experiments. So in your case we would crop all the sequences to 240 frames. The different lengths do not mean misalignment; it is just a processing artifact.

For the generated motion, the sequence length will always be a multiple of the batch size; in this case 240 is a multiple of 8. If you want to match the length exactly, you need to choose a batch size of 1 (that would make testing slower, of course).
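If you just need aligned sequences for the viewer, a simple cropping helper like this (my own sketch, not repo code) is enough:

import numpy as np

def crop_to_common_length(*sequences):
    # Crop every sequence along the time axis (axis 0) to the shortest one.
    min_len = min(len(seq) for seq in sequences)
    return [seq[:min_len] for seq in sequences]

gen = np.loadtxt('../data/generated_gesture/no_vel/X_test_audio1094_fps20.txt')
gro = np.loadtxt('../evaluation/data/original/gesture1094.txt')
audio = np.load('../data/test_inputs/X_test_audio1094.npy')

gen, gro, audio = crop_to_common_length(gen, gro, audio)
print(gen.shape, gro.shape, audio.shape)  # all cropped to the same 240 frames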

As for the ground-truth processing - I am not sure what exactly is happening. I think it has something to do with the processing as well. Maybe @aoikaneko can shed some light on that.

aoikaneko commented 4 years ago

Hi wubowen416, thank you for your interest in our work.

shape of groundtruth:(without velocity) (245, 192)
shape of audio: (243, 61, 26)

I think the 2-frame difference comes from the sliding window in the MFCC computation, as in create_vector.py. When we convert the original motion to .txt, we do not use the length of the input audio features (we just look at the BVH file). So there is a difference because we do not cut them to the same length in convert_original.py.
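To illustrate the general effect (a toy example with a made-up window width, not the actual create_vector.py code): a sliding window of width w over T frames yields T - w + 1 windows, so the feature sequence ends up a few frames shorter than the raw frame count.

import numpy as np

def make_context_windows(features, window=3):
    # Stack `window` consecutive frames per output step (hypothetical width).
    T = len(features)
    return np.stack([features[i:i + window] for i in range(T - window + 1)])

mfcc = np.random.randn(245, 26)          # e.g. 245 frames of 26-dim features
windows = make_context_windows(mfcc, 3)  # -> (243, 3, 26): 2 frames shorter
print(windows.shape)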

Svito-zar commented 4 years ago

@wubowen416, is your question fully answered? If so, feel free to close this issue.

wubowen416 commented 4 years ago

Yes, your answers are great! Thank you, guys! I see what is happening!