Closed wubowen416 closed 4 years ago
Even the original one doesn't match...
I used convert_original.py to get the original motion files by changing these:
parser.add_argument('--data', '-d', default='/home/wu/projects/motiong/Speech_driven_gesture_generation_with_autoencoder/data/test/labels',
help='Path to the original test motion data directory')
parser.add_argument('--out', '-o', default='/home/wu/projects/motiong/Speech_driven_gesture_generation_with_autoencoder/evaluation/data/original',
help='Directory to store the resultant position files')
which are the directories on my computer.
Then I paired the audio with the original motion, for example audio1094.wav and gesture1094.txt; however, they do not have the same length.
For decode.py - I think the difference in length comes from the large batch size used during decoding.
Did you use -batch_size=8
when calling decode.py?
I suggest comparing the difference in length with the batch size.
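To make that comparison concrete, here is a minimal sketch of the arithmetic (the truncation rule and the lengths are assumptions for illustration, not the repository's actual code):

```python
def expected_decoded_length(input_length, batch_size):
    """If decoding drops the last incomplete batch, the output length is
    the largest multiple of batch_size that fits in the input length."""
    return (input_length // batch_size) * batch_size

# Hypothetical lengths for illustration
print(expected_decoded_length(243, 8))  # -> 240
```

If the observed length difference equals `input_length - expected_decoded_length(input_length, batch_size)`, the batch size is the likely cause.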
As for the convert_original.py - I need more information to help you with it. Can you, please, provide the length of both sequences?
Thank you for your reply.
I checked my command, and the batch size for decode.py was 8. But I didn't check what changes when the batch size changes; I will check it.
As for the other point, do you mean the sequences of the converted motion file and the .wav file?
Yes. Could you, please, provide the length of the motions and audios you are obtaining? In both cases: with the model and with the ground-truth conversion.
With this code
```python
import numpy as np

# Paths to the generated motion, the converted ground truth, and the audio features
gen = '../data/generated_gesture/no_vel/X_test_audio1094_fps20.txt'
gro = "../evaluation/data/original/gesture1094.txt"
audio = '../data/test_inputs/X_test_audio1094.npy'

gen = np.loadtxt(gen)
gro = np.loadtxt(gro)
audio = np.load(audio)

print('shape of generated:(without velocity)', gen.shape)
print('shape of groundtruth:(without velocity)', gro.shape)
print('shape of audio:', audio.shape)
```
the output is this:
```
shape of generated:(without velocity) (240, 192)
shape of groundtruth:(without velocity) (245, 192)
shape of audio: (243, 61, 26)
```
I think the difference in a few frames is negligible and we simply ignored it in our experiments. So in your case we would crop all the sequences to 240 frames. The different length does not mean misalignment, but it is rather just a processing artifact.
For the generated motion, the sequence length will always be a multiple of the batch size; in this case, 240 is a multiple of 8. If you want to match the length exactly, you need to choose a batch size of 1 (that would make testing slower, of course).
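For reference, a minimal sketch of cropping all sequences to a common length (the dummy arrays mirror the shapes reported above; this is not code from the repository):

```python
import numpy as np

def crop_to_common_length(*seqs):
    """Crop sequences (time on axis 0) to the length of the shortest one."""
    n = min(s.shape[0] for s in seqs)
    return [s[:n] for s in seqs]

# Dummy arrays with the shapes reported above
gen = np.zeros((240, 192))
gro = np.zeros((245, 192))
audio = np.zeros((243, 61, 26))

gen, gro, audio = crop_to_common_length(gen, gro, audio)
print(gen.shape[0], gro.shape[0], audio.shape[0])  # 240 240 240
```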
As for the ground truth processing - I am not sure what exactly is happening. I think it has something to do with the processing as well. Maybe @aoikaneko can shed some light on that.
Hi wubowen416, thank you for your interest in our work.
> shape of groundtruth:(without velocity) (245, 192)
> shape of audio: (243, 61, 26)
I think that the 2-frame difference comes from the sliding window used in the MFCC computation, as in create_vector.py. When we convert the original motion to .txt, we do not use the length of the input audio feature (we just look at a BVH file). So there is a difference because we do not cut them to the same length in convert_original.py.
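The frame-count arithmetic of such a sliding window can be sketched as follows (the window length and stride here are hypothetical, not the actual values from create_vector.py):

```python
def n_windows(n_input_frames, window_len, stride=1):
    """Number of full windows of length window_len obtained by sliding
    with the given stride over n_input_frames frames (no padding)."""
    return (n_input_frames - window_len) // stride + 1

# With no padding, a (hypothetical) window of length 3 over 245 frames
# yields 243 windows - a 2-frame shortening like the one observed above.
print(n_windows(245, 3))  # -> 243
```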
@wubowen416 , is your question answered fully? If so - feel free to close this issue
Yes, your answers are amazing! Thank you, guys! I see what is happening!
Thank you for your patience with the previous question; I hope you can help me again!
After predicting from the encoded file DATA_DIR/325/Y_test_encoded.npy and removing velocity, I loaded the corresponding audio and motion files into the gesture viewer. It works; however, the motion appears to be longer than the audio. Do you know why this happens? Or should I check whether I got a file name wrong?
Also, I noticed the following while decoding with decode.py:
Since the shapes are printed, I think the first dimension of the encoding and the decoding should be the same, since they both represent time steps, but they were different. Is this why the audio and motion do not have the same length?