Thanks for your interest!
I suggest trying different `vid` (i.e., speaker id) values. The current synthesize.py script uses a random `vid`, so you might run the script multiple times and compare the results to find out which `vid` works better.
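If it helps, here is a minimal sketch of pinning the speaker id to a fixed value instead of sampling it randomly. The variable names are illustrative and may not match synthesize.py exactly; it only shows how a chosen `vid` can be turned into the index tensor the generator expects.

```python
import numpy as np
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

chosen_vid = 17   # an arbitrary example id, not a recommendation
batch_size = 1

# build the speaker index tensor for a fixed vid (same pattern as the snippet below)
vid_indices = torch.LongTensor(np.repeat(chosen_vid, batch_size)).to(device)
# ...pass vid_indices to the generator instead of a randomly drawn id
```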
The generation result also depends on the speech audio. You can use actual human speech instead of Google TTS (or choose different voices in Google TTS). In my experience, high-pitched voices gave more energetic gestures.
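For feeding your own recording, a rough sketch is below; the 16 kHz sample rate and mono assumption are guesses, so check the repo's data loader for the audio format it actually expects.

```python
import librosa
import torch

# load a custom recording as a mono waveform; 16 kHz is an assumed target rate
wav, sr = librosa.load('my_recording.wav', sr=16000, mono=True)
in_audio = torch.from_numpy(wav).float().unsqueeze(0)  # shape (1, n_samples)
# ...use in_audio in place of the TTS waveform when running synthesis
```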
Thanks for your prompt response. I will try changing `vid` and testing with human audio. Is there any function to traverse the style embedding space, as mentioned in the paper, where the motion can be controlled?
No, there isn't an interactive tool. For the paper, I tried every `vid` for all the test samples and calculated some statistics such as mean variance and handedness.
Can you point me to the code for handedness? I could see `accel` in the evaluation code; are they the same?
That part is not in the repository. The following code snippet is what I used to calculate variance and handedness for the figure in the paper. It's not the whole code, but I hope it shows how it works.
```python
import numpy as np
import torch

gesture_variance = []
gesture_handedness = []

for vid in all_vid:
    outputs = []
    for iter_idx, data in enumerate(val_loader, 0):
        in_text, text_lengths, in_text_padded, target_pose, target_vec, in_audio, in_spec, aux_info = data
        batch_size = target_pose.size(0)

        # to gpu
        in_text = in_text.to(device)
        in_text_padded = in_text_padded.to(device)
        in_spec = in_spec.to(device)
        in_audio = in_audio.to(device)
        target_pose = target_pose.to(device)
        target_vec = target_vec.to(device)
        target = target_vec

        # speaker input
        vid_indices = np.repeat(vid, batch_size)
        vid_indices = torch.LongTensor(vid_indices).to(device)

        # inference
        with torch.no_grad():
            pre_seq = target.new_zeros((target.shape[0], target.shape[1], target.shape[2] + 1))
            pre_seq[:, 0:args.n_pre_poses, :-1] = target[:, 0:args.n_pre_poses]
            pre_seq[:, 0:args.n_pre_poses, -1] = 1  # indicating bit for constraints
            out_vec, *_ = generator(pre_seq, in_text_padded, in_audio, vid_indices)

        out_vec = out_vec.cpu().numpy()
        outputs.append(out_vec)

    # per-speaker statistics over all validation samples
    outputs = np.vstack(outputs)
    gesture_variance.append(np.mean(np.var(outputs, axis=1)))  # variance over time, averaged

    right_var = np.var(outputs[:, :, 12:18], axis=1)  # right-arm dims
    right_var = np.mean(right_var)
    left_var = np.var(outputs[:, :, 21:27], axis=1)   # left-arm dims
    left_var = np.mean(left_var)
    gesture_handedness.append(right_var - left_var)
```
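And a small follow-up sketch (not part of the original code) of how those two lists could be summarized afterwards, assuming `all_vid` is a plain Python list of integer speaker ids:

```python
# rank speakers by overall motion variance and split them by handedness sign
variance_arr = np.array(gesture_variance)
handedness_arr = np.array(gesture_handedness)

most_active_vid = all_vid[int(np.argmax(variance_arr))]
print('most active speaker id:', most_active_vid)
print('right-hand-dominant ids:', [v for v, h in zip(all_vid, handedness_arr) if h > 0])
print('left-hand-dominant ids:', [v for v, h in zip(all_vid, handedness_arr) if h < 0])
```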
Thanks a lot for the help!
Hello, thank you for the amazing paper and code. I am curious whether I can change the f_style mentioned in the paper to EB or EL; how would I go about making this change? Also, the gestures do not work well for custom text; do you have any suggestions for that?