ai4r / Gesture-Generation-from-Trimodal-Context

Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (SIGGRAPH Asia 2020)

Changing style on inference #23

Closed zaverichintan closed 3 years ago

zaverichintan commented 3 years ago

Hello, thank you for the amazing paper and code. I am curious whether I can change the style vector f_style mentioned in the paper to EB or EL; how would I go about making this change? Also, the gestures do not work well for custom text, so do you have any suggestions for that?

youngwoo-yoon commented 3 years ago

Thanks for your interest! I suggest you try different vid (i.e., speaker id) values. The current synthesize.py script uses a random vid, so you might run the script multiple times and compare the results to find out which vid works better. The generation result also depends on the speech audio. You can use actual human speech instead of Google TTS (or choose a different voice in Google TTS). In my experience, high-pitched voices gave more energetic gestures.
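
For illustration, here is a minimal editorial sketch of comparing a few fixed speaker ids instead of sampling one at random. It is not the actual synthesize.py code: the candidate ids are arbitrary, and generator, pre_seq, in_text_padded, in_audio, and device are assumed to be prepared the same way the inference code prepares them (the generator call matches the snippet shown later in this thread).

import torch

# Hypothetical sketch: run the same input with several fixed speaker ids.
candidate_vids = [3, 27, 95, 142]  # arbitrary example speaker ids
results = {}
for vid in candidate_vids:
    vid_tensor = torch.LongTensor([vid]).to(device)
    with torch.no_grad():
        out_vec, *_ = generator(pre_seq, in_text_padded, in_audio, vid_tensor)
    # keep each result so the generated motions can be rendered and compared
    results[vid] = out_vec.squeeze(0).cpu().numpy()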

zaverichintan commented 3 years ago

Thanks for your prompt response. I will try changing vid and testing with human audio. Is there any function to traverse the style embedding space where the motion can be controlled, as mentioned in the paper?

youngwoo-yoon commented 3 years ago

No, there isn't an interactive tool. For the paper, I tried every vid on all the test samples and calculated statistics such as mean variance and handedness.
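
Conceptually, though, you could traverse the space by interpolating between two learned speaker embeddings and feeding the interpolated vector as the style input. The sketch below is only a rough illustration: the attribute name speaker_embedding is an assumption (check the model code for the real one), and because the generator normally embeds speaker indices inside its forward(), passing a style vector directly would require a small modification there.

import torch

# Hypothetical sketch: interpolate between the embeddings of two speaker ids.
emb = generator.speaker_embedding        # assumed name of the nn.Embedding over speaker ids
z_a = emb(torch.LongTensor([vid_a]).to(device))
z_b = emb(torch.LongTensor([vid_b]).to(device))

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    z_style = (1 - alpha) * z_a + alpha * z_b
    # Using z_style directly requires changing the generator's forward()
    # to accept a style vector instead of speaker indices.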

zaverichintan commented 3 years ago

Can you point me to the code for handedness? I could see accel in the evaluation code; is that the same thing?

youngwoo-yoon commented 3 years ago

That part is not in the repository.

The following is the code snippet I used to calculate variance and handedness for the figure in the paper. It's not the complete code, but I hope it shows how it works.

import numpy as np
import torch

gesture_variance = []
gesture_handedness = []

for vid in all_vid:
    outputs = []
    for iter_idx, data in enumerate(val_loader, 0):
        in_text, text_lengths, in_text_padded, target_pose, target_vec, in_audio, in_spec, aux_info = data
        batch_size = target_pose.size(0)

        # to gpu
        in_text = in_text.to(device)
        in_text_padded = in_text_padded.to(device)
        in_spec = in_spec.to(device)
        in_audio = in_audio.to(device)
        target_pose = target_pose.to(device)
        target_vec = target_vec.to(device)
        target = target_vec

        # speaker input: use the same speaker id for every sample in the batch
        vid_indices = np.repeat(vid, batch_size)
        vid_indices = torch.LongTensor(vid_indices).to(device)

        # inference
        with torch.no_grad():
            pre_seq = target.new_zeros((target.shape[0], target.shape[1], target.shape[2] + 1))
            pre_seq[:, 0:args.n_pre_poses, :-1] = target[:, 0:args.n_pre_poses]
            pre_seq[:, 0:args.n_pre_poses, -1] = 1  # indicating bit for constraints
            out_vec, *_ = generator(pre_seq, in_text_padded, in_audio, vid_indices)

        out_vec = out_vec.cpu().numpy()
        outputs.append(out_vec)

    # per-speaker statistics (computed inside the vid loop)
    outputs = np.vstack(outputs)
    gesture_variance.append(np.mean(np.var(outputs, axis=1)))

    # handedness: right-arm joint variance minus left-arm joint variance
    right_var = np.var(outputs[:, :, 12:18], axis=1)
    right_var = np.mean(right_var)
    left_var = np.var(outputs[:, :, 21:27], axis=1)
    left_var = np.mean(left_var)
    gesture_handedness.append(right_var - left_var)
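
As a side note (an editorial example, not part of the paper's code), once gesture_variance and gesture_handedness hold one value per speaker id, a simple scatter plot makes it easy to see which speakers are more animated and whether they favor the right or left hand:

import matplotlib.pyplot as plt

# One point per speaker id: x = mean motion variance, y = handedness.
plt.scatter(gesture_variance, gesture_handedness)
plt.xlabel('mean motion variance')
plt.ylabel('handedness (right-arm var minus left-arm var)')
plt.title('Gesture style per speaker id')
plt.savefig('speaker_style_scatter.png')
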
zaverichintan commented 3 years ago

Thanks a lot for the help