facebookresearch / audio2photoreal

Code and dataset for photorealistic Codec Avatars driven from audio

The lips regressor predicts unexpected results #50

Closed HarryXD2018 closed 6 months ago

HarryXD2018 commented 7 months ago

Hi, what a nice piece of work with such wonderful results, and thanks for open-sourcing it.

However, I ran into a problem while trying to read and learn from the code. The lips regressor module uses an encoder-decoder structure built on top of the pre-trained wav2vec2 model, and a few things confuse me:

  1. Usually, the contextual features extracted by wav2vec2 are fed directly into a decoder (e.g., a TCN in SHOW or a Transformer decoder in FaceFormer); why is an encoder-decoder structure necessary here?
  2. No attention mask by default. In the code linked below, causal is set to False by default. Since it was proposed in FaceFormer, a causal attention mask has been adopted in many follow-up works to add an inductive bias (see the sketch right after this list), so it confuses me why you chose not to use one even though the causal parameter is implemented. https://github.com/facebookresearch/audio2photoreal/blob/3a94699243ff66255398532f1705b0b31e0e1ae7/model/diffusion.py#L274
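
To make point 2 concrete, this is the kind of causal (lower-triangular) attention mask I mean, as used in FaceFormer-style decoders; a minimal PyTorch sketch of my own, not code from this repo:

import torch

def causal_attention_mask(seq_len: int) -> torch.Tensor:
    # Additive mask: 0.0 where attention is allowed, -inf where it is blocked,
    # so frame t can only attend to frames <= t. Equivalent to
    # torch.nn.Transformer.generate_square_subsequent_mask(seq_len).
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)

# e.g. pass as tgt_mask / attn_mask to nn.TransformerDecoder or nn.MultiheadAttention
print(causal_attention_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])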

As a result, the visualizations of the 338-vertex sequences do not look right. Here are some examples (30 fps) I saved while running python -m demo.demo, with a save-to-numpy command inserted after https://github.com/facebookresearch/audio2photoreal/blob/3a94699243ff66255398532f1705b0b31e0e1ae7/model/diffusion.py#L309
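
For reference, the save-to-numpy command is nothing more than a small helper along these lines (the helper name and the assumption that the prediction is a torch tensor of shape [frames, 338 * 3] are mine; adjust it to whatever variable is available at that point in diffusion.py):

import numpy as np
import torch

def save_vertices(pred: torch.Tensor, path: str = 'lips_vertice.npy') -> None:
    # Dump the predicted lip vertices to disk so they can be re-loaded and
    # plotted offline (see the render script in my next comment).
    np.save(path, pred.detach().cpu().numpy())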

https://github.com/facebookresearch/audio2photoreal/assets/42205546/7eb5b001-a448-4bcd-86a9-10fa3c577c33

https://github.com/facebookresearch/audio2photoreal/assets/42205546/002be962-ae05-430d-971f-326c564cb721

I also tried to set causal = True, and the result is shown below.

https://github.com/facebookresearch/audio2photoreal/assets/42205546/00652634-42c3-4725-8537-40935882d35e

I also checked the input audio recorded with my microphone (clips of about 5-7 s); all of the inputs are spoken in English.

Please help me out if you have any ideas; thanks in advance.

HarryXD2018 commented 7 months ago

If someone is trying to reproduce the video above, here is my render code.

from matplotlib import pyplot as plt
import cv2
import numpy as np
from tqdm import tqdm
import os

def plot_3d(scatters, frame_id):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    # set the ax border
    ax.set_xlim(border[0], border[1])
    ax.set_ylim(border[2], border[3])
    ax.set_zlim(border[4], border[5])
    ax.set_box_aspect([1, 1, 1])

    ax.scatter(scatters[0], scatters[1], scatters[2])
    # save the plot as image
    plt.savefig('./rendered/lips_{:06d}.png'.format(frame_id))
    plt.close()

if __name__ == '__main__':
    npy_file_name = 'lips_vertice_causal.npy'
    save_name = 'lips_causal.avi'

    if os.path.exists('./rendered'):
        os.system('rm -r ./rendered')
    os.makedirs('./rendered')

    data = np.load(npy_file_name).reshape(-1, 338, 3)
    # print(data.shape)
    x, y, z = data[..., 0], data[..., 1], data[..., 2]
    # print(x.shape, y.shape, z.shape)
    border = [x.min(), x.max(), y.min(), y.max(), z.min(), z.max()]
    # print(border)
    for idx in tqdm(range(x.shape[0]), desc='plotting'):
        plot_3d([x[idx, :], y[idx, :], z[idx, :]], idx)

    # save as video
    img = cv2.imread('./rendered/lips_000000.png')
    h, w, _ = img.shape
    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    video = cv2.VideoWriter(save_name, fourcc, 30, (w, h))
    for i in tqdm(range(x.shape[0]), desc='writing video'):
        video.write(cv2.imread('./rendered/lips_{:06d}.png'.format(i)))
    video.release()
    cv2.destroyAllWindows()

alexanderrichard commented 7 months ago

Hi and thanks for your interest in our work! To your questions:

  1. Yeah, it's not really an encoder-decoder structure; it's just a single straight-through network. As you can see, the transformer decoder does not receive informative input, only zeros: https://github.com/facebookresearch/audio2photoreal/blob/3a94699243ff66255398532f1705b0b31e0e1ae7/model/diffusion.py#L74 In other words, the audio-to-lip module is just a regressor that maps wav2vec features to vertex space with a few transformer-style operations (a schematic sketch follows after this list). The architecture looks a bit confusing, which is an artifact of earlier experiments aimed at making the module an actual diffusion model rather than a regressor. You can just ignore this :)

  2. No attention mask. Correct. Our whole framework is an acausal model, so there is no need to induce causality in the audio encodings or in the lip regressor.

  3. Lip vertex visualization. The model doesn't predict the vertices in their original vertex space, but in a z-normalized space (each vertex has zero mean and unit variance). If you want to see the actual lip vertices, you have to revert that transformation; a generic sketch of the inverse follows after this list. Is this something you need? In that case I could try to dig it up for you.
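
To illustrate point 1, here is a minimal, schematic sketch of such a straight-through regressor (toy PyTorch code, not the code in model/diffusion.py; layer sizes are made up): the decoder's target sequence is all zeros, so the output depends on the audio features only through cross-attention.

import torch
import torch.nn as nn

class LipRegressorSketch(nn.Module):
    def __init__(self, audio_dim: int = 768, model_dim: int = 256, n_verts: int = 338):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        layer = nn.TransformerDecoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_verts = nn.Linear(model_dim, n_verts * 3)

    def forward(self, wav2vec_feats: torch.Tensor) -> torch.Tensor:
        # wav2vec_feats: [batch, frames, audio_dim] pre-extracted wav2vec2 features
        memory = self.audio_proj(wav2vec_feats)
        # Decoder input is all zeros: every bit of information flows in through
        # cross-attention to the audio features, so this is effectively a
        # straight-through audio-to-vertex regressor (and acausal: no tgt_mask).
        tgt = torch.zeros_like(memory)
        return self.to_verts(self.decoder(tgt, memory))  # [batch, frames, 338 * 3]

feats = torch.randn(1, 90, 768)            # ~3 s of audio frames at 30 fps
print(LipRegressorSketch()(feats).shape)   # torch.Size([1, 90, 1014])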
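
And for point 3, reverting a plain per-vertex z-normalization would look like the sketch below, assuming you had the per-vertex mean/std statistics from the training data (hypothetical arrays; they are not shipped with the public release):

import numpy as np

def denormalize_vertices(z_normed: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # z_normed: [frames, 338, 3] predictions in the z-normalized space
    # mean, std: [338, 3] per-vertex statistics (hypothetical, not publicly available)
    # Inverse of z = (v - mean) / std  =>  v = z * std + mean
    return z_normed * std[None] + mean[None]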

HarryXD2018 commented 7 months ago

Thank you so much for replying. I guess transforming to the z-normalized space benefits training; is that correct?

If convenient, I would appreciate it if you could show me how to revert to the original vertex space.

Many thanks :)

alexanderrichard commented 6 months ago

Hey! The version of the lip regressor used here actually uses a more complex decoding from the lip vertex space you plotted, which we unfortunately can't provide publicly, since it would allow rendering lip information of participants who are not approved for public release. Sorry :(

HarryXD2018 commented 6 months ago

Thanks for the reply! Closing the issue, as the lips regressor is making reasonable predictions.