Closed HarryXD2018 closed 6 months ago
If someone is trying to reproduce the video above, here is my render code.
```python
from matplotlib import pyplot as plt
import cv2
import numpy as np
from tqdm import tqdm
import os


def plot_3d(scatters, frame_id):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    # set the axis limits (uses the module-level `border`)
    ax.set_xlim(border[0], border[1])
    ax.set_ylim(border[2], border[3])
    ax.set_zlim(border[4], border[5])
    ax.set_box_aspect([1, 1, 1])
    ax.scatter(scatters[0], scatters[1], scatters[2])
    # save the plot as an image
    plt.savefig('./rendered/lips_{:06d}.png'.format(frame_id))
    plt.close()


if __name__ == '__main__':
    npy_file_name = 'lips_vertice_causal.npy'
    save_name = 'lips_causal.avi'
    if os.path.exists('./rendered'):
        os.system('rm -r ./rendered')
    os.makedirs('./rendered')
    data = np.load(npy_file_name).reshape(-1, 338, 3)
    x, y, z = data[..., 0], data[..., 1], data[..., 2]
    border = [x.min(), x.max(), y.min(), y.max(), z.min(), z.max()]
    for idx in tqdm(range(x.shape[0]), desc='plotting'):
        plot_3d([x[idx, :], y[idx, :], z[idx, :]], idx)
    # stitch the rendered frames into a video
    img = cv2.imread('./rendered/lips_000000.png')
    h, w, _ = img.shape
    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    video = cv2.VideoWriter(save_name, fourcc, 30, (w, h))
    for i in tqdm(range(x.shape[0]), desc='writing video'):
        video.write(cv2.imread('./rendered/lips_{:06d}.png'.format(i)))
    video.release()
    cv2.destroyAllWindows()
```
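Side note: the per-frame PNG round trip through disk can be skipped by rendering each figure straight to a NumPy array; here is a minimal sketch (the helper name `fig_to_array` is mine, not from the repo):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, so no display is needed
import numpy as np
from matplotlib import pyplot as plt


def fig_to_array(fig):
    # Render the figure on its Agg canvas and copy the RGBA pixel buffer.
    fig.canvas.draw()
    rgba = np.asarray(fig.canvas.buffer_rgba())
    return rgba[..., :3].copy()  # drop the alpha channel -> RGB


fig = plt.figure()
fig.add_subplot(111, projection='3d')
frame = fig_to_array(fig)
plt.close(fig)
print(frame.shape)  # e.g. (480, 640, 3) at the default figure size/DPI
```

Note that matplotlib gives RGB while `cv2.VideoWriter.write` expects BGR, so you would pass something like `np.ascontiguousarray(frame[..., ::-1])` to the writer.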
Hi and thanks for your interest in our work! To your questions:
1. Yeah, it's not really an encoder-decoder structure; it's just a single straight-through network. As you can see, the transformer decoder does not receive informative input, only zeros: https://github.com/facebookresearch/audio2photoreal/blob/3a94699243ff66255398532f1705b0b31e0e1ae7/model/diffusion.py#L74 In other words, the audio-to-lip module is just a regressor that maps wav2vec features to vertex space with a few transformer-style operations. The architecture looks a bit confusing, which is an artifact of earlier experiments that tried to make the module an actual diffusion model rather than a regressor. You can just ignore this :)
2. No attention mask. Correct. Our whole framework is an acausal model, so there is no need to induce causality in the audio encodings or in the lip regressor.
3. Lip vertex visualization. The model doesn't predict the vertices in their original vertex space, but in a z-normalized space (so each vertex has zero mean and unit variance). If you want to see the actual lip vertices, you'd have to revert that transformation. Is this something you need? In that case I could try to dig it up for you.
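In case it helps, reverting a z-normalization is just the inverse affine map, assuming you have the per-vertex mean and standard deviation used for normalization. A minimal sketch with placeholder statistics (`vert_mean` / `vert_std` are made-up names, not from the released code):

```python
import numpy as np

# Placeholder statistics: in practice these would be the per-vertex mean and
# standard deviation computed over the training data, shape (338, 3).
rng = np.random.default_rng(0)
vert_mean = rng.standard_normal((338, 3))
vert_std = np.abs(rng.standard_normal((338, 3))) + 1e-3

# `preds` stands in for the model's z-normalized output, shape (T, 338, 3).
preds = rng.standard_normal((16, 338, 3))

# Invert the z-normalization: x = z * std + mean (broadcast over time).
verts = preds * vert_std + vert_mean

# Round-tripping back to the normalized space recovers the predictions.
recovered = (verts - vert_mean) / vert_std
print(np.allclose(recovered, preds))  # True
```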
Thank you so much for replying. I guess transforming to the z-normalized space helps training, is that correct?
If convenient, I would appreciate it if you could show me how to revert to the original vertex space.
Many thanks :)
Hey! The version of the lip regressor used here actually applies a more complex decoding from the lip vertex space that you plotted, which we unfortunately can't provide publicly, since it would let you render lip information of participants who are not approved for public release. Sorry :(
Thanks for replying! Closing the issue, as the lip regressor is making a reasonable prediction.
Hi, this is really nice work with wonderful results, and thanks for open-sourcing it.
However, I ran into a problem while reading and learning from the code: in the lip regressor module, an encoder-decoder structure built on the pre-trained wav2vec2 is designed. A few things confuse me:
1. `causal` is set to `False` by default. Since it was proposed in FaceFormer, a causal attention mask has been adopted in many follow-up works to add an inductive bias, so it confused me why you chose not to use it even though the `causal` parameter is implemented: https://github.com/facebookresearch/audio2photoreal/blob/3a94699243ff66255398532f1705b0b31e0e1ae7/model/diffusion.py#L274
2. As a result, the visualizations of the 338-vertex sequences do not look good. Here are some examples (30 fps) I saved when running `python -m demo.demo`, where the save-to-numpy command is inserted after https://github.com/facebookresearch/audio2photoreal/blob/3a94699243ff66255398532f1705b0b31e0e1ae7/model/diffusion.py#L309

https://github.com/facebookresearch/audio2photoreal/assets/42205546/7eb5b001-a448-4bcd-86a9-10fa3c577c33
https://github.com/facebookresearch/audio2photoreal/assets/42205546/002be962-ae05-430d-971f-326c564cb721
I also tried setting `causal = True`, and the result is shown below.

https://github.com/facebookresearch/audio2photoreal/assets/42205546/00652634-42c3-4725-8537-40935882d35e
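For context, the FaceFormer-style causal attention mask simply blocks each frame from attending to future frames; a minimal NumPy sketch of that mask (the repo's actual implementation may differ):

```python
import numpy as np

# Causal (look-ahead) mask for a sequence of T frames: entry (i, j) is True
# when position i must NOT attend to position j, i.e. whenever j > i.
T = 4
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
print(mask.astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```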
I also checked the input audio recorded by my microphone (clips of about 5-7 s); all of the inputs are spoken in English.
Please help me out if you have any ideas; thanks in advance.