Image Alignment: Demo and Inference

Dear Authors, thanks for open-sourcing the code. I have a couple of questions about face alignment when running the demo and when calculating SSIM.

First, your README gives the following example:

    va = sda.VideoAnimator(gpu=0)  # Instantiate the animator
    vid, aud = va("example/image.bmp", "example/audio.wav")

a) Can we pass any frame containing a frontal face of any size, or do we have to crop and align it with your alignment framework before sending it to the network?

b) To reproduce the SSIM and MSE figures from your paper, I need the original frames for a given sentence. These frames are not aligned, so do we need to align them first and resize them to [96, 128]? And can we then take the output of your network directly as the predicted frames and compare each <original, predicted> pair?
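To make the comparison in (b) concrete, here is a minimal sketch of per-frame scoring, assuming both sequences are already aligned, resized to the same shape, and loaded as grayscale arrays in the 0 to 255 range. The names `mse`, `global_ssim` and `score_sequence` are hypothetical helpers, not part of the repository. The SSIM here is a simplified single-window variant computed from whole-image statistics, so its values will not match a windowed implementation such as `skimage.metrics.structural_similarity`, and it is not necessarily the protocol used in the paper.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two same-shaped frames."""
    # Cast to float first so uint8 subtraction cannot wrap around.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("frames must have the same shape")
    return float(np.mean((a - b) ** 2))


def global_ssim(a, b, data_range=255.0):
    """Simplified SSIM using statistics of the whole image (one window)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("frames must have the same shape")
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)


def score_sequence(originals, predicted):
    """Average MSE and simplified SSIM over paired frames.

    If the sequences differ in length, only the common prefix is scored.
    """
    n = min(len(originals), len(predicted))
    mses = [mse(originals[i], predicted[i]) for i in range(n)]
    ssims = [global_ssim(originals[i], predicted[i]) for i in range(n)]
    return float(np.mean(mses)), float(np.mean(ssims))
```

Calling `score_sequence(aligned_original_frames, predicted_frames)` returns the two averages as a `(mse, ssim)` tuple. Whether frames should be compared in grayscale or per colour channel is an open choice that the sketch leaves to the caller.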

Here is a short snippet I wrote to align the frames of the original sequence (extracted with ffmpeg). Does it look OK?

```python
for k, image in enumerate(images):
    img = cv2.imread(os.path.join(base_dir, image))
    print("{} | {} | Aligning: ".format(subject, video_name), os.path.join(base_dir, image))
    src = fa.get_landmarks(img)[0][stablePntsIDs, :]
    dst = mean_face[stablePntsIDs, :]
    tform = tf.estimate_transform('similarity', src, dst)  # find the transformation matrix
    warped = tf.warp(img, inverse_map=tform.inverse, output_shape=img_size)  # warp the frame
    warped = warped * 255  # note: output from warp is a double image (value range [0, 1])
    warped = warped.astype('uint8')
    cv2.imwrite(store_dir + "/" + "{}_src.png".format(k).zfill(16), warped)

print("Done..Frame Alignment")
```
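For reference, the similarity fit at the heart of the alignment step can be sketched in plain NumPy. This is a minimal least-squares (Umeyama-style) estimate of scale, rotation and translation between two 2D point sets, which is my understanding of what a `'similarity'` transform estimate amounts to; `fit_similarity` and `apply_transform` are hypothetical names used only for illustration, and nothing here is taken from the repository.

```python
import numpy as np


def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst.

    src, dst: arrays of shape (N, 2) with corresponding points.
    Returns a 3x3 homogeneous matrix (scale * rotation plus translation).
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = src.shape[0]

    # Centre both point sets.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d

    # Cross-covariance between destination and source coordinates.
    cov = dst_c.T @ src_c / n
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.ones(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0

    R = U @ np.diag(d) @ Vt
    scale = (S * d).sum() / src_c.var(axis=0).sum()
    t = mu_d - scale * (R @ mu_s)

    M = np.eye(3)
    M[:2, :2] = scale * R
    M[:2, 2] = t
    return M


def apply_transform(M, pts):
    """Apply a 3x3 homogeneous transform to (N, 2) points."""
    pts = np.asarray(pts, dtype=np.float64)
    return pts @ M[:2, :2].T + M[:2, 2]
```

With landmarks in `src` and the matching mean-face points in `dst`, `apply_transform(fit_similarity(src, dst), src)` should land close to `dst`. Restricting the fit to a few stable landmarks keeps mouth and jaw motion from influencing the estimated pose.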
