DinoMan / speech-driven-animation


Image Alignment: Demo and Inference #44

Closed avisekiit closed 2 years ago

avisekiit commented 4 years ago

Dear Authors, thanks for open-sourcing the code. I have a couple of questions about face alignment when running the demo and when calculating SSIM.

First, in your README you mention that

va = sda.VideoAnimator(gpu=0)  # Instantiate the animator
vid, aud = va("example/image.bmp", "example/audio.wav")

a) My first question: can we pass any frame with a frontal face of any size, or do we have to use your alignment framework to crop and align the image before sending it to the network?

b) My second question: to calculate the SSIM and MSE reported in your paper, I need the original frames for a given sentence. These frames are not aligned, so do we need to align them first and resize them to [96, 128]? Can we then use the output of your network directly as the predicted set of frames and compare each <original, predicted> pair?

Here is a short snippet I wrote to align the frames of the original sequence (extracted with ffmpeg). Does it look OK?

`import os

import cv2
from skimage import transform as tf

# fa, mean_face, stablePntsIDs, img_size, images, base_dir, store_dir,
# subject and video_name are defined earlier in the script.
for k, image in enumerate(images):
    img_path = os.path.join(base_dir, image)
    img = cv2.imread(img_path)
    print("{} | {} | Aligning: ".format(subject, video_name), img_path)
    src = fa.get_landmarks(img)[0][stablePntsIDs, :]  # stable landmarks of the first detected face
    dst = mean_face[stablePntsIDs, :]  # matching landmarks on the mean face
    tform = tf.estimate_transform('similarity', src, dst)  # find the transformation matrix
    warped = tf.warp(img, inverse_map=tform.inverse, output_shape=img_size)  # warp the frame
    warped = warped * 255  # note: output from warp is a double image (value range [0, 1])
    warped = warped.astype('uint8')
    cv2.imwrite(os.path.join(store_dir, "{}_src.png".format(k).zfill(16)), warped)

print("Done..Frame Alignment")`
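For readers wondering what the similarity estimate in the snippet above computes, here is a minimal NumPy sketch of a least-squares similarity fit (Umeyama's method). It assumes this is close to what `tf.estimate_transform('similarity', src, dst)` does internally; the function name `estimate_similarity` and the sample points are hypothetical, and none of this is the repository's own alignment code.

```python
import numpy as np


def estimate_similarity(src, dst):
    """Least-squares scale, rotation and translation mapping src onto dst.

    src and dst are (N, 2) arrays of corresponding points (for example,
    detected landmarks and mean-face landmarks). Hypothetical helper.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n, dim = src.shape

    # Centre both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the target and source points.
    cov = dst_c.T @ src_c / n
    u, s, vt = np.linalg.svd(cov)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    sign = np.ones(dim)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        sign[-1] = -1.0

    rotation = u @ np.diag(sign) @ vt
    var_src = (src_c ** 2).sum() / n
    scale = (s * sign).sum() / var_src
    translation = mu_dst - scale * rotation @ mu_src
    return scale, rotation, translation


# Usage: recover a known transform from four synthetic points.
theta = 0.5
true_rotation = np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta), np.cos(theta)]])
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = 2.0 * points @ true_rotation.T + np.array([3.0, -1.0])
scale, rotation, translation = estimate_similarity(points, target)
```

A point `p` is then mapped as `scale * rotation @ p + translation`. Using only landmarks that stay rigid during speech (the `stablePntsIDs` in the snippet) keeps mouth motion from disturbing the fit.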

DinoMan commented 2 years ago

Sorry for the super late reply :( I will answer now even though it's probably too late. The model expects aligned images (which is common for most models). The library does the alignment itself if aligned is set to false, so you do not need to do it yourself. And yes, SSIM and MSE assume you are comparing images (or videos) of the same dimensions, so your videos need to be aligned and cropped in the same way.
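To make the same-dimensions requirement concrete, here is a minimal NumPy sketch of a per-frame comparison. The helper names and the (128, 96) frame shape are assumptions for illustration. `global_ssim` is a simplified single-window variant computed over the whole frame, not the sliding-window SSIM normally reported in papers, so its values will not match published numbers.

```python
import numpy as np


def _as_float_pair(original, predicted):
    """Convert both frames to float64 and insist on identical shapes."""
    a = np.asarray(original, dtype=np.float64)
    b = np.asarray(predicted, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(
            "frames must be aligned and cropped to the same shape, "
            "got {} and {}".format(a.shape, b.shape)
        )
    return a, b


def mse(original, predicted):
    """Mean squared error between two frames of the same shape."""
    a, b = _as_float_pair(original, predicted)
    return float(np.mean((a - b) ** 2))


def global_ssim(original, predicted, data_range=255.0):
    """Simplified SSIM using one window covering the whole frame."""
    a, b = _as_float_pair(original, predicted)
    c1 = (0.01 * data_range) ** 2  # stabilising constants from the SSIM definition
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)


# Usage with synthetic stand-ins for an aligned <original, predicted> pair.
rng = np.random.default_rng(0)
original_frame = rng.integers(0, 256, size=(128, 96), dtype=np.uint8)
predicted_frame = np.clip(
    original_frame.astype(np.int16) + rng.integers(-5, 6, size=(128, 96)),
    0, 255,
).astype(np.uint8)

frame_mse = mse(original_frame, predicted_frame)
frame_ssim = global_ssim(original_frame, predicted_frame)
```

Converting to float before subtracting avoids uint8 wrap-around, and the explicit shape check turns a mismatched crop into an error instead of a silently broadcast result.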