For video generation, it takes roughly 17s to generate each 30-second video at the current resolution of 128x128. However, generating N videos will take much less than N x 17s.
In detail, our generation process consists mainly of three parts: 1) loading the pre-trained model onto the GPU, 2) aligning the landmarks of the input image with a given template image, and 3) generating the emotional talking faces. In my experiment, the three parts take 5, 9, and 3 seconds, respectively. If we generate longer videos, the time cost of the first two parts does not change, so generation will still be quick (e.g., about 20s for a 1-min video).
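As a rough illustration of the arithmetic above, the timing can be sketched as a simple cost model. This is only a back-of-the-envelope sketch, not our actual pipeline code: the function name `estimate_seconds` and the `shared_image` flag are hypothetical, and it assumes that the generation step scales linearly with video length and that the model is loaded once for a whole batch.

```python
# Timings reported above, in seconds; treated here as rough constants.
LOAD_MODEL_S = 5.0         # 1) load the pre-trained model onto the GPU
ALIGN_S = 9.0              # 2) align input-image landmarks with the template
GENERATE_S_PER_30S = 3.0   # 3) generate a 30-second talking-face clip


def estimate_seconds(video_seconds, n_videos=1, shared_image=False):
    """Hypothetical estimate of the total wall-clock time for a batch.

    Assumptions (not measured): generation time grows linearly with
    video length, the model is loaded once per batch, and alignment
    runs once per distinct input image.
    """
    generate = GENERATE_S_PER_30S * (video_seconds / 30.0)
    align_runs = 1 if shared_image else n_videos
    return LOAD_MODEL_S + align_runs * ALIGN_S + n_videos * generate


print(estimate_seconds(30))                # one 30-second video
print(estimate_seconds(60))                # one 1-minute video
print(estimate_seconds(30, n_videos=10))   # ten videos, ten different images
```

Under these assumptions, a single 30-second video costs 5 + 9 + 3 = 17s and a 1-minute video costs 5 + 9 + 6 = 20s, which matches the numbers above. Ten 30-second videos from different images would cost about 125s rather than 170s, and fewer still if they share one aligned input image.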