baiyuting closed this issue 1 year ago
The paper says the model cannot generate long videos, so how long are the videos generated by the current model? And is this a common problem for all talking-head generation methods that are given only an image and an audio clip?

We managed to generate videos up to 9 s long.

The problem is caused by the lack of a driving video (or any other source of motion) and by the iterative generation process, in which errors propagate from previously synthesized frames. For a model trained on full sequences the problem is less likely to appear, but handling the additional time axis usually requires much more compute. Achieving temporal consistency is then also more challenging.
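To make the error-propagation point concrete, here is a minimal, self-contained Python sketch. Everything in it is an illustrative stand-in, not the repository's actual code: `fake_generator` mimics a learned frame generator, the 64x64 frames are toy data, and 25 fps over 225 frames is an assumed frame rate for a ~9 s clip.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_generator(prev_frame, audio_chunk, noise_std=0.01):
    """Stand-in for a learned frame generator: it copies the previous
    frame, applies an audio-driven change, and adds a small
    per-step reconstruction error (the noise)."""
    return prev_frame + audio_chunk + rng.normal(0.0, noise_std, prev_frame.shape)

source_image = np.zeros((64, 64))            # toy 64x64 "source frame"
audio_features = [np.zeros((64, 64))] * 225  # ~9 s at an assumed 25 fps

frames, prev = [], source_image
for chunk in audio_features:
    # The model only ever sees its own previous output, so each step's
    # error is fed back in and carried forward: drift compounds with length.
    prev = fake_generator(prev, chunk)
    frames.append(prev)

# For i.i.d. per-step noise, drift from the source grows roughly like sqrt(t).
for t in (25, 100, 225):
    print(f"frame {t:3d}: mean |drift| = {np.abs(frames[t - 1]).mean():.3f}")
```

Running this shows the mean deviation from the source image growing steadily with frame index, which is the compounding-error behavior described above; a model trained on full sequences avoids this feedback loop but pays for it in compute along the extra time axis.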