Hi, thank you for releasing the code!
I have a few questions about the sequential global-flow local-attention model. In Section IV.B, the model is used to generate video clips in a recurrent manner. The input consists of a source image and the pose sequence of a driving video; the output should be a video with the appearance of the source image and the poses of the driving video. My questions are:
1) According to Fig. 5 (in the TIP version of the paper), at each time step x_s is not changed (it serves as the source appearance), and its appearance features are injected into the generation at every time step? Moreover, the pose estimation result of the source image, i.e., p_s, is not modified by the motion extraction network (Section IV.A), is that correct? Since the inputs of F_s and F_p are different, do they share weights or are they independent network structures?
2) At the beginning, there is no 'previous' generated image, right? So I wonder: what is \hat{x}_t^{k-1} at the initial time step? Is it the source image? I am confused about how sequential GFLA starts. Could you please provide more details on the training and testing process of the sequential GFLA model?
3) If a source image and the poses of two consecutive frames can be used to synthesize the current frame, and the current generated frame can be used to produce the next frame (in a recurrent manner), why must the source image and source pose always be involved? Are they used to enhance/preserve the appearance features, or are there other reasons?
Thank you!
x_s will not be changed. p_s is not modified by the motion extraction network. F_s and F_p do not share weights.
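For illustration, here is a minimal sketch of what "do not share weights" means in practice. The layer choices are hypothetical (the actual F_s / F_p architectures are defined in the repository); the point is only that the two encoders are separate instances:

```python
import torch.nn as nn

def make_encoder(in_channels):
    # Hypothetical downsampling encoder; the real F_s / F_p
    # architectures live in the GFLA repository.
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=7, stride=1, padding=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True),
    )

# Two separate instances: F_s encodes the source image (3 RGB channels),
# F_p encodes pose maps (e.g., 18 keypoint heatmap channels, an assumption).
# Calling make_encoder twice creates independent parameters, so no weights
# are shared between F_s and F_p.
F_s = make_encoder(in_channels=3)
F_p = make_encoder(in_channels=18)
```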
At the beginning, the previously generated image is the source image.
The source image is always used at each generation step since it provides the most accurate appearance information we have. At the start of training, the model cannot generate meaningful results, so feeding the source image helps stabilize training.
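Putting the answers together, the recurrent generation at test time can be sketched roughly as below. This is a hedged outline, not the repo's actual API: `generator` stands in for the sequential GFLA model, and the argument names are hypothetical.

```python
import torch

def generate_video(generator, x_s, p_s, driving_poses):
    """Recurrent generation sketch: x_s / p_s are re-fed at every step
    (they carry the most reliable appearance information), and the
    'previous' frame is initialized with the source image itself.

    Hypothetical interface; the real one lives in the GFLA repo.
    """
    frames = []
    x_prev = x_s  # no generated frame yet at the first step,
    p_prev = p_s  # so the source image (and its pose) stand in for it
    for p_t in driving_poses:
        with torch.no_grad():
            x_t = generator(x_s, p_s, x_prev, p_prev, p_t)
        frames.append(x_t)
        x_prev, p_prev = x_t, p_t  # recurrence: feed the new frame forward
    return torch.stack(frames, dim=1)  # (batch, time, C, H, W)
```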