Hi, I see that the paper describes a setting where, given a sketch or image plus text, a video is generated. Have you evaluated this setting with any quantitative metrics? I don't see such results in the paper. In addition, in Table 2, the frame consistency of videos generated from a given depth sequence (motion information) and text is not very high — is this caused by jitter? If possible, could you share the IDs of the 1000 WebVid videos used for testing in Table 2? We would like to follow your work and compare our method against yours.