Thanks for your interest.
We used the original inference code of VQ-Diffusion. For the video generation in Figure 3, we used ground-truth condition frames to predict the next frame, i.e., predicting `s_{t+1}` from `s_{t-k:t}`. It does not use the initial frames alone to predict all future frames in an open loop.
I see. Thank you. Do you recall what 'k' is for the figure?
The value of `k` here is 1, the same as the one used in RL, which means we use 2 historical frames to predict one future frame.
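For concreteness, here is a minimal sketch of how the rollout for the figure works; the helper names (`encode_to_vq`, `sample_next`, `decode_from_vq`) are placeholders rather than the actual functions in the repo:

```python
# Minimal sketch (not the exact interface in the repo) of how the Figure 3
# frames are produced: each predicted frame s_{t+1} is conditioned on the
# ground-truth frames s_{t-k:t}, so earlier predictions are never fed back
# into the model (i.e., no open-loop rollout from the initial frames).

def rollout_with_ground_truth(gt_frames, encode_to_vq, sample_next, decode_from_vq, k=1):
    """Predict frame t+1 from ground-truth frames t-k..t, for every valid t.

    encode_to_vq:   frames -> VQ token indices             (placeholder name)
    sample_next:    conditional VQ-Diffusion sampling      (placeholder name)
    decode_from_vq: VQ token indices -> frame               (placeholder name)
    """
    predictions = []
    for t in range(k, len(gt_frames) - 1):
        history = gt_frames[t - k : t + 1]         # k + 1 = 2 frames when k = 1
        cond_tokens = encode_to_vq(history)        # frames -> VQ codes (condition)
        pred_tokens = sample_next(cond_tokens)     # sample VQ codes of frame t + 1
        predictions.append(decode_from_vq(pred_tokens))
    return predictions
```

The key point is that `history` always comes from the ground-truth frames, so prediction errors do not accumulate across the rollout.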
Thank you! I see.
I apologize if I misunderstand, but after taking a look at the VQ-Diffusion library, it seems that predicted frames can be conditioned on historical frames as well as a text input. Did you write your own code based on the original inference code of VQ-Diffusion, or were you able to use the provided VQ-Diffusion code as it is written?
We made some modifications to the original codebase of VQ-Diffusion.
As I recall, they do not provide an interface for using historical frames (frames -> VQ codes -> condition) as the condition, so we wrote this part ourselves, roughly along the lines of the sketch below.
We also incorporated Hydra to manage the configuration, which should be more convenient for users.
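A rough sketch of that frame-conditioning path follows; the class and method names (e.g., `FrameConditionEncoder`, `vqvae.encode`) are illustrative placeholders, not the actual interfaces in VQ-Diffusion or our repo:

```python
import torch
import torch.nn as nn

class FrameConditionEncoder(nn.Module):
    """Turn the k+1 historical frames into a condition sequence for the sampler."""

    def __init__(self, vqvae, codebook_size, embed_dim):
        super().__init__()
        self.vqvae = vqvae                                  # pretrained VQ encoder/decoder
        self.token_embed = nn.Embedding(codebook_size, embed_dim)

    @torch.no_grad()
    def frames_to_tokens(self, frames):
        # frames: (B, k+1, C, H, W) -> flat token indices: (B, (k+1) * h * w)
        b = frames.shape[0]
        # assumed: the VQ encoder maps images to discrete code indices per spatial location
        tokens = self.vqvae.encode(frames.flatten(0, 1))
        return tokens.reshape(b, -1)

    def forward(self, frames):
        tokens = self.frames_to_tokens(frames)              # frames -> VQ codes
        return self.token_embed(tokens)                     # (B, L, embed_dim) condition
```

The idea is that these frame tokens play the role that the text condition plays in the original VQ-Diffusion pipeline.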
Thank you for that extra information. Do you have this code on hand to share?
We have already shared this code in this repo; you may refer to the VQ-Diffusion part of the repo for more details.
Closing because of inactivity.
Hello,
I've been trying to apply your method to a different environment, but I'm running into issues at the reinforcement learning step. I suspect the trained VQ-Diffusion model may be struggling to generate good-quality next-frame predictions. Could you share how you rolled out the 14 image predictions in Figure 3 from the initial input image? Did you use a piece of the codebase you shared? Thank you for your help!