chenxwh closed this pull request 7 months ago.
Thanks @chenxwh, will take a look into it.
Max
@chenxwh Thank you for your hard work! I was wondering if we could add "ddim_init_latents_t_idx": 0 (default), "pnp_f_t": 1.0 (default), "pnp_spatial_attn_t": 1.0 (default), and "pnp_temp_attn_t": 1.0 (default) to the tweakable configs on the Replicate page.
Sure, happy to! Could you maybe provide short descriptions of those variables so I can add them to the demo too? I think it'll help people understand better how to set them. Thank you!
ddim_init_latents_t_idx: This parameter determines the time-step index at which to begin sampling from the initial DDIM-inverted latents, with a range of [0, num_sampling_steps-1] and a default value of 0. In a DDIM sampling process with 50 sampling steps, the scheduler progresses through the time steps [981, 961, 941, ..., 1]. Setting ddim_init_latents_t_idx to 0 therefore initiates sampling from t=981, whereas setting it to 1 starts the process at t=961. A higher index enhances motion consistency with the source video but may lead to flickering and cause the edited video to diverge from the edited first frame.

pnp_f_inject_t: Specifies the proportion of time steps in the DDIM sampling process where the convolutional injection is applied. The value ranges from [0.0, 1.0], with the default set to 1.0, indicating injection at every time step.

pnp_spatial_attn_t: Specifies the proportion of time steps in the DDIM sampling process where the spatial attention injection is applied. The value ranges from [0.0, 1.0], with the default set to 1.0, indicating injection at every time step.

pnp_temp_attn_t: Specifies the proportion of time steps in the DDIM sampling process where the temporal attention injection is applied. The value ranges from [0.0, 1.0], with the default set to 1.0, indicating injection at every time step.

For pnp_f_inject_t, pnp_spatial_attn_t, and pnp_temp_attn_t, a higher value improves motion consistency with the source video. However, if the edited first frame differs too much from the original first frame, a higher value may cause flickering.

Thanks @vinesmsuic, I have added those to the demo now!
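To make the mapping concrete, here is a minimal sketch of how these settings relate to the DDIM schedule described above. It is only an illustration, not the AnyV2V source: it assumes a 1000-step training schedule and that each injection is applied to the first (noisiest) fraction of the sampling steps.

```python
# Illustrative sketch only -- not the AnyV2V implementation.
num_train_timesteps = 1000
num_sampling_steps = 50

ddim_init_latents_t_idx = 0   # index into the sampled timesteps below
pnp_f_inject_t = 1.0          # fraction of steps with convolutional injection
pnp_spatial_attn_t = 1.0      # fraction of steps with spatial-attention injection
pnp_temp_attn_t = 1.0         # fraction of steps with temporal-attention injection

# With 50 DDIM steps the scheduler visits t = 981, 961, 941, ..., 1.
stride = num_train_timesteps // num_sampling_steps                    # 20
timesteps = list(range(num_train_timesteps - stride + 1, 0, -stride))

# Sampling starts from the DDIM-inverted latent at this timestep:
# index 0 -> t=981, index 1 -> t=961, and so on.
start_t = timesteps[ddim_init_latents_t_idx]

def injected_steps(fraction, steps):
    """Timesteps that receive injection, counted from the start of sampling."""
    return steps[: int(fraction * len(steps))]

conv_steps = injected_steps(pnp_f_inject_t, timesteps)       # all 50 steps at 1.0
spatial_steps = injected_steps(pnp_spatial_attn_t, timesteps)
temporal_steps = injected_steps(pnp_temp_attn_t, timesteps)

print(start_t, len(conv_steps), len(spatial_steps), len(temporal_steps))  # 981 50 50 50
```

With the defaults (index 0 and all three proportions at 1.0), sampling starts at t=981 and every step receives all three injections.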
@chenxwh Thanks! Please check out these updated config descriptions:

ddim_init_latents_t_idx: This parameter determines the time-step index at which to begin sampling from the initial DDIM-inverted latents, with a range of [0, num_sampling_steps-1] and a default value of 0. In a DDIM sampling process with 50 sampling steps, the scheduler progresses through the time steps [981, 961, 941, ..., 1]. Setting ddim_init_latents_t_idx to 0 therefore initiates sampling from t=981, whereas setting it to 1 starts the process at t=961. A higher index enhances motion consistency with the source video but may lead to flickering and cause the edited video to diverge from the edited first frame.

pnp_f_inject_t: Specifies the proportion of time steps in the DDIM sampling process where the convolutional injection is applied. The value ranges from [0.0, 1.0], with the default set to 1.0, indicating injection at every time step.

pnp_spatial_attn_t: Specifies the proportion of time steps in the DDIM sampling process where the spatial attention injection is applied. The value ranges from [0.0, 1.0], with the default set to 1.0, indicating injection at every time step.

pnp_temp_attn_t: Specifies the proportion of time steps in the DDIM sampling process where the temporal attention injection is applied. The value ranges from [0.0, 1.0], with the default set to 1.0, indicating injection at every time step.

For pnp_f_inject_t, pnp_spatial_attn_t, and pnp_temp_attn_t, a higher value improves motion consistency with the source video. However, if the edited first frame differs too much from the original first frame, a higher value may cause flickering.

@chenxwh Right, we found that 1.0 on the three pnp injection values works best for prompt-based editing on I2VGen-XL, so maybe it's better to use 1.0 for the demo. Sorry for the confusion. Can you create another commit for the change?
Sure! The latest changes reflect the updated default values, with detailed descriptions for each. An updated example has also been added to the demo.
Thanks for the merge! I have redirected the page to https://replicate.com/tiger-ai-lab/anyv2v and added you to the tiger-ai-lab org (https://replicate.com/tiger-ai-lab) so you have the authority to make any changes to the page! And always happy to help push updates :)
@chenxwh Thanks a lot for the contribution! Could you also add me to the tiger-ai-lab org (https://replicate.com/tiger-ai-lab)? :)
Sure thing @lim142857! Just added you as well :D
Hi @chenxwh, I wonder if we can modify the demo to allow users to input their own edited_1st_frame to override the instruction prompt if provided? It seems a lot of users want to try it with their own edited first frame instead of the instructpix2pix output.
Sure, I will make the changes later today :)
I think letting people upload an image probably causes too much overhead. People might need to visit another website to do it. It's a bit complex.
@chenxwh I'm wondering whether it's possible to break the demo down into two steps, because the first-step result from instructpix2pix is not very stable. We can sweep several hparams (random seed, cfg params) to let instructpix2pix generate a few different images (they can even re-run this until they are happy with it). This should be quite cheap. Then a user can click on the image they are most satisfied with to continue to video generation. This will dramatically increase the success rate.
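For reference, here is a rough sketch of the kind of sweep being suggested, assuming the diffusers InstructPix2Pix pipeline; the model id, seeds, prompt, and guidance values are placeholders rather than the demo's actual settings.

```python
# Illustrative seed / guidance sweep for the first-frame edit; not the demo code.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

first_frame = Image.open("first_frame.png").convert("RGB")   # placeholder input
instruction = "turn the man into a robot"                    # placeholder prompt

candidates = []
for seed in (0, 1, 2):
    for image_guidance_scale in (1.0, 1.5):   # how strongly to stay close to the input frame
        generator = torch.Generator("cuda").manual_seed(seed)
        edited = pipe(
            instruction,
            image=first_frame,
            guidance_scale=7.5,
            image_guidance_scale=image_guidance_scale,
            generator=generator,
        ).images[0]
        candidates.append(((seed, image_guidance_scale), edited))

# The user would then pick their favourite candidate and pass it to the
# video-editing stage as the edited first frame.
```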
The demo on the website only supports end-to-end inference. So I think the best way is to offer the option of either using the default full pipeline or accepting a provided first frame obtained from the existing instructpix2pix model.
Hi @chenxwh, just discussed with @wenhuchen and we would love to stick to the original plan (modify the demo to allow users to input their own edited_1st_frame to override the instruction prompt if provided). Really appreciate your help :)
A new version is pushed to Replicate now :) and I've opened another PR for the change.
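For anyone following along, here is a hedged sketch of what the updated interface could look like in cog; the input names and the two helper methods are placeholders, not necessarily what the actual predict.py uses.

```python
# Sketch only: input names and helpers are illustrative, not the real predict.py.
from cog import BasePredictor, Input, Path


class Predictor(BasePredictor):
    def predict(
        self,
        video: Path = Input(description="Source video to edit"),
        prompt: str = Input(description="InstructPix2Pix editing instruction", default=None),
        edited_first_frame: Path = Input(
            description="Optional pre-edited first frame; if provided, it overrides "
            "the instruction prompt and the InstructPix2Pix step is skipped",
            default=None,
        ),
        ddim_init_latents_t_idx: int = Input(default=0),
        pnp_f_t: float = Input(default=1.0),
        pnp_spatial_attn_t: float = Input(default=1.0),
        pnp_temp_attn_t: float = Input(default=1.0),
    ) -> Path:
        # Prefer the user-supplied first frame; otherwise fall back to editing
        # the video's first frame with InstructPix2Pix using the prompt.
        first_frame = edited_first_frame or self._edit_first_frame(video, prompt)
        return self._run_anyv2v(
            video, first_frame,
            ddim_init_latents_t_idx, pnp_f_t, pnp_spatial_attn_t, pnp_temp_attn_t,
        )

    def _edit_first_frame(self, video: Path, prompt: str) -> Path:
        ...  # placeholder: InstructPix2Pix edit of the extracted first frame

    def _run_anyv2v(self, video: Path, first_frame: Path, *pnp_args) -> Path:
        ...  # placeholder: DDIM inversion + feature/attention injection + I2VGen-XL
```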
Hi @vinesmsuic @lim142857 @wren93,

Very cool project on AnyV2V! This pull request makes it possible to run AnyV2V on Replicate (https://replicate.com/cjwbw/AnyV2V) and via the API (https://replicate.com/cjwbw/AnyV2V/api). Currently, the demo includes prompt-based video editing. Also, we'd like to transfer/redirect the demo page to TIGER-AI-Lab so you can make modifications easily, and we're happy to help maintain/integrate the upcoming changes :)
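For anyone who would rather call the model programmatically, here is a minimal sketch with the Replicate Python client; the model slug (the page was later redirected to tiger-ai-lab/anyv2v, as mentioned above) and the input fields are placeholders, so check the API page for the actual schema and version.

```python
# Illustrative only -- see the model's API page for the real input schema.
import replicate

output = replicate.run(
    "tiger-ai-lab/anyv2v",   # or pin a version: "tiger-ai-lab/anyv2v:<version-id>"
    input={
        "video": open("source.mp4", "rb"),
        "prompt": "turn the man into a robot",
        "ddim_init_latents_t_idx": 0,
        "pnp_f_t": 1.0,
        "pnp_spatial_attn_t": 1.0,
        "pnp_temp_attn_t": 1.0,
    },
)
print(output)
```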