Weifeng-Chen / control-a-video

Official Implementation of "Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models"
GNU General Public License v3.0

The size of tensor a (4) must match the size of tensor b (8) #19

Open G-force78 opened 1 year ago

G-force78 commented 1 year ago

Using these arguments

!python3 /content/control-a-video/inference.py --prompt "a bear practicing kungfu, with a background of mountains" --input_video /content/kungfubear.mp4 --control_mode depth --num_sample_frames 24 --inference_step 10 --guidance_scale 5 --init_noise_thres 0.75

Output at 8 FPS: demo.gif

```
/content/control-a-video/inference.py:119

  116
  117 out = []
  118 for i in range(num_sample_frames//each_sample_frame):
❱ 119     out1 = video_controlnet_pipe(
  120         # controlnet_hint= control_maps[:,:,:each_sample_frame,:,:…
  121         # images= v2v_input_frames[:,:,:each_sample_frame,:,:],
  122         controlnet_hint=control_maps[:,:,i*each_sample_frame-1:(i+…

/usr/local/lib/python3.10/dist-packages/torch/autograd/grad_mode.py:27 in decorate_context

   24     @functools.wraps(func)
   25     def decorate_context(*args, **kwargs):
   26         with self.clone():
❱  27             return func(*args, **kwargs)
   28     return cast(F, decorate_context)
   29
   30 def _wrap_generator(self, func):

/content/control-a-video/model/video_diffusion/pipelines/pipeline_stable_diffusion_controlnet3d.py:418 in __call__

  415                 if controlhint_in_uncond:
  416                     control_maps_single_frame = control_maps_singl…
  417
❱ 418                 down_block_res_samples_single_frame, mid_block_res…
  419                     latent_model_input_single_frame,
  420                     t,
  421                     encoder_hidden_states=text_embeddings…

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1194 in _call_impl

  1191         # this function, and just call forward.
  1192         if not (self._backward_hooks or self._forward_hooks or self…
  1193                 or _global_forward_hooks or _global_forward_pre_hooks…
❱ 1194             return forward_call(*input, **kwargs)
  1195         # Do not call functions when jit is used
  1196         full_backward_hooks, non_full_backward_hooks = [], []
  1197         if self._backward_hooks or _global_backward_hooks:

/content/control-a-video/model/video_diffusion/models/controlnet3d.py:464 in forward

  461         controlnet_cond = self.controlnet_cond_embedding(controlnet_co…
  462         # print(sample.shape, controlnet_cond.shape)
  463
❱ 464         sample += controlnet_cond
  465         # 3. down
  466
  467         down_block_res_samples = (sample,)
```
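For what it's worth, the failing operation is the in-place add of the ControlNet condition onto the UNet sample in controlnet3d.py, and the mismatch (4 vs 8) is on the frame dimension. A minimal sketch reproduces the same RuntimeError; the exact shapes below are assumptions inferred from the error message, not taken from the repo:

```python
import torch

# Hypothetical shapes: suppose the latent `sample` carries 4 frames while the
# ControlNet condition embedding carries 8, e.g. because the control maps were
# sliced with a different frame window than the latents.
sample = torch.randn(2, 320, 4, 32, 32)           # (batch, channels, frames=4, h, w)
controlnet_cond = torch.randn(2, 320, 8, 32, 32)  # (batch, channels, frames=8, h, w)

# Same failure mode as controlnet3d.py line 464:
# RuntimeError: The size of tensor a (4) must match the size of tensor b (8)
sample += controlnet_cond
```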

G-force78 commented 1 year ago

What is the relationship between fps, num_sample_frames, and the length of the output video? Also, what does `--sampling_rate` ("skip sampling from the input video") actually mean? I notice the default value is 3; what does that do?
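For rough intuition, here is a hedged back-of-the-envelope sketch. It assumes `--sampling_rate N` means every N-th input frame is kept and that the output GIF plays back at the given FPS; both are assumptions, not confirmed by the repo:

```python
# Hypothetical helpers, not part of the repo.
def output_duration_s(num_sample_frames: int, fps: int) -> float:
    """Length of the generated clip if it contains num_sample_frames frames played at `fps`."""
    return num_sample_frames / fps

def input_frames_consumed(num_sample_frames: int, sampling_rate: int) -> int:
    """Source frames read if only every sampling_rate-th input frame is kept."""
    return num_sample_frames * sampling_rate

print(output_duration_s(24, 8))      # 3.0 -> a ~3 second GIF at 8 FPS
print(input_frames_consumed(24, 3))  # 72  -> ~72 source frames (~3 s of a 24 FPS video)
```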

Cool setup, by the way; it's like an open-source version of Runway Gen-1. I imagine they used similar tricks and just have many GPUs to run it.

Weifeng-Chen commented 1 year ago

The name should be fixed; the current one may not be easy to understand.