LPengYang / MotionClone

Official implementation of MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Gets an all-nan video #6

Closed askerlee closed 4 months ago

askerlee commented 4 months ago

I tried to do inference with the following command: python3 sample.py --config configs/inference_config/astronaut.yaml (the only things I changed were using a revised version of RealisticVision and fixing some mismatching ckpt key names).
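For reference, the key-name fix was along these lines. This is just a sketch: the checkpoint path and the entries in `rename_map` below are hypothetical placeholders, not the actual mismatched names.

```python
import torch

# Hypothetical sketch of remapping mismatched checkpoint key names before loading;
# the checkpoint path and the rename_map entries are placeholders.
ckpt = torch.load("path/to/realisticvision.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

rename_map = {
    "cond_stage_model.transformer.embeddings.":
        "cond_stage_model.transformer.text_model.embeddings.",
}

fixed = {}
for key, value in state_dict.items():
    for old, new in rename_map.items():
        if key.startswith(old):
            key = new + key[len(old):]
            break
    fixed[key] = value

# missing, unexpected = model.load_state_dict(fixed, strict=False)
```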

It produces a video tensor in which all elements are nan:

sample.py:line 79
(Pdb) np.isnan(videos).sum()
12582912
(Pdb) videos.size
12582912

I've checked and made sure the inversion file inversion/inverted_data_astronaut.pkl doesn't contain nan values (the 'all_latents_inversion' and 'inversion_prompt_embeds' tensors look normal).
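The check was along these lines (a minimal sketch; it assumes the pickle holds a dict and that the entries below are tensors or nested containers of tensors):

```python
import pickle
import torch

# Load the inversion file; assumes it is a pickled dict of tensors / containers of tensors.
with open("inversion/inverted_data_astronaut.pkl", "rb") as f:
    data = pickle.load(f)

def count_nans(obj):
    # Recursively count NaN elements in tensors nested inside lists/tuples/dicts.
    if torch.is_tensor(obj):
        return torch.isnan(obj).sum().item() if obj.is_floating_point() else 0
    if isinstance(obj, (list, tuple)):
        return sum(count_nans(x) for x in obj)
    if isinstance(obj, dict):
        return sum(count_nans(v) for v in obj.values())
    return 0

for name in ["all_latents_inversion", "inversion_prompt_embeds"]:
    print(name, "->", count_nans(data[name]), "NaN elements")
```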

Any thought why this might happen? Thanks.

Bujiazi commented 4 months ago

Could you please confirm if the inference works correctly when using the recommended RealisticVision version?

askerlee commented 4 months ago

I just tried the recommended version and it's the same. BTW I used the latest diffusers (0.29.2) and pytorch 1.13, but I guess that won't cause NaNs...

Bujiazi commented 4 months ago

When using the environment we provided (environment.yaml), have you encountered similar issues?

askerlee commented 4 months ago

I did some simple debugging. The values in latents_group (pipeline.py:L757) seem to increase quickly and eventually explode. After one iteration, latents_group.abs().max() is 40, then 41, 42, ... I tried to address this by setting grad_guidance_threshold to 0.1, but just got a messy video: https://github.com/Bujiazi/MotionClone/assets/1575461/728f1f01-ff10-448f-917f-112ca7414329
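My understanding of what this threshold does is roughly the following (a minimal sketch with placeholder names; the actual pipeline.py code may differ):

```python
import torch

def guided_update(latents, grad, step_size, grad_guidance_threshold=None):
    # Sketch of gradient-thresholded guidance (placeholder function, not the
    # actual MotionClone pipeline code): clip each gradient element so that a
    # single guidance step cannot blow up the latents.
    if grad_guidance_threshold is not None:
        grad = grad.clamp(-grad_guidance_threshold, grad_guidance_threshold)
    return latents - step_size * grad
```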

Now I've changed grad_guidance_threshold to 1, but it seems it's still going to explode (just a bit slower than without setting grad_guidance_threshold):

tensor(48.9375, device='cuda:0', dtype=torch.float16) 15%|██████████▏ | 77/500 [02:50<15:39, 2.22s/it]
tensor(49.4375, device='cuda:0', dtype=torch.float16) 16%|██████████▎ | 78/500 [02:52<15:37, 2.22s/it]
tensor(49.8750, device='cuda:0', dtype=torch.float16) 16%|██████████▍ | 79/500 [02:54<15:36, 2.22s/it]
tensor(50.3125, device='cuda:0', dtype=torch.float16) 16%|██████████▌ | 80/500 [02:57<15:35, 2.23s/it]
tensor(50.5000, device='cuda:0', dtype=torch.float16) 16%|██████████▋ | 81/500 [02:59<15:32, 2.23s/it]
tensor(51.2812, device='cuda:0', dtype=torch.float16) 16%|██████████▊ | 82/500 [03:01<15:31, 2.23s/it]
tensor(51.4375, device='cuda:0', dtype=torch.float16) 17%|██████████▉ | 83/500 [03:03<15:28, 2.23s/it]
tensor(51.8750, device='cuda:0', dtype=torch.float16) 17%|███████████ | 84/500 [03:06<15:27, 2.23s/it]
tensor(53.2188, device='cuda:0', dtype=torch.float16) 17%|███████████▏ | 85/500 [03:08<15:22, 2.22s/it]
tensor(52.9062, device='cuda:0', dtype=torch.float16) 17%|███████████▎ | 86/500 [03:10<15:19, 2.22s/it]
tensor(53.6562, device='cuda:0', dtype=torch.float16) 17%|███████████▍ | 87/500 [03:12<15:17, 2.22s/it]
tensor(55.1562, device='cuda:0', dtype=torch.float16) 18%|███████████▌ | 88/500 [03:14<15:15, 2.22s/it]
tensor(55.6562, device='cuda:0', dtype=torch.float16) 18%|███████████▋ | 89/500 [03:17<15:10, 2.22s/it]
tensor(55.9062, device='cuda:0', dtype=torch.float16)

EDIT: after 500 iterations the max value is 464 and I got another messy video: https://github.com/Bujiazi/MotionClone/assets/1575461/09b756d5-8a73-41fd-a1fa-4bb0c22cf185

Bujiazi commented 4 months ago

Thanks for the feedback. Due to discrepancies between the versions of Diffusers and Torch in your environment and the versions we recommend, you may encounter some unexpected issues 😂. We strongly recommend using the environment we have specified: conda env create -f environment.yaml
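After activating it, a quick sanity check like the following confirms which versions the interpreter actually picks up:

```python
import torch
import diffusers

# Print the active versions to confirm the intended environment is being used.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("diffusers:", diffusers.__version__)
```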

askerlee commented 4 months ago

Thanks. Yeah, I just used the recommended environment and it seems the tensor values are normal. Will update once it's finished.

askerlee commented 4 months ago

Nice, eventually got the right video! https://github.com/Bujiazi/MotionClone/assets/1575461/9c96e7c9-fcf2-4538-b227-6b8206ec5aae

It's a bit dark but I'm happy it works. Any method to make it brighter? Maybe add "bright lighting" to the prompt?

askerlee commented 4 months ago

BTW the first 300 iterations are pretty slow (2.22 s/it on an A6000), but the last 200 iters run at 1.47 it/s (about 3.26× faster than the first 300). It seems the prompt conditioning takes a lot of time. Have you tried applying the prompt conditioning only once every N iterations to speed things up?

Bujiazi commented 4 months ago

It is great to see that you have successfully run MotionClone 😄. Feel free to try various prompts. When we ran the astronaut example, we obtained a bright result like this: image

askerlee commented 4 months ago

Guidance on every step of the first 300 steps. Took 802s. https://github.com/Bujiazi/MotionClone/assets/1575461/5533e38b-3c52-4cad-a836-e7d800bfe3a9

Guidance on 1 out of every 3 steps of the first 300 steps. Took 493s (40% speed up). https://github.com/Bujiazi/MotionClone/assets/1575461/1c940837-78ed-4a1d-a5d4-55cffe799969

The cross-frame consistency looks worse. Despite that, it looks ok.

Guidance once every 2 steps of the first 300 steps (skip half). Now it looks as good as without skipping, but only takes 572s (30% speed up) https://github.com/Bujiazi/MotionClone/assets/1575461/4d6ea298-b792-4bcc-ab77-51f012f9d5af
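For reference, the skipping schedule I used is roughly the following (a minimal sketch; all names are placeholders rather than the actual MotionClone pipeline code):

```python
# Sketch of the guidance-skipping schedule described above; all names here are
# placeholders, not the actual MotionClone pipeline code.

def compute_motion_guidance(latents, step_idx):
    return latents  # stand-in for the expensive gradient-based guidance step

def scheduler_step(latents, step_idx):
    return latents  # stand-in for the ordinary (cheap) denoising update

num_steps = 500
guidance_steps = 300      # guidance is only applied in the first 300 steps
guidance_interval = 2     # 1 = every step, 2 = skip half, 3 = 1 out of every 3

latents = 0.0             # stand-in for the latent tensor
for step_idx in range(num_steps):
    if step_idx < guidance_steps and step_idx % guidance_interval == 0:
        latents = compute_motion_guidance(latents, step_idx)
    latents = scheduler_step(latents, step_idx)
```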

Bujiazi commented 4 months ago

Thank you very much for your exploration 🌹, the results are indeed impressive 😘. We also tried a similar skipping mechanism in the early stages of our experiments, but found it was not very stable and had some probability of failing in certain cases; perhaps the steps we skipped were too large. We will consider incorporating a stable skipping mechanism in future optimized versions to accelerate inference.

askerlee commented 4 months ago

Glad that it helps! Will go back to integrate my component 😄

LPengYang commented 1 month ago

We have updated the code. MotionClone now 1) directly performs motion customization without cumbersome video inversion, and 2) significantly reduces memory consumption. In our experiments, for 16×512×512 text-to-video generation the memory consumption is about 14 GB; for MotionClone combined with image-to-video or sketch-to-video it is about 22 GB. Hope this helps.
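If anyone wants to verify these numbers on their own GPU, the peak allocation can be read back after a run, e.g.:

```python
import torch

# Reset the peak-memory counter before sampling, then read it back afterwards.
torch.cuda.reset_peak_memory_stats()
# ... run the MotionClone sampling here ...
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```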