THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

How to fully fine-tune CogVideoX1.5-5B-I2V? #526

Open JWargrave opened 2 days ago

JWargrave commented 2 days ago

Can I modify examples/cogvideo/train_cogvideox_image_to_video_lora.py in diffusers into a full fine-tuning script? The training script sat/finetune_multi_gpus.sh has some bugs.

zRzRzRzRzRzRzR commented 2 days ago

We are doing fine-tuning work in cogvideox-factory; please use the diffusers version for fine-tuning, as the SAT version will run out of memory (OOM) even if it runs.

JWargrave commented 2 days ago

> We are doing fine-tuning work in cogvideox-factory; please use the diffusers version for fine-tuning, as the SAT version will run out of memory (OOM) even if it runs.

Does the diffusers version refer to train_cogvideox_image_to_video_lora.py? I ran it according to the instructions here, but encountered the following error:

[rank7]: Traceback (most recent call last):
[rank7]:   File "/suqinzs/jwargrave/CogVideo-305/sat/train_cogvideox_image_to_video_lora.py", line 1620, in <module>
[rank7]:     main(args)
[rank7]:   File "/suqinzs/jwargrave/CogVideo-305/sat/train_cogvideox_image_to_video_lora.py", line 1428, in main
[rank7]:     model_output = transformer(
[rank7]:                    ^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank7]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank7]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank7]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/accelerate/utils/operations.py", line 823, in forward
[rank7]:     return model_forward(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/accelerate/utils/operations.py", line 811, in __call__
[rank7]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank7]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
[rank7]:     return func(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/jwargrave/diffusers-4729/src/diffusers/models/transformers/cogvideox_transformer_3d.py", line 470, in forward
[rank7]:     ofs_emb = self.ofs_proj(ofs)
[rank7]:               ^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/anaconda3/envs/zym-cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/jwargrave/diffusers-4729/src/diffusers/models/embeddings.py", line 928, in forward
[rank7]:     t_emb = get_timestep_embedding(
[rank7]:             ^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/suqinzs/jwargrave/diffusers-4729/src/diffusers/models/embeddings.py", line 54, in get_timestep_embedding
[rank7]:     assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array"
[rank7]:                ^^^^^^^^^^^^^^^
[rank7]: AttributeError: 'NoneType' object has no attribute 'shape'

The training script is as follows (Note that the pretrained_model_name_or_path is CogVideoX1.5-5B-I2V):

#!/bin/bash

clear

GPU_IDS="0,1,2,3,4,5,6,7"

accelerate launch --gpu_ids $GPU_IDS train_cogvideox_image_to_video_lora.py \
  --pretrained_model_name_or_path ./pretrained_weights/CogVideoX1.5-5B-I2V \
  --cache_dir ./cache \
  --instance_data_root ./storyboard_data_for_cog \
  --caption_column prompts_cogvlm2_debug.txt \
  --video_column videos_debug.txt \
  --validation_prompt "A woman is sitting in a basket under a tree. She is holding a pink flower and looking at it. The basket is made of straw and the woman is wearing a white dress. There are green leaves on the ground around her." \
  --validation_images "a.jpg" \
  --num_validation_videos 1 \
  --validation_epochs 10 \
  --seed 42 \
  --rank 64 \
  --lora_alpha 64 \
  --mixed_precision fp16 \
  --output_dir ./output-cogvideox-lora \
  --height 480 --width 720 --fps 8 --max_num_frames 49 --skip_frames_start 0 --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 30 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 200 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer Adam \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0
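
The traceback shows `ofs` arriving as `None` inside `self.ofs_proj(ofs)`. CogVideoX1.5 checkpoints set `ofs_embed_dim` in the transformer config, so `forward()` expects a 1-D `ofs` tensor, which a training loop written for the 1.0 models never passes. As a hedged sketch (the `make_ofs` helper and `DummyConfig` are illustrative, not part of diffusers), the missing tensor can be built the way the image-to-video pipeline appears to, with fill value 2.0:

```python
import torch

class DummyConfig:
    """Stand-in for transformer.config; CogVideoX1.5 sets ofs_embed_dim, 1.0 models leave it None."""
    ofs_embed_dim = 512  # illustrative value, not the real checkpoint config

def make_ofs(config, latents: torch.Tensor):
    """Build the 1-D `ofs` tensor the 1.5 transformer expects (None for 1.0 models)."""
    if getattr(config, "ofs_embed_dim", None) is None:
        return None
    # mirror the fill value the diffusers image-to-video pipeline uses
    return latents.new_full((1,), fill_value=2.0)

latents = torch.randn(2, 16, 13, 60, 90)
ofs = make_ofs(DummyConfig(), latents)
# a 1-D tensor satisfies the "Timesteps should be a 1d-array" assert in get_timestep_embedding
assert ofs is not None and ofs.ndim == 1
print(ofs)
```

In the training loop this tensor would then be passed as `ofs=ofs` in the `transformer(...)` call whose frame appears in the traceback above.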

zRzRzRzRzRzRzR commented 17 hours ago

Need to check it out @zhipuch

lijain commented 15 hours ago

> We are doing fine-tuning work in cogvideox-factory; please use the diffusers version for fine-tuning, as the SAT version will run out of memory (OOM) even if it runs.

Hello, so the current SAT version needs 76 GB of VRAM because it does full-parameter training, while the diffusers version does LoRA fine-tuning, right?
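
For context on that memory gap: full fine-tuning updates (and keeps optimizer state for) every transformer weight, while the diffusers LoRA script freezes the base weights and trains only low-rank adapters (rank 64 in the script above). A minimal, self-contained sketch of the idea (the `LoRALinear` class is illustrative, not the peft implementation diffusers actually uses):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: freeze the base weight W, train only low-rank A and B."""
    def __init__(self, base: nn.Linear, r: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights contribute no gradients or optimizer state
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op on top of the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x))

# a 3072-wide projection, roughly transformer-block sized
layer = LoRALinear(nn.Linear(3072, 3072), r=64)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3f}")  # only a few percent of the weights train
```

Only the adapter parameters need gradients and Adam moments, which is why the LoRA run fits where full-parameter SAT training needs tens of gigabytes.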