junsukha commented 3 weeks ago

I2V.md says that

In the current version, we have only open-sourced the 93x480p version of the Image-to-Video (I2V) model.

So I presume the open-sourced version mentioned here is any93x640x640_i2v (https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/any93x640x640_i2v)

Then, do you recommend to use the videos of 480p (e.g. 640x480, wxh) for I2V fine-tuning? I used videos of 640x352 resolution for I2V fine-tuning and the sampled results don't seem to look good. Wondering if the video shape used for fine-tuning impacts the result.

EDIT

Just checked v1.2 and it seems the 93x480p version is referring to v1.2 weight.

v1.3 has trained I2V for two stages, and the second stage is

Stage 2: Any resolution and duration within 93x236544 (e.g., 480x480, 640x352, 352x640), using filtered motion and aesthetic high-quality data:

So that implies I can use 640x352 resolution videos for I2V fine-tuning, I guess?

yunyangge commented 2 weeks ago

During training, we support resolutions of 640x480 (4:3) and 640x352 (16:9). However, since the majority of videos in the training set are in 16:9 format, the 640x352 shape will yield the best inference results. We recommend choosing a resolution of 640x352 for training. If your training data requires a 640x480 resolution, please ensure that the data volume is sufficiently ample.

yunyangge commented 2 weeks ago

If possible, could you please provide your configuration details (such as data volume, learning rate, and batch size) along with the sampling results? This will allow us to conduct a more accurate analysis.

junsukha commented 2 weeks ago

If possible, could you please provide your configuration details (such as data volume, learning rate, and batch size) along with the sampling results? This will allow us to conduct a more accurate analysis.

@yunyangge I came up with a couple questions while writing the reply.

Would learning rate 1e-6 perform better for fine-tuning?
I used --num_frames 33 while training but sampled 93 frames. Does this affect the result? If I wanted to sample a video of 93 frames, should I train with --num_frames 93?

Data volume: I have 261 items in json file. Each item consists of the following format. All items are of the same resolution and fps. I use this data 3408 times, i.e., copy & paste the same line 3408 times in data.txt which is used for --data parameter.

    {
        "path": "seg11.mp4",
        "cap": "The commercial showcases the sleek design and stylish features of the Genesis G80 Sport, highlighting its modern headlights and elegant rear, followed by a dynamic driving scene on a dark road. The price range of the car is displayed as 67,440,000 to 69,890,000 KRW.",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 328,
        "fps": 16
    }

Below is the output I get when the I run the training process.

no_cap: 0, no_resolution: 0
too_long: 92016, too_short: 98832
cnt_img_res_mismatch_stride: 0, cnt_vid_res_mismatch_stride: 0
cnt_img_res_too_small: 0, cnt_vid_res_too_small: 0
cnt_img_aspect_mismatch: 0, cnt_vid_aspect_mismatch: 0
cnt_filter_minority: 0
Counter(sample_size): Counter({'33x352x640': 630480, '29x352x640': 71568})
cnt_vid: 892896, cnt_vid_after_filter: 702048, use_ratio: 78.62599999999999%
cnt_img: 0, cnt_img_after_filter: 0, use_ratio: 0.0%
before filter: 892896, after filter: 702048, use_ratio: 78.62599999999999%
Build data time: 25.826427459716797
n_elements: 702048
Data length: 702048

learning rate: I used 1e-5. Should I use 1e-6?

batch size: Total batch size is 3. I use three A100 gpus. One sample per gpu. Refer to this:

11/11/2024 21:50:38 - INFO - __main__ -   Num examples = 702048
11/11/2024 21:50:38 - INFO - __main__ -   Num Epochs = 5
11/11/2024 21:50:38 - INFO - __main__ -   Instantaneous batch size per device = 1
11/11/2024 21:50:38 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
11/11/2024 21:50:38 - INFO - __main__ -   Gradient Accumulation steps = 1
11/11/2024 21:50:38 - INFO - __main__ -   Total optimization steps = 1000000
11/11/2024 21:50:38 - INFO - __main__ -   Total optimization steps (num_update_steps_per_epoch) = 234016
11/11/2024 21:50:38 - INFO - __main__ -   Total training parameters = 2.782713632 B
11/11/2024 21:50:38 - INFO - __main__ -   AutoEncoder = WFVAEModel_D8_4x8x8; Dtype = torch.bfloat16; Parameters = 0.147347724 B
11/11/2024 21:50:38 - INFO - __main__ -   Text_enc_1 = /mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl; Dtype = torch.bfloat16; Parameters = 5.65517312 B

Sampled video. Note I used --num_frames 33 while training but sampled 93 frames. k5_1-seg2-00 00 00 000

https://github.com/user-attachments/assets/bbca49bd-9546-4f44-ab1a-cf2b37248688

FYI, below is the configs I used for training and sampling. Training config

# car centric
export HF_DATASETS_OFFLINE=1 
export TRANSFORMERS_OFFLINE=1
export PDSH_RCMD_TYPE=ssh
# NCCL setting
export GLOO_SOCKET_IFNAME=bond0
export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_HCA=mlx5_10:1,mlx5_11:1,mlx5_12:1,mlx5_13:1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=162
export NCCL_IB_TIMEOUT=25
export NCCL_PXN_DISABLE=0
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_ALGO=Ring
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NCCL_IB_RETRY_CNT=32
# export NCCL_ALGO=Tree
export WANDB_MODE=dryrun

CUDA_VISIBLE_DEVICES=0,1,2 accelerate launch \
    --config_file scripts/accelerate_configs/deepspeed_zero2_config.yaml \
    opensora/train/train_inpaint.py \
    --model OpenSoraInpaint_v1_3-2B/122 \
    --text_encoder_name_1 /mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl \
    --cache_dir "../../cache_dir/" \
    --dataset inpaint \
    --data "/mnt/singularity_home/jsha/repos/Open-Sora-Plan/fine_tuning/car-centric-only-data.txt" \
    --ae WFVAEModel_D8_4x8x8 \
    --ae_path "/gpfs/vision/drag_video/HF_downloads/Open-Sora-Plan-v1.3.0/vae" \
    --sample_rate 1 \
    --num_frames 33 \
    --max_height 352 \
    --max_width 640 \
    --max_hxw 307200 \
    --min_hxw 102400 \
    --interpolation_scale_t 1.0 \
    --interpolation_scale_h 1.0 \
    --interpolation_scale_w 1.0 \
    --gradient_checkpointing \
    --train_batch_size=1 \
    --dataloader_num_workers 8 \
    --gradient_accumulation_steps=1 \
    --max_train_steps=1000000 \
    --learning_rate=1e-6 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --mixed_precision="bf16" \
    --report_to="wandb" \
    --checkpointing_steps=2500 \
    --allow_tf32 \
    --model_max_length 512 \
    --use_ema \
    --ema_start_step 0 \
    --cfg 0.1 \
    --resume_from_checkpoint="latest" \
    --speed_factor 1.0 \
    --ema_decay 0.9999 \
    --drop_short_ratio 0.0 \
    --hw_stride 32 \
    --sparse1d --sparse_n 4 \
    --train_fps 18 \
    --seed 1234 \
    --trained_data_global_step 0 \
    --group_data \
    --use_decord \
    --prediction_type "v_prediction" \
    --output_dir="output/fine_tuning/i2v/car_centric_after_gitpull/" \
    --rescale_betas_zero_snr \
    --mask_config scripts/train_configs/mask_config.yaml \
    --add_noise_to_condition \
    --default_text_ratio 0.5  \
    --num_train_epochs=1000 \
    --checkpoints_total_limit=10 \
    --pretrained="/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/Open-Sora-Plan-v1.3.0/any93x640x640_i2v" \

Mask config

# mask processor args
min_clear_ratio: 0.0
max_clear_ratio: 1.0 

# mask_type_ratio_dict_video
mask_type_ratio_dict_video:
  t2iv: 1
  i2v: 8
  transition: 8
  continuation: 2
  clear: 0
  random_temporal: 1

mask_type_ratio_dict_image:
  t2iv: 0
  clear: 0

Sampling config

CUDA_VISIBLE_DEVICES=4 torchrun --nnodes=1 --master_port 29513 \
    -m opensora.sample.sample \
    --model_type "inpaint" \
    --model_path "output/fine_tuning/i2v/car_centric_after_gitpull/checkpoint-40000/model_ema" \
    --version v1_3 \
    --num_frames 93 \
    --height 352 \
    --width 640 \
    --max_hxw 236544 \
    --crop_for_hw \
    --cache_dir "../cache_dir" \
    --text_encoder_name_1 "weights/google/mt5-xxl" \
    --text_prompt examples/cond_prompt.txt \
    --conditional_pixel_values_path examples/cond_pix_path.txt \
    --ae WFVAEModel_D8_4x8x8 \
    --ae_path "/gpfs/vision/drag_video/HF_downloads/Open-Sora-Plan-v1.3.0/vae" \
    --save_img_path "./save_path/i2v_finetuned/car_centric/checkpoint40000" \
    --fps 16 \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --max_sequence_length 512 \
    --sample_method EulerAncestralDiscrete \
    --seed 1234 \
    --num_samples_per_prompt 1 \
    --rescale_betas_zero_snr \
    --prediction_type "v_prediction" \
    --noise_strength 0.0 \
    --mask_type "i2v" \
    # --model_path "weights/Open-Sora-Plan-v1.3.0/any93x640x640_i2v" \

yunyangge commented 2 weeks ago

Thank you for providing the information. Here are the suggestions:

Check the dataset you are using. We require that there be no jump cuts in the training videos. I noticed some jump cuts in your sample results, so you need to process the dataset accordingly.
Use a larger batch size. We recommend a batch size of at least 16, which you can achieve through gradient accumulation.
Keep the frame count consistent between training and inference. If most of your data has 33 frames during training and has been repeated over too many epochs, the model weights are likely overfitted to 33 frames. Using these weights for inference on 93 frames may yield poor results.
If possible, prepare more data. We believe that for full fine-tuning, a set of over 10,000 clips is a more reliable choice.

yunyangge commented 2 weeks ago

Additionally, if you only need to train the image-to-video (i2v) task, you can set the i2v value in the mask config to 1 and set the other values to 0. This way, all data will be processed in the i2v format.

junsukha commented 2 weeks ago

Thank you for providing the information. Here are the suggestions:

Check the dataset you are using. We require that there be no jump cuts in the training videos. I noticed some jump cuts in your sample results, so you need to process the dataset accordingly.

Use a larger batch size. We recommend a batch size of at least 16, which you can achieve through gradient accumulation.

Keep the frame count consistent between training and inference. If most of your data has 33 frames during training and has been repeated over too many epochs, the model weights are likely overfitted to 33 frames. Using these weights for inference on 93 frames may yield poor results.

If possible, prepare more data. We believe that for full fine-tuning, a set of over 10,000 clips is a more reliable choice.

@yunyangge Thx for the tips! As for the third tip, if I set --num_frames=33, does that mean it will use only 33 frames of each of my videos while training? If so, if I wanted to set -num_frames=93, I should use videos of at least 93 frames?

yunyangge commented 2 weeks ago

If you set --num_frames=n, the actual number of frames trained will be between 29 and n and must be a format of 4k+1. If the number of frames in your original video is greater than --num_frames, it will be truncated to num_frames. Therefore, if you set --num_frames to 93, the number of frames in your original video needs to be at least 93. I hope the above answer is helpful to you.

PKU-YuanGroup / Open-Sora-Plan

recommended to use videos of 480p for I2V fine-tuning? #532

EDIT