PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to this project.

v1.3 fine tuning duration too short #516

junsukha opened this issue 4 days ago

junsukha commented 4 days ago

Hi,

I'm fine-tuning v1.3 any93x640x640 (https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/any93x640x640) with videos of 352x640 (height x width) at 16 fps.

I see that 1 epoch (93 steps) takes only around 4 minutes. Is this expected? That seems too short to me. I'm using 2 A100 GPUs with a batch size of 1 per GPU.

Below is part of the JSON file that describes the video data.

[
    {
        "path": "/gpfs/vision/drag_video/0_datasets/open-sora-plan/videos/car-centric/encoded/모닝1-seg11.mp4",
        "cap": "A family is seen standing together outdoors, followed by a sleek white car driving smoothly across a modern bridge. The car is highlighted as the \"All New Smart Compact Morning.\"",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 66,
        "fps": 16
    },
    {
        "path": "/gpfs/vision/drag_video/0_datasets/open-sora-plan/videos/car-centric/encoded/베뉴1-seg48.mp4",
        "cap": "Two cars are seen driving on a dimly lit road, with one car passing the other. The scene transitions to a wide shot of a car driving towards a city skyline at dusk, highlighting the vehicle's rear design and branding.",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 78,
        "fps": 16
    },
    {
        "path": "/gpfs/vision/drag_video/0_datasets/open-sora-plan/videos/car-centric/encoded/모닝1-seg05.mp4",
        "cap": "The commercial showcases a sleek, white Kia Morning car, highlighting its modern design and stylish features as it drives through an urban environment. The tagline \"happy new morning\" emphasizes a fresh and positive start with this vehicle.",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 53,
        "fps": 16
    },
    {
        "path": "/gpfs/vision/drag_video/0_datasets/open-sora-plan/videos/car-centric/encoded/k7_1-seg5.mp4",
        "cap": "A sleek, dark-colored sedan is showcased driving smoothly on a modern bridge, highlighting its elegant design and emphasizing its award for being ranked first in the 2014 J.D. Power Initial Quality Study for large cars.",
        "resolution": {
            "height": 352,
            "width": 640
        },
        "num_frames": 36,
        "fps": 16
    },
...
]
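
(For reference, a minimal sketch, not part of the Open-Sora-Plan codebase, that loads this metadata JSON and reports each clip's duration and resolution against the target --num_frames; the file name is a placeholder.)

import json

TARGET_NUM_FRAMES = 33  # matches --num_frames in the training args below

# "video_metadata.json" is a placeholder for the JSON file shown above.
with open("video_metadata.json", "r", encoding="utf-8") as f:
    clips = json.load(f)

for clip in clips:
    duration = clip["num_frames"] / clip["fps"]
    res = clip["resolution"]
    status = "ok" if clip["num_frames"] >= TARGET_NUM_FRAMES else "too short"
    print(f'{clip["path"]}: {clip["num_frames"]} frames '
          f'({duration:.1f} s @ {clip["fps"]} fps), '
          f'{res["height"]}x{res["width"]} -> {status}')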

Below is part of the terminal output during training.

too_long: 25, too_short: 50
cnt_img_res_mismatch_stride: 0, cnt_vid_res_mismatch_stride: 0
cnt_img_res_too_small: 0, cnt_vid_res_too_small: 0
cnt_img_aspect_mismatch: 0, cnt_vid_aspect_mismatch: 0
cnt_filter_minority: 0
Counter(sample_size): Counter({'33x352x640': 170, '29x352x640': 17})

10/29/2024 02:26:11 - INFO - __main__ -   Num examples = 187
10/29/2024 02:26:11 - INFO - __main__ -   Num Epochs = 1000
10/29/2024 02:26:11 - INFO - __main__ -   Instantaneous batch size per device = 1
10/29/2024 02:26:11 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
10/29/2024 02:26:11 - INFO - __main__ -   Gradient Accumulation steps = 1
10/29/2024 02:26:11 - INFO - __main__ -   Total optimization steps = 93000
10/29/2024 02:26:11 - INFO - __main__ -   Total optimization steps (num_update_steps_per_epoch) = 93
10/29/2024 02:26:11 - INFO - __main__ -   Total training parameters = 2.7719816 B
10/29/2024 02:26:11 - INFO - __main__ -   AutoEncoder = WFVAEModel_D8_4x8x8; Dtype = torch.bfloat16; Parameters = 0.147347724 B
10/29/2024 02:26:11 - INFO - __main__ -   Text_enc_1 = /mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl; Dtype = torch.bfloat16; Parameters = 5.65517312 B
Checkpoint 'latest' does not exist. Starting a new training run.
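
(For reference, the logged numbers are consistent with the 2-GPU, batch-size-1 setup; a quick back-of-envelope check, assuming the dataloader drops the last incomplete batch, which is how 187 examples yield 93 steps per epoch.)

num_examples = 187        # "Num examples" from the log
total_batch_size = 2      # 2 GPUs x batch size 1, no gradient accumulation
num_epochs = 1000

steps_per_epoch = num_examples // total_batch_size   # 93
total_steps = steps_per_epoch * num_epochs           # 93000, as logged
secs_per_step = 4 * 60 / steps_per_epoch             # ~2.6 s/step if an epoch takes ~4 min

print(steps_per_epoch, total_steps, round(secs_per_step, 1))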

Below are the arguments I used:

            "args": [
                "--config_file",
                "scripts/accelerate_configs/deepspeed_zero2_config.yaml",
                "opensora/train/train_t2v_diffusers.py",
                "--model=OpenSoraT2V_v1_3-2B/122",
                "--text_encoder_name_1=/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl",
                "--cache_dir=../../cache_dir/",
                "--dataset=t2v",
                "--data=/mnt/singularity_home/jsha/repos/Open-Sora-Plan/fine_tuning/data.txt",
                "--ae=WFVAEModel_D8_4x8x8",
                "--ae_path",
                "/gpfs/vision/drag_video/HF_downloads/Open-Sora-Plan-v1.3.0/vae",
                "--sample_rate",
                "1",
                "--num_frames",
                "33",
                "--max_height",
                "352",
                "--max_width",
                "640",
                "--interpolation_scale_t",
                "1.0",
                "--interpolation_scale_h",
                "1.0",
                "--interpolation_scale_w",
                "1.0",
                "--gradient_checkpointing",
                "--train_batch_size",
                "1",
                "--dataloader_num_workers",
                "16",
                "--gradient_accumulation_steps",
                "1",
                // "--max_train_steps","100" ,
                "--learning_rate",
                "1e-5",
                "--lr_scheduler",
                "constant",
                "--lr_warmup_steps",
                "0",
                "--mixed_precision=bf16",
                "--report_to=tensorboard",
                "--checkpointing_steps=500",
                "--allow_tf32",
                "--model_max_length",
                "512",
                "--use_ema",
                "--ema_start_step",
                "0",
                "--cfg",
                " 0.1",
                "--resume_from_checkpoint=latest",
                "--speed_factor",
                "1.0",
                "--ema_decay",
                " 0.9999",
                "--drop_short_ratio",
                "0.0",
                // "--pretrained",
                // "",
                "--hw_stride",
                "32",
                "--sparse1d",
                "--sparse_n",
                "4",
                "--train_fps",
                "16",
                "--seed",
                "1234",
                "--trained_data_global_step",
                "0",
                "--group_data",
                "--use_decord",
                "--prediction_type",
                "v_prediction",
                "--snr_gamma",
                "5.0",
                "--force_resolution",
                "--rescale_betas_zero_snr",
                "--output_dir",
                "/mnt/singularity_home/jsha/repos/Open-Sora-Plan/output/fine_tuning/encoded-videos",
                "--pretrained=/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/Open-Sora-Plan-v1.3.0/any93x640x640",
                "--num_train_epochs=1000",
                "--checkpoints_total_limit=10"
                // "--sp_size=2", 
                // "--train_sp_batch_size=1"
            ],
LinB203 commented 3 days ago

I think that's normal. Normally it takes about 4 s for 93x352x640, and your videos are shorter.

junsukha commented 3 days ago

@LinB203 thanks for the reply! 4 s per step for 93x352x640? I see.

In my case, one step in the training phase means processing two videos (data samples) simultaneously, since I'm using two GPUs with a batch size of 1 per GPU. So it takes about 2.xx seconds per step, or per video (240 s / 93 steps = 2.xx s per step, because one epoch of 93 steps takes only around 4 minutes, as I mentioned before). (If this doesn't make sense, just ignore it; I think I explained it poorly. Or please correct me if I'm wrong.)

My question is: when I sample a video using the config below, it normally takes around 1-2 minutes to generate one video. Why does inference take so long compared to the training phase, where, I think, it takes about 2.xx seconds per video?

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --master_port 29514 \
    -m opensora.sample.sample \
    --model_path path_to_check_point_model_ema \
    --version v1_3 \
    --num_frames 33 \
    --height 352 \
    --width 640 \
    --cache_dir "../cache_dir" \
    --text_encoder_name_1 "/storage/ongoing/new/Open-Sora-Plan/cache_dir/mt5-xxl" \
    --text_prompt "examples/prompt.txt" \
    --ae WFVAEModel_D8_4x8x8 \
    --ae_path "/storage/lcm/WF-VAE/results/latent8" \
    --save_img_path "./train_1_3_nomotion_fps18" \
    --fps 16 \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --max_sequence_length 512 \
    --sample_method EulerAncestralDiscrete \
    --seed 1234 \
    --num_samples_per_prompt 1 \
    --rescale_betas_zero_snr \
    --prediction_type "v_prediction" 
LinB203 commented 3 days ago

You can see --num_sampling_steps 100, which means 100 denoising steps are used to generate each video.
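
(Rough back-of-envelope only; the per-pass cost below is an assumption, not a measurement, taken to be loosely comparable to the ~2.6 s training step, which also includes the backward pass.)

num_sampling_steps = 100   # from the sampling command above
secs_per_forward = 0.5     # assumed cost of one denoising forward pass at 33x352x640, batch size 1

denoising_time = num_sampling_steps * secs_per_forward   # ~50 s of denoising alone
# mT5-XXL text encoding and the WF-VAE decode add further overhead,
# which is consistent with the observed 1-2 minutes per generated video.
print(f"~{denoising_time:.0f} s of denoising alone")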

junsukha commented 3 days ago

You can see --num_sampling_steps 100, which means 100 denoising steps are used to generate each video.

@LinB203 thx for the reply!

num_sampling_steps is, I believe, the number of denoising steps. But doesn't a step in the training phase have a different meaning? The step I'm referring to in the training phase is the one shown in the attached screenshot of the fine-tuning progress bar.

So you're saying that one step in the training phase (the progress bar in the screenshot I attached above) is basically one denoising step, just like a sampling step (the --num_sampling_steps parameter) in the inference phase?

The output Total optimization steps (num_update_steps_per_epoch) = 93 means, I think, that it takes 93 steps to go through all the input videos once for training. If "step" here referred to a denoising step, that wouldn't make sense to me, because it would imply only 93 denoising steps are needed while going through all the input videos once.

UPDATE

Oh, I think I got it. You're right. num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) in train_t2v_diffusers.py says num_update_steps_per_epoch is the number of batches (len(train_dataloader)), since I'm using 1 for args.gradient_accumulation_steps. Since I'm using 2 GPUs with a batch size of 1 per GPU, my total batch size is 2. So num_examples (187) divided by the total batch size (2) gives 93, so num_update_steps_per_epoch is 93.

Also, there is one optimization step per batch according to the code (I think), which makes 93 steps in total per epoch (93 batches in an epoch). Each step here is a denoising step at a single sampled timestep, so one step in the training phase is basically the same kind of step as a sampling step (--num_sampling_steps) in the inference phase.
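
(To make the distinction concrete, a schematic sketch of the two loops; the names are illustrative and this is not the actual Open-Sora-Plan code. One optimization step during training runs a single denoising prediction at one randomly sampled timestep per batch, while sampling runs --num_sampling_steps sequential denoising passes for each video.)

import torch
import torch.nn.functional as F

# Schematic only: "model", "noise_scheduler" and "scheduler" are illustrative stand-ins.

def train_one_epoch(model, dataloader, optimizer, noise_scheduler):
    for latents, text_emb in dataloader:                  # 93 batches -> 93 optimization steps per epoch
        t = torch.randint(0, 1000, (latents.size(0),))    # one random diffusion timestep per sample
        noise = torch.randn_like(latents)
        noisy = noise_scheduler.add_noise(latents, noise, t)
        target = noise_scheduler.get_velocity(latents, noise, t)   # v_prediction target
        pred = model(noisy, t, text_emb)                  # ONE forward pass per optimization step
        loss = F.mse_loss(pred, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

def sample_video(model, text_emb, scheduler, latent_shape, num_sampling_steps=100):
    latents = torch.randn(latent_shape)
    scheduler.set_timesteps(num_sampling_steps)
    for t in scheduler.timesteps:                         # 100 forward passes for a single video
        pred = model(latents, t, text_emb)
        latents = scheduler.step(pred, t, latents).prev_sample
    return latents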