junsukha opened 4 days ago
I think that's normal; it normally takes ~4 s per step for 93x352x640, and your videos are shorter.
@LinB203 thx for the reply! 4 s per step for 93x352x640? I see.
In my case, one step in the training phase means processing two videos (data samples) simultaneously, since I'm using two GPUs with a batch size of 1 per GPU. So it takes about 2.x seconds per step, or per video (240 s / 93 steps ≈ 2.6 s per step, since 1 epoch of 93 steps takes only around 4 minutes, as I mentioned before). (If this doesn't make sense, just ignore it; I may have explained it poorly. Please correct me if I'm wrong.)
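A quick back-of-the-envelope check of those numbers (just a sketch using the figures quoted in this thread):

# Sanity check of the timing above (all numbers are the ones quoted in this thread).
epoch_seconds = 4 * 60            # ~4 minutes per epoch
steps_per_epoch = 93              # "Total optimization steps ... = 93"
videos_per_step = 2 * 1           # 2 GPUs x batch size 1 per GPU

sec_per_step = epoch_seconds / steps_per_epoch     # ~2.58 s per optimizer step
sec_per_video = sec_per_step / videos_per_step     # ~1.29 s per video within a step
print(f"{sec_per_step:.2f} s/step, {sec_per_video:.2f} s/video")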
My question is: when I sample a video using the config below, it normally takes around 1 to 2 minutes to generate one video. Why does inference take so long compared to the training phase, where, I think, it takes about 2.x seconds per video?
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --master_port 29514 \
-m opensora.sample.sample \
--model_path path_to_check_point_model_ema \
--version v1_3 \
--num_frames 33 \
--height 352 \
--width 640 \
--cache_dir "../cache_dir" \
--text_encoder_name_1 "/storage/ongoing/new/Open-Sora-Plan/cache_dir/mt5-xxl" \
--text_prompt "examples/prompt.txt" \
--ae WFVAEModel_D8_4x8x8 \
--ae_path "/storage/lcm/WF-VAE/results/latent8" \
--save_img_path "./train_1_3_nomotion_fps18" \
--fps 16 \
--guidance_scale 7.5 \
--num_sampling_steps 100 \
--max_sequence_length 512 \
--sample_method EulerAncestralDiscrete \
--seed 1234 \
--num_samples_per_prompt 1 \
--rescale_betas_zero_snr \
--prediction_type "v_prediction"
You can see --num_sampling_steps 100, which means it uses 100 denoising steps to generate each video.
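As a rough, illustrative cost model (a sketch, not the actual sampler code; the per-step time below is a hypothetical figure, not a measurement), generation time is dominated by running the denoiser once per sampling step:

# Illustrative cost model for inference time. Hypothetical per-step cost; real runs
# also pay for the text encoder, the VAE decode, and classifier-free guidance,
# which roughly doubles the denoiser work when guidance_scale > 1.
num_sampling_steps = 100          # --num_sampling_steps from the command above
seconds_per_denoiser_step = 0.8   # hypothetical figure for one 33x352x640 video step

estimated_total = num_sampling_steps * seconds_per_denoiser_step
print(f"~{estimated_total:.0f} s per video for the denoising loop alone")  # ~80 s, i.e. the 1-2 min range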
@LinB203 thx for the reply! num_sampling_steps is, I believe, the number of denoising steps. But doesn't the step in the training phase have a different meaning? The step I'm referring to in the training phase is this:
That's the fine-tuning progress bar.
So you're saying one step in the training phase (the progress bar in the image I attached above) is basically one denoising step, just like the sampling step (parameter --num_sampling_steps) in the inference phase?
The output Total optimization steps (num_update_steps_per_epoch) = 93 means, I think, that it takes 93 steps to use all the input videos once for training. If step here refers to a denoising step, that doesn't make sense to me, because it would mean only 93 denoising steps are needed while using all the input videos once for training?
Oh, I think I got it. You're right.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
This line from train_t2v_diffusers.py says that num_update_steps_per_epoch is the number of batches (len(train_dataloader)), since I'm using 1 for args.gradient_accumulation_steps.
Since I'm using 2 GPUs and a batch size of 1 per GPU, my total batch size is 2. So num_examples (183) / total batch size (2) gives 93, and therefore num_update_steps_per_epoch is 93.
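A small sketch of that bookkeeping (mirroring the formula from train_t2v_diffusers.py quoted above; the example count is the one from my log, and the exact step count depends on how the distributed sampler splits and pads the dataset):

import math

# Numbers from my run; the formula mirrors train_t2v_diffusers.py.
num_examples = 183
total_batch_size = 2 * 1                      # 2 GPUs x batch size 1 per GPU
gradient_accumulation_steps = 1

batches_per_epoch = math.ceil(num_examples / total_batch_size)   # ~len(train_dataloader)
num_update_steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
print(num_update_steps_per_epoch)  # 92 with these exact numbers; my log reports 93, so the
                                   # effective example count or the sampler's rounding differs slightly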
Also, there's one training step per batch according to the code (I think), which makes 93 steps in total per epoch (93 batches per epoch). One step here is a denoising step at a given timestep. So one step in the training phase is basically the same kind of step as in the sampling phase (--num_sampling_steps).
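For what it's worth, here is a minimal sketch (generic pseudocode in the style of diffusers schedulers, not the actual Open-Sora-Plan code; the model call signature is made up) of why the per-video cost differs so much: a training step runs the denoiser once per batch at one random timestep, while sampling runs it num_sampling_steps times per video.

import torch
import torch.nn.functional as F

def training_step(model, optimizer, latents, text_emb, scheduler):
    # One optimizer step = ONE denoiser forward (+ backward) at a random timestep.
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = model(noisy, t, text_emb)              # hypothetical signature
    loss = F.mse_loss(pred, noise)                # epsilon target shown for brevity;
    loss.backward()                               # the run above uses v_prediction
    optimizer.step()
    optimizer.zero_grad()

@torch.no_grad()
def sample(model, text_emb, scheduler, latent_shape, num_sampling_steps=100):
    # Inference = num_sampling_steps denoiser forwards per video (roughly double
    # that with classifier-free guidance), hence the 1-2 minutes per video.
    latents = torch.randn(latent_shape)
    scheduler.set_timesteps(num_sampling_steps)
    for t in scheduler.timesteps:
        model_out = model(latents, t, text_emb)
        latents = scheduler.step(model_out, t, latents).prev_sample
    return latents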
Hi,
I'm fine-tuning v1.3 any93x640x640 (https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/any93x640x640) with videos of 352x640 (height x width) at fps 16.
I see that 1 epoch (93 steps) takes only around 4 minutes. Is this expected? It seems too short to me. I'm using 2 A100 GPUs with a batch size of 1 per GPU.
Below I provide part of the JSON that contains the video data.
Below is part of the terminal output during the training process.
Below are the arguments I used: