aigc-apps / CogVideoX-Fun

📹 A more flexible CogVideoX that can generate videos at any resolution and creates videos from images.
Apache License 2.0

OOM on H100 when finetuning the 5b inpaint model #56

Open Closertodeath opened 3 weeks ago

Closertodeath commented 3 weeks ago

Hello. I've been running into issues fine-tuning the 5b-inpaint model on an H100. Using DeepSpeed with your provided config causes the trainer to require 90 GB of VRAM. Not using DeepSpeed drops VRAM usage by quite a bit, but usage still spikes by 10-12 GB+ every other step. Could you please help me debug the issue? I've tried different datasets and config settings and keep hitting a wall. The config below has gotten me the farthest, but I still OOM around the 8th step. Is it not possible to fine-tune the 5b model on 80 GB of VRAM?

accelerate launch --mixed-precision="bf16" scripts/train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATASET_NAME \
  --train_data_meta=$DATASET_META_NAME \
  --image_sample_size=1280 \
  --video_sample_size=256 \
  --token_sample_size=512 \
  --video_sample_stride=3 \
  --video_sample_n_frames=49 \
  --train_batch_size=1 \
  --video_repeat=1 \
  --gradient_accumulation_steps=1 \
  --dataloader_num_workers=8 \
  --num_train_epochs=100 \
  --checkpointing_steps=50 \
  --learning_rate=2e-05 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --seed=42 \
  --output_dir="output_dir" \
  --gradient_checkpointing \
  --mixed_precision="bf16" \
  --adam_weight_decay=3e-2 \
  --adam_epsilon=1e-10 \
  --vae_mini_batch=1 \
  --max_grad_norm=0.05 \
  --random_hw_adapt \
  --training_with_video_token_length \
  --random_frame_crop \
  --enable_bucket \
  --use_ema \
  --train_mode="inpaint" \
  --trainable_modules "." \
  --low_vram \
  --use_8bit_adam
CacacaLalala commented 3 weeks ago

Same problem

ArEnSc commented 3 weeks ago

@yunkchen I am having the same problem too

bubbliiiing commented 3 weeks ago

Please use DeepSpeed

bubbliiiing commented 3 weeks ago

https://github.com/aigc-apps/CogVideoX-Fun/blob/4de9773025a621824172ff61b5b7963ac18fd0e5/scripts/README_TRAIN.md

ArEnSc commented 3 weeks ago

@bubbliiiing one more question

What do you mean by training at 512 by 512, and that the token length for a 512x512 image with 49 frames is 13,312?

I am assuming you mean a 512 by 512 image and not the latent? Or do you mean the image/video resolution, h and w? Because given the model's patch size p = 2, (512 / 2) * (512 / 2) * 49 is not 13,312.

Does this setting allow me to train on any resolution?

enable_bucket is used to enable bucket training. When enabled, the model does not crop the images and videos at the center, but instead, it trains the entire images and videos after grouping them into buckets based on resolution.

I am guessing this setting more or less crops.

training_with_video_token_length specifies training the model according to token length. The token length for a video with dimensions 512x512 and 49 frames is 13,312.
At 512x512 resolution, the number of video frames is 49;
At 768x768 resolution, the number of video frames is 21;
At 1024x1024 resolution, the number of video frames is 9;
These resolutions combined with their corresponding lengths allow the model to generate videos of different sizes.

How does image-to-video training work? I think it's effectively just the same as training text-to-video and can it be conditioned on inputs?

Closertodeath commented 2 weeks ago

Please use DeepSpeed

@bubbliiiing @yunkchen When I use DeepSpeed, the VRAM usage increases by 10 GB, as I noted in the OP. Is there something wrong with the default zero_stage2_config?

log26 commented 2 weeks ago

@Closertodeath I've encountered the same issue as well. Has it been resolved?

Closertodeath commented 2 weeks ago

@log26 Nope.

bubbliiiing commented 1 week ago

Please use DeepSpeed

@bubbliiiing @yunkchen When I use DeepSpeed, the VRAM usage increases by 10 GB, as I noted in the OP. Is there something wrong with the default zero_stage2_config?

Can you show me your parameters? I haven't encountered this.

bubbliiiing commented 1 week ago

What do you mean by training at 512 by 512, and that the token length for a 512x512 image with 49 frames is 13,312? Given the model's patch size p = 2, (512 / 2) * (512 / 2) * 49 is not 13,312.

How does image-to-video training work? I think it's effectively just the same as training text-to-video and can it be conditioned on inputs?

I didn't understand the question: "How does image-to-video training work? I think it's effectively just the same as training text-to-video and can it be conditioned on inputs?"

The token length of 512x512x49 is calculated as follows: (512 / 8 (VAE) / 2 (patch)) * (512 / 8 (VAE) / 2 (patch)) * ((49 - 1) / 4 + 1 (VAE)) = 32 * 32 * 13 = 13312
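
Expressed in code, the same calculation looks like this. This is just a small illustrative helper restating the formula above, not part of the repo's code; the labels (8 = VAE spatial downsampling, 2 = transformer patch size, 4 = VAE temporal compression) annotate the factors in the formula.

```python
# Illustrative helper, not from the repo's code: restates the token-length formula above.
def video_token_length(height: int, width: int, num_frames: int) -> int:
    # Spatial tokens: divide by the VAE spatial downsampling factor (8) and the patch size (2).
    spatial_tokens = (height // 8 // 2) * (width // 8 // 2)
    # Temporal tokens: the VAE compresses every 4 frames and keeps the first frame.
    temporal_tokens = (num_frames - 1) // 4 + 1
    return spatial_tokens * temporal_tokens

print(video_token_length(512, 512, 49))   # 32 * 32 * 13 = 13312
print(video_token_length(768, 768, 21))   # 48 * 48 * 6  = 13824
print(video_token_length(1024, 1024, 9))  # 64 * 64 * 3  = 12288
```

The 768x768/21-frame and 1024x1024/9-frame settings quoted from the README land close to, not exactly on, the 512x512/49-frame token budget.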

ArEnSc commented 1 week ago

The token length of 512x512x49 is calculated as follows: (512 / 8 (VAE) / 2 (patch)) * (512 / 8 (VAE) / 2 (patch)) * ((49 - 1) / 4 + 1 (VAE)) = 32 * 32 * 13 = 13312

What is the 8 value from the VAE? And the ((49 - 1) / 4 + 1 (VAE)) term?

I didn't understand the question: "How does image-to-video training work? I think it's effectively just the same as training text-to-video and can it be conditioned on inputs?"

I guess I don't understand how image-to-video training works. Are the first and last images given some noise and then the frames in between denoised, for image-to-video?

@bubbliiiing

ArEnSc commented 1 week ago

Oh I get it now thank you~~~