THUDM / CogVideo

Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

log video OOM #361

Open Eurus-Holmes opened 1 month ago

Eurus-Holmes commented 1 month ago

System Info

H100 (80GB)

Information

Reproduction

I can run the SAT SFT examples normally, but when I try to log videos to wandb to inspect training progress more closely, I hit an OOM. The only changes were to sft.yaml: I set only_log_video_latents: False and enabled wandb, and then got the OOM error below.
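For clarity, a minimal sketch of the edit described above (the key placement and the wandb key name are taken from the issue text, not verified against the shipped sft.yaml; adjust to wherever these keys live in your checkout):

```yaml
# sft.yaml (excerpt; placement assumed)
only_log_video_latents: False  # decode latents to pixel videos for logging
wandb: True                    # enable wandb logging (key name assumed)
```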

in log_video
[rank0]:     log["reconstructions"] = self.decode_first_stage(z).to(torch.float32)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.08 GiB. GPU 0 has a total capacity of 79.11 GiB of which 6.68 GiB is free. Process 1058461 has 72.39 GiB memory in use. Of the allocated memory 64.08 GiB is allocated by PyTorch, and 95.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
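As the allocator message itself suggests, expandable segments can be enabled via an environment variable before launching training. This is a documented PyTorch allocator option; it mitigates fragmentation but will not help if the decode genuinely needs more memory than is free:

```shell
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```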

Expected behavior

I have already set batch_size=1; I'm not sure what else needs to change to use log_video.

zRzRzRzRzRzRzR commented 1 month ago

The SFT example was run on 16 H100s, and the quoted number is the GPU memory cost on each of those GPUs, not on a single one. Are you fine-tuning the 5B model?

Eurus-Holmes commented 1 month ago

> The SFT example was run on 16 H100s, and the quoted number is the GPU memory cost on each of those GPUs, not on a single one. Are you fine-tuning the 5B model?

@zRzRzRzRzRzRzR Yes, fine-tuning the 5B model with full SFT, not LoRA. I tried setting only_log_video_latents: False to enable log_video, but got the OOM above.
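For anyone hitting the same wall: the spike comes from decoding the entire latent video in a single decode_first_stage call (the line in the traceback above). A minimal, hypothetical sketch of chunking that decode for logging only; decode_in_chunks, the frames_per_chunk parameter, and the (B, C, T, H, W) latent layout are illustrative assumptions, not the repo's API:

```python
import torch

@torch.no_grad()  # logging only, so no gradients are needed
def decode_in_chunks(model, z, frames_per_chunk=2):
    """Hypothetical helper: decode latents a few frames at a time to cap peak VRAM.

    Assumes z is laid out (B, C, T, H, W) and that model.decode_first_stage
    accepts temporally sliced latents; adapt the split axis to your checkout.
    """
    chunks = []
    for z_slice in z.split(frames_per_chunk, dim=2):
        # Move each decoded chunk to CPU right away so only one chunk's
        # activations occupy the GPU at a time.
        chunks.append(model.decode_first_stage(z_slice).to(torch.float32).cpu())
    return torch.cat(chunks, dim=2)
```

Note that the CogVideoX VAE is temporally causal, so naive slicing can introduce artifacts at chunk boundaries; the result is usually still good enough for eyeballing training progress in wandb.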