THUDM / CogVideo

Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

log video OOM #361

Open Eurus-Holmes opened 1 month ago

Eurus-Holmes commented 1 month ago

System Info

H100 (80GB)

Information

Reproduction

I can run the SAT SFT examples normally, but when I try to log videos to wandb to inspect training progress more closely, I hit an OOM. The only changes were to sft.yaml: I set only_log_video_latents: False and enabled wandb, and then got the OOM error below.
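For clarity, a minimal sketch of the edit described above (the key placement and the wandb key name are taken from the issue text, not verified against the shipped sft.yaml; adjust to wherever these keys live in your checkout):

```yaml
# sft.yaml (excerpt; placement assumed)
only_log_video_latents: False  # decode latents to pixel videos for logging
wandb: True                    # enable wandb logging (key name assumed)
```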

in log_video
[rank0]:     log["reconstructions"] = self.decode_first_stage(z).to(torch.float32)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.08 GiB. GPU 0 has a total capacity of 79.11 GiB of which 6.68 GiB is free. Process 1058461 has 72.39 GiB memory in use. Of the allocated memory 64.08 GiB is allocated by PyTorch, and 95.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
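As the allocator message itself suggests, expandable segments can be enabled via an environment variable before launching training. This is a documented PyTorch allocator option; it mitigates fragmentation but will not help if the decode genuinely needs more memory than is free:

```shell
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```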

Expected behavior

I have already set batch_size=1; I'm not sure what else needs to change to use log_video.

zRzRzRzRzRzRzR commented 1 month ago

The SFT example was run on 16 H100s, and the quoted number is the GPU memory cost on each of those GPUs, not on a single one. Are you fine-tuning the 5B model?

Eurus-Holmes commented 1 month ago

> The SFT example was run on 16 H100s, and the quoted number is the GPU memory cost on each of those GPUs, not on a single one. Are you fine-tuning the 5B model?

@zRzRzRzRzRzRzR Yes, fine-tuning the 5B model with full SFT, not LoRA. I tried setting only_log_video_latents: False to enable log_video, but got the OOM above.
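For anyone hitting the same wall: the spike comes from decoding the entire latent video in a single decode_first_stage call (the line in the traceback above). A minimal, hypothetical sketch of chunking that decode for logging only; decode_in_chunks, the frames_per_chunk parameter, and the (B, C, T, H, W) latent layout are illustrative assumptions, not the repo's API:

```python
import torch

@torch.no_grad()  # logging only, so no gradients are needed
def decode_in_chunks(model, z, frames_per_chunk=2):
    """Hypothetical helper: decode latents a few frames at a time to cap peak VRAM.

    Assumes z is laid out (B, C, T, H, W) and that model.decode_first_stage
    accepts temporally sliced latents; adapt the split axis to your checkout.
    """
    chunks = []
    for z_slice in z.split(frames_per_chunk, dim=2):
        # Move each decoded chunk to CPU right away so only one chunk's
        # activations occupy the GPU at a time.
        chunks.append(model.decode_first_stage(z_slice).to(torch.float32).cpu())
    return torch.cat(chunks, dim=2)
```

Note that the CogVideoX VAE is temporally causal, so naive slicing can introduce artifacts at chunk boundaries; the result is usually still good enough for eyeballing training progress in wandb.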