VAE Training get CUDA OOM error

spacegoing commented 1 month ago

torchrun \
    --nnodes=1 --nproc_per_node=8 \
    --master_addr=localhost \
    --master_port=29600 \
    /workspace/Open-Sora-Plan/opensora/train/train_causalvae.py \
    --video_path /workspace/host_folder/mycogvx/sat/pre/tmpbos/class5_10k_split \
    --eval_video_path /workspace/host_folder/mycogvx/sat/pre/tmpbos/m5_6k/ \
    --eval_batch_size 1 \
    --eval_subset_size 1 \
    --mix_precision bf16 \
    --exp_name ${EXP_NAME} \
    --model_config scripts/config.json \
    --resolution 320 \
    --epochs 1000 \
    --num_frames 25 \
    --batch_size 1 \
    --disc_start 2000 \
    --save_ckpt_step 2000 \
    --eval_steps 500 \
    --eval_num_frames 33 \
    --eval_sample_rate 3 \
    --eval_lpips \
    --ema \
    --ema_decay 0.999 \
    --perceptual_weight 1.0 \
    --loss_type l1 \
    --disc_cls opensora.models.causalvideovae.model.losses.LPIPSWithDiscriminator3D \
    --not_resume_training_process \
    --pretrained_model_name_or_path /workspace/public/models/Open-Sora-Plan-v1.2.0/vae

Above is my training script. Although batch_size is 1 for both training and evaluation, with 8 GPUs, CUDA still reports an out-of-memory error:

[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.22 GiB. GPU 2 has a total capacity of 79.33 GiB of which 688.00 MiB is free. Process 3711898 has 78.64 GiB memory in use. Of the allocated memory 75.65 GiB is allocated by PyTorch, and 1.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
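
The error message itself names one setting to try. A minimal sketch, assuming the same launching shell as the torchrun command: export the allocator option so that every worker process inherits it. Since only 1.78 GiB is reserved but unallocated here, fragmentation is probably not the main cause, so this may not be enough on its own.

```shell
# Allocator hint quoted from the error message; exporting it in the
# launching shell lets the processes started by torchrun inherit it.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Confirm the value that child processes will see.
echo "PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF}"

# Then launch training with the same torchrun command as before.
```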

qqingzheng commented 1 month ago

Reduce the resolution to 256.
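
As a rough illustration of why this helps, the sketch below compares the number of input pixels per sample at the two resolutions. It assumes that --resolution produces a square crop (an assumption, not checked against the dataset code) and uses --num_frames 25 from the script. Activation memory in a convolutional network grows roughly with this count, so the figure is only a ballpark.

```shell
# Rough per-sample pixel count. Assumes --resolution gives a square
# H x W crop (hypothetical; verify in the dataset code) and 25 frames.
frames=25
pixels() { echo $(( frames * $1 * $1 )); }

old=$(pixels 320)   # 25 * 320 * 320
new=$(pixels 256)   # 25 * 256 * 256

# Integer percentage of the original pixel count that remains at 256.
pct=$(( 100 * new / old ))
echo "320: ${old} pixels, 256: ${new} pixels, ${pct}% of the original"
```

Under these assumptions, dropping from 320 to 256 leaves 64% of the input pixels per sample.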

spacegoing commented 1 month ago

> Reduce the resolution to 256.

It works, many thanks!

BTW, how do I train at a higher resolution? From the README I thought Open-Sora-Plan implemented sequence parallelism, but I couldn't find it in the code.

qqingzheng commented 1 month ago

The VAE's convolutional structure extrapolates well to higher resolutions, so we only trained it at low resolutions. Sequence parallel (SP) training was used for the diffusion model.

spacegoing commented 1 month ago

> The VAE's convolutional structure extrapolates well to higher resolutions, so we only trained it at low resolutions. Sequence parallel (SP) training was used for the diffusion model.

Many thanks for your reply!

PKU-YuanGroup / Open-Sora-Plan

VAE Training get CUDA OOM error #432