Open WebsheetPlugin opened 2 months ago
I ran the unpaired training example again this morning and got the expected results.
This is the accelerate config file I use:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1,2,3
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
and this is the training command:
export NCCL_P2P_DISABLE=1
accelerate launch --main_process_port 29501 --config_file temp/config_4gpu_full.sh \
src/train_cyclegan_turbo.py \
--pretrained_model_name_or_path="stabilityai/sd-turbo" \
--output_dir="output/cyclegan_turbo/my_horse2zebra" \
--dataset_folder "data/my_horse2zebra" \
--train_img_prep "resize_286_randomcrop_256x256_hflip" --val_img_prep "no_resize" \
--learning_rate="1e-5" --max_train_steps=25000 \
--train_batch_size=1 --gradient_accumulation_steps=1 \
--report_to "wandb" --tracker_project_name "gparmar_unpaired_h2z_cycle_debug_v3" \
--enable_xformers_memory_efficient_attention --validation_steps 250 \
--lambda_gan 0.5 --lambda_idt 1 --lambda_cycle 1
Please let me know if you are unable to obtain similar results!
-Gaurav
I was sceptical at first too but I also managed to reproduce the results. I did not dive too deep into it but here are my current findings:
The only thing that really confuses me is the high VRAM usage. @GaParmar you stated somewhere that you trained the 512x512 models on a GPU with 48GB of RAM. In my experiments I could not come close to that value: on an H100, training needs ~56GB of RAM at a batch size of 1. I'm also very confused why gradient accumulation, xformers and TF32 have nearly no effect, reducing the occupied RAM by at most 5-10%, which is much less than what I see when training other SD models. Finally, training 512x models on an H100 with BS 1 (with gradient accumulation of 8) would take about 12 days to reach 25000 steps, which seems insane to me. @GaParmar Are these observations at least close to what you found when training 512x models? Or is there some trick to reduce the VRAM usage and training time?
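To put the 12-day figure in perspective, here is a rough back-of-the-envelope calculation. It assumes that each of the 25000 steps is one optimizer update made of 8 accumulated forward/backward passes at batch size 1; that reading of the step count is an assumption, not something confirmed in this thread.

```python
# Back-of-the-envelope timing from the numbers quoted above.
# Assumption: one "step" = one optimizer update = 8 accumulated passes.
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_pass(days: float, optimizer_steps: int, accumulation: int) -> float:
    """Average wall-clock seconds per forward/backward pass."""
    total_passes = optimizer_steps * accumulation
    return days * SECONDS_PER_DAY / total_passes


per_pass = seconds_per_pass(days=12, optimizer_steps=25_000, accumulation=8)
per_step = per_pass * 8  # 8 passes are accumulated into each optimizer step

print(f"{per_pass:.2f} s per forward/backward pass")
print(f"{per_step:.2f} s per optimizer step")
```

Under that assumption the run averages roughly 5 seconds per single-image pass, which gives a concrete number to compare against other setups.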
@tim-kuechler interesting observations! I'm currently in the middle of a training run on a custom dataset, trying to do blurring/deblurring. So far the results look good. I train on 4 GPUs (3090s with 24GB of RAM each) at a resolution of 256x256 with gradient accumulation of 2, so the effective batch size is 8. Each GPU uses 19.5 GB. Unfortunately I can't go to 448 or 512 with the VRAM I have.
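The effective batch size mentioned above is just a product of three numbers. A minimal sketch follows; the per-GPU batch size of 1 is inferred from the stated total of 8 and is not given explicitly in the comment.

```python
def effective_batch_size(num_processes: int, per_device_batch: int,
                         accumulation_steps: int) -> int:
    """Samples that contribute to a single optimizer update."""
    return num_processes * per_device_batch * accumulation_steps


# 4 GPUs, per-GPU batch of 1 (inferred), gradient accumulation of 2
print(effective_batch_size(num_processes=4, per_device_batch=1, accumulation_steps=2))
```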
I wonder if I can take a model trained at 256 and use it as is at higher resolutions. If not, maybe fine-tuning it with a method that doesn't tune the full weights would work? I regularly train vision transformers to good performance by training only the attention layers and the top layers. On the other hand, LoRA adapters are already used here, so maybe there are no more memory gains to be had?
Hi @tfriedel
Based on your dataset and task, you can try training your model on random crops and then running it at full resolution at test time. This should let you keep high-resolution outputs.
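As a rough illustration of the random-crop idea, the sketch below only picks the crop coordinates. The function name `random_crop_box` and the 1024x768 image size are hypothetical; the returned tuple uses the (left, upper, right, lower) order that image libraries such as Pillow accept for cropping.

```python
import random


def random_crop_box(width: int, height: int, crop_size: int,
                    rng: random.Random) -> tuple:
    """Return (left, upper, right, lower) for a random square crop."""
    if crop_size > width or crop_size > height:
        raise ValueError("crop_size must fit inside the image")
    left = rng.randint(0, width - crop_size)   # randint is inclusive on both ends
    upper = rng.randint(0, height - crop_size)
    return (left, upper, left + crop_size, upper + crop_size)


rng = random.Random(0)  # seeded only to make the example repeatable
box = random_crop_box(width=1024, height=768, crop_size=256, rng=rng)
print(box)
```

At training time every sample would get a fresh 256x256 box, so the memory cost stays at the 256x256 level, while at test time the whole image is passed through unchanged.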
Hi @GaParmar, the point @tim-kuechler brought up is really hindering my training. It seems that vae.encode() and vae.decode() in the forward pass of Pix2Pix_Turbo consume enormous amounts of memory, which makes it impossible for me to train with a batch size larger than five on an A100 with 80 GB of RAM.
I have followed the guide on how to train pix2pix_turbo on my own paired dataset. Could you please describe how you train with a significantly higher batch size?
@swold99 That low batch size is expected. I'm curious how you found out that the high memory usage is due to the VAE.
To increase the effective batch size you can use gradient accumulation (--gradient_accumulation_steps x). For example, a native batch size of 4 with gradient accumulation of 2 gives an effective batch size of 8, although it also doubles the training time for the same number of steps. Normally there should be no difference between a native batch size of x and an accumulated batch size of x, at least not in model performance.
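The claim that a native and an accumulated batch behave the same can be checked on a toy problem. This is a minimal sketch with a one-parameter linear model and made-up data, not the training code from the repository: it compares the gradient of a mean-squared-error loss over a full batch of 8 with the average of the gradients over two micro-batches of 4.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """d/dw of mean((w * x - y) ** 2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up data, purely for illustration
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2, 6.8, 8.1]
w = 0.3

# Native batch of 8: one gradient over all samples
full = grad_mse(w, xs, ys)

# Accumulated: two micro-batches of 4, gradients averaged before the update
micro = 4
chunks = [(xs[i:i + micro], ys[i:i + micro]) for i in range(0, len(xs), micro)]
accumulated = sum(grad_mse(w, cx, cy) for cx, cy in chunks) / len(chunks)

print(abs(full - accumulated) < 1e-9)
```

The equivalence holds for equally sized micro-batches and a mean-reduced loss; layers that depend on batch statistics, such as batch normalization, can break it. Accumulation also does not shrink the memory needed for a single forward/backward pass, since each micro-batch still runs at the native batch size.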
For my training runs I can use a native batch size of 8 because I use a version of the H100 with 100GB of VRAM.
@tim-kuechler I printed the memory usage in many places in the code and noticed that allocated and reserved GPU memory increased by around 30 GB (batch_size=5) after both vae.encode() and vae.decode().
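A measurement like the one described above could look roughly like the following sketch. It uses PyTorch's `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`, which report bytes; the helper names, and the `vae` and `x` in the commented usage, are placeholders rather than code from the repository.

```python
def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3


def cuda_memory_gib() -> dict:
    """Snapshot of allocated and reserved CUDA memory in GiB (needs a GPU)."""
    import torch  # imported lazily so the other helpers work without PyTorch
    torch.cuda.synchronize()  # make sure pending kernels have finished
    return {
        "allocated": to_gib(torch.cuda.memory_allocated()),
        "reserved": to_gib(torch.cuda.memory_reserved()),
    }


def report_delta(tag: str, before: dict, after: dict) -> str:
    """Format the change between two snapshots."""
    parts = [f"{key} {after[key] - before[key]:+.2f} GiB"
             for key in ("allocated", "reserved")]
    return f"[{tag}] " + ", ".join(parts)


# Intended use inside a training step (`vae` and `x` are placeholders):
# before = cuda_memory_gib()
# latents = vae.encode(x)
# after = cuda_memory_gib()
# print(report_delta("vae.encode", before, after))
```

Allocated memory counts live tensors, while reserved memory also includes blocks that PyTorch's caching allocator is holding for reuse, so the two numbers can differ noticeably.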
Hi @GaParmar, I copied your setup exactly, including the accelerate config and the training command, but I ran into the problems shown in the screenshot. Could you please help me analyze them? Thank you very much! [Screenshot 2024-09-26 15-37-12.png; the upload did not complete]
Same here. I can train with batch size 1 and image size 256x256 on an RTX A6000 with 50GB of memory; it consumes 26GB. When I try to increase the resolution to 512, it never succeeds. The largest resolution I could use is 384x384.
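One way to see why 512x512 is so much harder than 256x256 is to compare pixel counts. The sketch below rests on the assumption that activation memory grows roughly in proportion to the number of pixels; that is a general rule of thumb, not a measured property of this code, and fixed costs such as model weights and optimizer state do not scale this way.

```python
def pixel_ratio(side: int, base_side: int = 256) -> float:
    """How many times more pixels a square image has than the baseline."""
    return (side / base_side) ** 2


for side in (256, 384, 512):
    print(f"{side}x{side}: {pixel_ratio(side):.2f}x the pixels of 256x256")
```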
I have been reading the open comments here, and I am growing skeptical that the training code works.
Has anyone got it working? Has anyone successfully trained the zebra example?