Has anyone trained successfully with CycleGAN unpaired?

WebsheetPlugin commented 2 months ago

I have been reading the open comments here, and I am growing skeptical that the training code works.

Does anyone have it working? Has anyone successfully trained the Zebra example?

GaParmar commented 2 months ago

I ran the unpaired training example again this morning and got the expected results.

This is the accelerate config file I use:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1,2,3
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

and this is the training command:

export NCCL_P2P_DISABLE=1
accelerate launch --main_process_port 29501 --config_file temp/config_4gpu_full.sh \
    src/train_cyclegan_turbo.py \
    --pretrained_model_name_or_path="stabilityai/sd-turbo" \
    --output_dir="output/cyclegan_turbo/my_horse2zebra" \
    --dataset_folder "data/my_horse2zebra" \
    --train_img_prep "resize_286_randomcrop_256x256_hflip" --val_img_prep "no_resize" \
    --learning_rate="1e-5" --max_train_steps=25000 \
    --train_batch_size=1 --gradient_accumulation_steps=1 \
    --report_to "wandb" --tracker_project_name "gparmar_unpaired_h2z_cycle_debug_v3" \
    --enable_xformers_memory_efficient_attention --validation_steps 250 \
    --lambda_gan 0.5 --lambda_idt 1 --lambda_cycle 1

Please let me know if you are unable to obtain similar results!

-Gaurav

tim-kuechler commented 2 months ago

I was sceptical at first too but I also managed to reproduce the results. I did not dive too deep into it but here are my current findings:

- Training seems very seed-dependent. With exactly the same command and setup but different seeds, I got completely different results (one run diverging, one converging).
- Depending on the dataset, the cycle loss might be too high. For my dataset the cycle loss was way too high, leading to perfect reconstruction but complete nonsense in the translation.
- Training with batch size 8 leads to much better convergence and results than batch size 1.
- I'm not sure about this point, or whether there is an error in the gradient accumulation, but in my experiments a native batch size of 8 yields much better results than accumulating 8 batches of size 1, which is weird.
- It might take a couple of thousand steps to see the first good translation effects.
- The recently merged bugfix that adds zero_grad for the discriminator definitely helps to get good results (although training also works okay without it).
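On the gradient accumulation point: for a loss that is a plain mean over samples, and a model without batch-dependent layers, accumulating 8 micro-batches of size 1 (each scaled by its share of the full batch) should give the same gradient as one native batch of 8. Below is a minimal pure-Python sketch with a toy one-parameter linear model. The model, the data and the function names are illustrative assumptions, not code from this repository.

```python
# Toy model: y_hat = w * x, loss = mean((w * x - y)^2) over the batch.

def grad_native(w, xs, ys):
    """Gradient of the mean squared error with respect to w, over the whole batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def grad_accumulated(w, xs, ys, micro_batch_size=1):
    """Sum micro-batch gradients, weighting each by its share of the full batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Each micro-batch gradient is a mean over its own samples, so it is
        # weighted by len(mx) / n to reproduce the mean over the full batch.
        total += grad_native(w, mx, my) * len(mx) / n
    return total


# Hypothetical data: 8 samples, i.e. one native batch of 8.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]
w = 0.3

g_native = grad_native(w, xs, ys)
g_accum = grad_accumulated(w, xs, ys, micro_batch_size=1)
print(abs(g_native - g_accum) < 1e-9)
```

If native and accumulated batches behave differently in a real run, plausible places to look (not verified here) are how the loss is scaled per micro-batch, when each optimizer is stepped and zeroed, and anything in the model or losses that depends on batch statistics.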

The only thing that really confuses me is the high VRAM usage. @GaParmar, you stated somewhere that you trained the 512x512 models on a GPU with 48GB of RAM. In my experiments I could not come close to this value: on an H100 it needs ~56GB of RAM for a batch size of 1. I'm also very confused why gradient accumulation, xformers and TF32 have nearly no effect, reducing the occupied RAM by at most 5-10%, which is much less than I see when training other SD models. Finally, training 512x models on an H100 with a batch size of 1 (with gradient accumulation of 8) would take about 12 days to reach 25000 steps, which seems insane to me. @GaParmar, are these observations at least close to what you found when training 512x models? Or is there some trick to reduce the VRAM usage and training time?
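For reference, the 12-day estimate can be turned into a per-step time. This is only back-of-the-envelope arithmetic on the numbers quoted above, and it assumes that the 25000 steps count optimizer updates rather than micro-batches; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_step(total_days, optimizer_steps, accumulation_steps=1):
    """Return (seconds per optimizer step, seconds per micro-batch)."""
    total_seconds = total_days * SECONDS_PER_DAY
    per_optimizer_step = total_seconds / optimizer_steps
    # Assumption: each optimizer step runs `accumulation_steps` micro-batches.
    per_micro_batch = per_optimizer_step / accumulation_steps
    return per_optimizer_step, per_micro_batch


per_step, per_micro = seconds_per_step(12, 25000, accumulation_steps=8)
print(f"{per_step:.1f} s per optimizer step, {per_micro:.1f} s per micro-batch")
```

Under that assumption this works out to roughly 41.5 seconds per optimizer step, or about 5.2 seconds per forward/backward pass at batch size 1.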

tfriedel commented 2 months ago

@tim-kuechler interesting observations! I'm currently in the middle of a training run on a custom dataset. I'm trying to do blurring/deblurring. So far the results look good. I train on 4 GPUs (3090 with 24GB RAM) at resolution 256x256 with gradient accumulation of 2, so the effective batch size is 8. Each GPU is using 19.5 GB. Unfortunately I can't go to 448 or 512 with the VRAM I have.

I wonder if I can take a model trained on 256 and use it as is for higher resolutions. If not, maybe finetuning it with a method that doesn't tune the full weights will work? I regularly train vision transformers to good performance by only training the attention layers and top layers. On the other hand, LoRA adapters are already used here, so maybe there are no more memory gains to be had?

Examples: [two example images]

GaParmar commented 2 months ago

Hi @tfriedel

Based on your dataset and task, you can try training your model on random crops during training time and full resolution at test time. This should enable you to maintain high resolution outputs.
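The random-crop idea can be sketched as follows: sample a fixed-size window from each full-resolution training image, and skip the cropping at test time. This is a minimal illustration, not the repository's data pipeline; `random_crop_box` is a hypothetical helper, and the image and crop sizes are made up.

```python
import random


def random_crop_box(width, height, crop_size, rng=random):
    """Pick a random crop_size x crop_size window inside a width x height image.

    Returns (left, top, right, bottom) in pixel coordinates.
    """
    if crop_size > width or crop_size > height:
        raise ValueError("crop_size must not exceed the image dimensions")
    left = rng.randint(0, width - crop_size)   # randint is inclusive on both ends
    top = rng.randint(0, height - crop_size)
    return left, top, left + crop_size, top + crop_size


# Training: a different 256x256 window of a (hypothetical) 1024x768 image each time.
rng = random.Random(0)
box = random_crop_box(1024, 768, 256, rng)
print(box)

# Test time: no crop, the model sees the full-resolution image.
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects. The `--train_img_prep` value in the command earlier in this thread suggests the training script already has a built-in random-crop option, so a custom helper like this may only be needed for other crop sizes.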

swold99 commented 2 months ago

Hi @GaParmar, the point @tim-kuechler brought up is something that is really hindering my training. It seems that vae.encode() and vae.decode() in the forward pass of Pix2Pix_Turbo consume enormous amounts of memory, which makes it impossible for me to train with a batch size larger than five on an A100 with 80 GB RAM.

I have followed the guide on how to train pix2pix_turbo on my own paired dataset. Please describe how you can train with significantly higher batch size.

tim-kuechler commented 2 months ago

@swold99 That low batch size is expected. I'm curious how you found out that the high memory usage is due to the VAE.

To increase the effective batch size you can use gradient accumulation (--gradient_accumulation_steps x). For example, a native batch size of 4 with gradient accumulation of 2 gives an effective batch size of 8, although it also doubles your training time. Normally there should be no difference between a native batch size of x and an accumulated batch size of x, at least not in model performance.
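As a quick sanity check, the effective batch size is the product of the per-device batch size, the accumulation steps and the number of processes (GPUs). A small sketch using the numbers mentioned in this thread follows; the function name is made up for illustration, and the per-GPU batch size of 1 in the second call is inferred from the stated totals rather than given explicitly.

```python
def effective_batch_size(per_device_batch, accumulation_steps=1, num_processes=1):
    """Number of samples contributing to one optimizer update."""
    return per_device_batch * accumulation_steps * num_processes


# Native batch size 4 with gradient accumulation 2 on one GPU.
print(effective_batch_size(4, accumulation_steps=2))

# 4 GPUs with gradient accumulation 2 (per-GPU batch size 1 assumed).
print(effective_batch_size(1, accumulation_steps=2, num_processes=4))

# The 4-GPU accelerate config above with --train_batch_size=1 and
# --gradient_accumulation_steps=1.
print(effective_batch_size(1, accumulation_steps=1, num_processes=4))
```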

For my training runs I can use a native batch size of 8 because I use a version of the H100 with 100GB of VRAM.

swold99 commented 2 months ago

@tim-kuechler I printed the memory usage in many places in the code and noticed that allocated and reserved GPU memory increased by around 30 GB (batch_size=5) after both vae.encode() and vae.decode().
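A measurement like this can be wrapped in a small context manager. The sketch below is not the code that was actually used; `gpu_memory_gib` and `track_memory` are hypothetical helpers built on PyTorch's `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`, and they report zeros when PyTorch or a GPU is not available.

```python
import contextlib

try:
    import torch  # optional: the helpers fall back to zeros without it
except ImportError:
    torch = None


def gpu_memory_gib():
    """Return (allocated, reserved) GPU memory in GiB, or zeros without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return 0.0, 0.0
    torch.cuda.synchronize()  # wait for pending kernels before reading counters
    gib = 1024 ** 3
    return torch.cuda.memory_allocated() / gib, torch.cuda.memory_reserved() / gib


@contextlib.contextmanager
def track_memory(label, log):
    """Append (label, allocated delta, reserved delta) for the wrapped block."""
    before = gpu_memory_gib()
    yield
    after = gpu_memory_gib()
    log.append((label, after[0] - before[0], after[1] - before[1]))


log = []
with track_memory("example block", log):
    pass  # hypothetical usage: wrap the vae.encode(...) or vae.decode(...) call here

for label, d_alloc, d_reserved in log:
    print(f"{label}: allocated {d_alloc:+.2f} GiB, reserved {d_reserved:+.2f} GiB")
```

Note that reserved memory includes PyTorch's caching allocator pool, so it can stay high after tensors are freed; the allocated figure is usually the better indicator of what a given call actually holds on to.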

YijiFeng commented 2 months ago

Hi @GaParmar, I copied your setup exactly, including the accelerate config and the training command, but I ran into the problems shown in the screenshot below. Could you please help me analyze them? Thank you very much! [Screenshot 2024-09-26 15-37-12.png, upload did not complete]

Jumponthemoon commented 1 month ago

Same here. I can train with batch size 1 and image size 256x256 on an RTX A6000 with 50GB of memory. It consumes 26GB of memory. When I try to increase the resolution to 512, it never succeeds. The largest resolution I can use is 384x384.

GaParmar / img2img-turbo

Has anyone trained successfully with CycleGAN unpaired? #87