XLabs-AI / x-flux


A100 80GB lora training out of memory #12

Open leonary opened 2 months ago

leonary commented 2 months ago

LoRA training runs out of video memory on a single A100 80GB GPU. Any help would be much appreciated. LoRA rank is 16 and batch size is 1.

11.206656 parameters
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
08/08/2024 04:37:22 - INFO - __main__ - ***** Running training *****
08/08/2024 04:37:22 - INFO - __main__ -   Num Epochs = 300
08/08/2024 04:37:22 - INFO - __main__ -   Instantaneous batch size per device = 1
08/08/2024 04:37:22 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
08/08/2024 04:37:22 - INFO - __main__ -   Gradient Accumulation steps = 2
08/08/2024 04:37:22 - INFO - __main__ -   Total optimization steps = 300
Checkpoint 'latest' does not exist. Starting a new training run.
Steps:   0%|                                                                                                                                                                                                                                          | 0/300 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/root/autodl-tmp/x-flux/train_flux_lora_deepspeed.py", line 301, in <module>
    main()
  File "/root/autodl-tmp/x-flux/train_flux_lora_deepspeed.py", line 231, in main
    model_pred = dit(img=x_t.to(weight_dtype),
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/autodl-tmp/x-flux/src/flux/model.py", line 213, in forward
    img = block(img, vec=vec, pe=pe)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/autodl-tmp/x-flux/src/flux/modules/layers.py", line 332, in forward
    qkv, mlp = torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/autodl-tmp/x-flux/wandb/offline-run-20240808_043714-dhyc07rv
wandb: Find logs at: ./wandb/offline-run-20240808_043714-dhyc07rv/logs
Traceback (most recent call last):
  File "/root/miniconda3/envs/3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/3.10/bin/python', 'train_flux_lora_deepspeed.py', '--config', 'train_configs/test_lora.yaml']' returned non-zero exit status 1.
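
For reference, a minimal diagnostic sketch (not part of the x-flux scripts; the placement shown in the comment is hypothetical) to log how much memory is allocated vs. reserved right before the failing forward pass:

import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    # Bytes currently held by tensors vs. bytes reserved by the caching allocator,
    # plus the free/total numbers the driver reports for the whole GPU.
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    reserved = torch.cuda.memory_reserved(device) / 1024**3
    free, total = torch.cuda.mem_get_info(device)
    print(f"[{tag}] allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB "
          f"free={free / 1024**3:.2f}/{total / 1024**3:.2f} GiB")

# Hypothetical placement, right before the call that fails:
# log_cuda_memory("before dit forward")
# model_pred = dit(img=x_t.to(weight_dtype), ...)
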
filliptm commented 2 months ago

Same here.

A6000, CUDA OOM at rank 4 and 512 resolution. Something's whack here.

m-pektas commented 2 months ago

I tried LoRA training with the flux-schnell and flux-dev models using train batch size 1, gradient_accumulation_steps 4, and rank 2 on a 40GB A100, but it still raised a CUDA out of memory exception.

philz1337x commented 2 months ago

same here

thavocado commented 2 months ago

You need to run accelerate config first to configure DeepSpeed using the settings in the README, or make an accelerate_config.yaml for a single GPU:

compute_environment: LOCAL_MACHINE
debug: false                                                                                               
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch using:

accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"

With default settings in the example files this requires 42,837 MiB VRAM.
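
If you want to verify the peak usage on your own run, here is a minimal sketch using only standard torch.cuda counters (where exactly you call it inside the training loop is up to you):

import torch

def report_peak_vram(device: int = 0) -> None:
    # Highest number of bytes ever held by tensors since the last
    # torch.cuda.reset_peak_memory_stats() call.
    peak_mib = torch.cuda.max_memory_allocated(device) / 1024**2
    print(f"peak allocated: {peak_mib:,.0f} MiB")

# Hypothetical usage: call torch.cuda.reset_peak_memory_stats() before the first
# training step and report_peak_vram() after a few optimization steps.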

arcanite24 commented 2 months ago

@thavocado How much RAM are you using? I'm getting a SIGKILL with ~50GB of RAM

iamwangyabin commented 2 months ago

Please use DeepSpeed and set up Accelerate accordingly. I trained a LoRA with rank 16; it only needs 40GB of VRAM.

bghira commented 2 months ago

really shouldn't need deepspeed though. i think it's because the VAE and T5 / CLIP are all loaded during training.
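
one way to sanity-check that is to sum up the weights each loaded module holds; a rough sketch (not x-flux code, the vae/t5/clip names are just placeholders):

import torch

def param_footprint_gib(module: torch.nn.Module) -> float:
    # Weights-only footprint (parameters + buffers); activations, gradients and
    # optimizer states come on top of this.
    n_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in module.buffers())
    return n_bytes / 1024**3

# Hypothetical usage after the models are built in the training script
# (only `dit` appears in the traceback above; the other names are placeholders):
# for name, m in {"dit": dit, "vae": vae, "t5": t5, "clip": clip}.items():
#     print(f"{name}: {param_footprint_gib(m):.2f} GiB")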

thavocado commented 2 months ago

@thavocado How much RAM are you using? I'm getting a SIGKILL with ~50GB of RAM

@arcanite24 40GB of RAM here

WarriorMama777 commented 2 months ago

https://github.com/XLabs-AI/x-flux/issues/12#issuecomment-2276998054 This information was very helpful, thank you.

I deployed 3 x A40 48GB GPUs on Runpod for training and got the following results:

  • img_size: 1024: out of memory
  • img_size: 512: ✅

It seems that to train at 1024 resolution, we might need at least about 150GB of VRAM or more...(?)

CCRss commented 2 months ago

Just 30 hours on an ADA 6000, using the config from https://github.com/XLabs-AI/x-flux/issues/12#issuecomment-2276998054. Is that fine, or do I need to adjust something? 42GB used.

bghira commented 2 months ago

looks like the proper speed

ZhePang commented 1 month ago

#12 (comment) This information was very helpful, thank you.

I deployed 3 x A40 48GB GPUs on Runpod for training and got the following results:

  • img_size: 1024: out of memory
  • img_size: 512: ✅

It seems that to train at 1024 resolution, we might need at least about 150GB of VRAM or more...(?)

I'm running on 8 x A100 at 1024 img size and it requires around 61GB on each GPU.

aiXia121 commented 4 weeks ago

You need to run accelerate config first to configure DeepSpeed using the settings in the README, or make an accelerate_config.yaml for a single GPU:

compute_environment: LOCAL_MACHINE
debug: false                                                                                               
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch using:

accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"

With default settings in the example files this requires 42,837 MiB VRAM.

I'm running on 1 x A100 at 1024 img size and it requires around 63GB (65,147 MiB) on the A100, even though I have pre-processed the data and only load the VAE and DiT models.
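
For anyone wanting to do the same preprocessing, a rough sketch of the idea (the helper below is hypothetical, not an x-flux API; the point is just to cache the T5/CLIP text embeddings once so the text encoders never have to sit in VRAM during training):

import torch

@torch.no_grad()
def cache_text_embeddings(prompts, t5, clip, out_path="text_emb_cache.pt"):
    # One-off preprocessing pass: encode every caption once and save the result,
    # so only the VAE and DiT have to be resident on the GPU at train time.
    # `t5` and `clip` stand for the two text encoders, however your script loads
    # them (hypothetical names, not an x-flux API).
    cache = {}
    for prompt in prompts:
        cache[prompt] = {
            "txt": t5([prompt]).cpu(),   # sequence embeddings consumed by the DiT
            "vec": clip([prompt]).cpu(), # pooled embedding consumed by the DiT
        }
    torch.save(cache, out_path)
    return out_path

# At train time, torch.load() the cache and look embeddings up by caption
# instead of calling the text encoders.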