leonary opened this issue 2 months ago
same here.
A6000, CUDA OOM at rank 4 and 512 resolution. Something's whack here.
I tried LoRA training with the flux-schnell and dev models using train batch size 1, gradient_accumulation_steps 4, and rank 2 on a 40GB A100, but it still raised a CUDA out of memory exception.
same here
You need to run `accelerate config` to configure DeepSpeed first using the settings in the README, or make an accelerate_config.yaml for a single GPU:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
gradient_accumulation_steps: 2
gradient_clipping: 1.0
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: false
zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Launch using:
accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"
With default settings in the example files this requires 42,837 MiB VRAM.
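For reference, when the script is launched through that config, Accelerate reads the DeepSpeed ZeRO stage 2 settings from the YAML, and the training loop only needs the usual `Accelerator` wrapping. A minimal, self-contained sketch with a dummy model and loss (illustrative only, not the actual x-flux training loop):

```python
import torch
from torch import nn
from accelerate import Accelerator

# These arguments mirror the YAML above; when launched via
# `accelerate launch --config_file accelerate_config.yaml`, the DeepSpeed
# ZeRO-2 settings are picked up from the file instead.
accelerator = Accelerator(gradient_accumulation_steps=2, mixed_precision="bf16")

# Stand-ins for the Flux DiT with LoRA params, its optimizer, and the dataloader.
model = nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = torch.utils.data.DataLoader(torch.randn(32, 64), batch_size=1)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for batch in loader:
    with accelerator.accumulate(model):    # handles gradient accumulation
        loss = model(batch).pow(2).mean()  # dummy loss for the sketch
        accelerator.backward(loss)         # DeepSpeed/bf16-aware backward
        optimizer.step()
        optimizer.zero_grad()
```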
@thavocado How much RAM are you using? I'm getting a SIGKILL with ~50GB of RAM
Please use DeepSpeed and configure Accelerate accordingly. I'm training a LoRA with rank 16 and it only needs 40GB of VRAM.
It really shouldn't need DeepSpeed though. I think it's because the VAE and T5/CLIP are all kept loaded during training.
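If that's the cause, one possible mitigation (just a sketch, not something the current script does) is to keep the frozen VAE and T5/CLIP on the CPU and only hop them onto the GPU for the encode step of each batch:

```python
import torch
from torch import nn

def encode_on_gpu(encoder: nn.Module, inputs: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Run a frozen encoder (VAE, T5, CLIP, ...) on the GPU, then park it back on CPU.

    Trades some transfer time per batch for VRAM that the DiT, LoRA params,
    and optimizer states can use instead.
    """
    encoder.to(device)
    with torch.no_grad():
        out = encoder(inputs.to(device))
    encoder.to("cpu")
    torch.cuda.empty_cache()  # release unused cached memory held by the allocator
    return out
```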
> @thavocado How much RAM are you using? I'm getting a SIGKILL with ~50GB of RAM
@arcanite24 40GB of RAM here
https://github.com/XLabs-AI/x-flux/issues/12#issuecomment-2276998054 This information was very helpful, thank you.
I deployed 3 x A40 48GB GPUs on Runpod for training and got the following results:
- img_size: 1024: out of memory
- img_size: 512: ✅
It seems that training at 1024 resolution might need at least about 150GB of VRAM or more...(?)
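That jump is roughly what you'd expect from the sequence lengths involved. A quick back-of-envelope check, assuming the usual 8x VAE downsampling and 2x2 latent patchification (both assumptions on my part, not constants from this repo):

```python
# Image token counts at 512 vs 1024, assuming an 8x VAE downsample
# and 2x2 patchification of the latents.
def image_tokens(img_size: int, vae_factor: int = 8, patch: int = 2) -> int:
    side = img_size // (vae_factor * patch)
    return side * side

t512, t1024 = image_tokens(512), image_tokens(1024)
print(t512, t1024)          # 1024 vs 4096 image tokens
print(t1024 / t512)         # 4x more tokens -> roughly 4x the activations
print((t1024 / t512) ** 2)  # up to ~16x attention-score memory, if materialized
```

Activations scale with the token count, and attention can scale with its square when the score matrix is materialized, which lines up with 512 fitting on a 48GB A40 while 1024 does not.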
Just 30 hours on an Ada 6000. Using the config from https://github.com/XLabs-AI/x-flux/issues/12#issuecomment-2276998054. Is that fine or do I need to adjust something? 42GB used.
looks like the proper speed
I'm running on 8x A100 at 1024 img size and it requires around 61GB on each GPU.
I'm running on 1x A100 at 1024 img size and it requires around 63GB (65,147 MiB) on the A100, even though I've preprocessed the data and only load the VAE and DiT model.
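For anyone wondering what that kind of preprocessing looks like, here is a rough sketch of caching VAE latents and text embeddings ahead of time so that only the DiT (plus LoRA parameters and optimizer state) has to stay on the GPU during training. All names here are placeholders, not the script from this repo:

```python
import torch

@torch.no_grad()
def cache_dataset(vae, text_encoder, dataset, out_path, device="cuda"):
    """Precompute latents and text embeddings once and save them to disk.

    `vae`, `text_encoder`, and `dataset` are placeholders for the Flux VAE,
    the T5/CLIP encoders, and an (image, tokenized caption) dataset.
    """
    vae.to(device)
    text_encoder.to(device)
    records = []
    for image, caption_tokens in dataset:
        latent = vae(image.unsqueeze(0).to(device)).cpu()
        embed = text_encoder(caption_tokens.unsqueeze(0).to(device)).cpu()
        records.append({"latent": latent, "text_embed": embed})
    torch.save(records, out_path)
    return out_path
```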
Training a LoRA runs out of video memory on a single A100 80GB card. LoRA rank is 16 and batch size is 1. Any help would be much appreciated.