Closed · renatomserra closed this issue 2 months ago
Getting this after a while:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.59 GiB. GPU 0 has a total capacity of 44.48 GiB of which 3.30 GiB is free. Process 3079794 has 41.18 GiB memory in use. Of the allocated memory 32.85 GiB is allocated by PyTorch, and 8.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
"--train_batch_size": 16
Try a smaller batch size, like 4 or 8.
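For example, in config.json (a sketch showing only the relevant keys; bumping --gradient_accumulation_steps is optional and just keeps the effective batch size near the original 16):

```json
{
  "--train_batch_size": 4,
  "--gradient_accumulation_steps": 4
}
```

Gradient accumulation trades a bit of wall-clock time per effective batch for lower peak VRAM, since only 4 samples are held on the GPU at once.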
Thank you, will give that a try! Is there any way to check the training progress locally (within the instance), like percentage-wise?
Still getting this with a train_batch_size of 4, any ideas?
Try "--base_model_precision": "fp8-quanto"
in your config.json
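The relevant part of config.json would then look something like this (a sketch showing only the keys under discussion; everything else stays as you have it):

```json
{
  "--base_model_precision": "fp8-quanto",
  "--train_batch_size": 4
}
```

The idea is that the frozen base model weights get quantized to fp8, shrinking their VRAM footprint, while the LoRA parameters themselves still train at the configured bf16 precision.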
Yeah, that worked, thank you! Would I need this on an A100 with 80GB?
No, unlikely. That card has lots of VRAM.
Thank you for helping out, i appreciate it.
So I changed to an 80GB A100. Playing around, it seems train_batch_size=1
yields the fastest training at 2.26 s/it, but looking at Vast.ai, only 25GB of the 80GB are being used with this, while a higher train_batch_size uses more GPU memory but yields a much slower s/it.
Is there anything that would use more GPU RAM and accelerate the process?
Or am I looking at it wrong? It says 100% GPU in the bottom left, but only 27GB/80GB above.
Please refer to the guide:
The batch size is how many samples are run at the same time. Higher batch sizes tend to be more stable, higher quality, and learn faster, but are slower and take more VRAM. For the absolute fastest training or when running on low memory systems like 3090/4090, set this to be 1 or 2.
So the steps are slower, but the model can often learn more from each one, and/or you can use a higher LR.
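As a rough illustration (the values are hypothetical, not a recommendation), a higher-batch variant of the posted config might look like this in config.json:

```json
{
  "--train_batch_size": 4,
  "--learning_rate": "2e-4",
  "--lr_scheduler": "constant_with_warmup",
  "--lr_warmup_steps": 100
}
```

When comparing runs, look at samples per second (train_batch_size divided by s/it) rather than raw s/it: a batch of 4 at 6 s/it still pushes more samples through than a batch of 1 at 2.26 s/it.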
Hello, thank you so much for the detailed guide!
I've successfully started training, but looking at Vast.ai it shows only 25GB of the 45GB of GPU memory being used.
What am I doing wrong? I'm running on a Quadro RTX 8000.
{ "--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev", "--model_family": "flux", "--model_type": "lora", "--lora_type": "standard", "--lora_rank": 16, "--flux_lora_target": "all+ffs", "--optimizer": "adamw_bf16", "--train_batch_size": 16, "--gradient_accumulation_steps": 1, "--learning_rate": "1e-4", "--max_train_steps": 1000, "--num_train_epochs": 0, "--checkpointing_steps": 200, "--validation_steps": 100, "--validation_prompt": "a picture of 34555", "--validation_seed": 42, "--validation_resolution": "1024x1024", "--validation_guidance": 6, "--validation_guidance_rescale": "0.0", "--validation_num_inference_steps": "15", "--validation_negative_prompt": "", "--hub_model_id": "....", "--tracker_project_name": "...", "--tracker_run_name": "....", "--resume_from_checkpoint": "latest", "--data_backend_config": "config/multidatabackend.json", "--aspect_bucket_rounding": 2, "--seed": 42, "--minimum_image_size": 0, "--output_dir": "/root/SimpleTuner/output/models", "--checkpoints_total_limit": 2, "--push_to_hub": "true", "--push_checkpoints_to_hub": "true", "--report_to": "none", "--flux_guidance_value": 1.0, "--max_grad_norm": 1.0, "--flux_schedule_auto_shift": "true", "--validation_on_startup": "true", "--gradient_checkpointing": "true", "--caption_dropout_probability": 0.0, "--vae_batch_size": 1, "--allow_tf32": "true", "--resolution_type": "pixel_area", "--resolution": 1024, "--mixed_precision": "bf16", "--lr_scheduler": "constant_with_warmup", "--lr_warmup_steps": 100, "--metadata_update_interval": 60, "--validation_torch_compile": "false" }
I'm using a set of 19 images at 1024x1024.
Would love some tips on how to maximize GPU usage.