AmericanPresidentJimmyCarter / simple-flux-lora-training

MIT License

low VRAM usage #3

Closed renatomserra closed 2 months ago

renatomserra commented 2 months ago

Hello, thank you so much for the detailed guide!

I've successfully started training, but Vast.ai shows only 25/45 GB of GPU memory being used.

What am I doing wrong? I'm running on a Quadro RTX 8000.

```json
{
  "--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
  "--model_family": "flux",
  "--model_type": "lora",
  "--lora_type": "standard",
  "--lora_rank": 16,
  "--flux_lora_target": "all+ffs",
  "--optimizer": "adamw_bf16",
  "--train_batch_size": 16,
  "--gradient_accumulation_steps": 1,
  "--learning_rate": "1e-4",
  "--max_train_steps": 1000,
  "--num_train_epochs": 0,
  "--checkpointing_steps": 200,
  "--validation_steps": 100,
  "--validation_prompt": "a picture of 34555",
  "--validation_seed": 42,
  "--validation_resolution": "1024x1024",
  "--validation_guidance": 6,
  "--validation_guidance_rescale": "0.0",
  "--validation_num_inference_steps": "15",
  "--validation_negative_prompt": "",
  "--hub_model_id": "....",
  "--tracker_project_name": "...",
  "--tracker_run_name": "....",
  "--resume_from_checkpoint": "latest",
  "--data_backend_config": "config/multidatabackend.json",
  "--aspect_bucket_rounding": 2,
  "--seed": 42,
  "--minimum_image_size": 0,
  "--output_dir": "/root/SimpleTuner/output/models",
  "--checkpoints_total_limit": 2,
  "--push_to_hub": "true",
  "--push_checkpoints_to_hub": "true",
  "--report_to": "none",
  "--flux_guidance_value": 1.0,
  "--max_grad_norm": 1.0,
  "--flux_schedule_auto_shift": "true",
  "--validation_on_startup": "true",
  "--gradient_checkpointing": "true",
  "--caption_dropout_probability": 0.0,
  "--vae_batch_size": 1,
  "--allow_tf32": "true",
  "--resolution_type": "pixel_area",
  "--resolution": 1024,
  "--mixed_precision": "bf16",
  "--lr_scheduler": "constant_with_warmup",
  "--lr_warmup_steps": 100,
  "--metadata_update_interval": 60,
  "--validation_torch_compile": "false"
}
```

I'm using a set of 19 images at 1024x1024.

I would love some tips on how to maximize GPU usage.

renatomserra commented 2 months ago

getting this after a while

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.59 GiB. GPU 0 has a total capacity of 44.48 GiB of which 3.30 GiB is free. Process 3079794 has 41.18 GiB memory in use. Of the allocated memory 32.85 GiB is allocated by PyTorch, and 8.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
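The allocator hint in that message is worth trying on its own, since 8.12 GiB is reserved but unallocated (i.e., likely fragmented). A minimal sketch; the `train.sh` entrypoint name is an assumption about how training is launched:

```shell
# Let the CUDA caching allocator grow segments instead of reserving
# fixed-size blocks, which reduces fragmentation-related OOMs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Then relaunch training in the same shell, e.g.:
#   bash train.sh
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Note this only helps when the OOM is caused by fragmentation; if the working set genuinely exceeds VRAM, the batch size still has to come down.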

AmericanPresidentJimmyCarter commented 2 months ago

`"--train_batch_size": 16`

Try a smaller batch size, like 4 or 8.
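If the larger effective batch is still wanted, gradient accumulation can emulate it at roughly the per-step memory cost of the smaller batch. A quick sketch of the equivalence, using the numbers from the config above (the helper function is illustrative, not part of SimpleTuner):

```python
def effective_batch_size(train_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    """Number of samples contributing to each optimizer step."""
    return train_batch_size * gradient_accumulation_steps

# Original config: batch 16, no accumulation -> OOM on 45 GB.
assert effective_batch_size(16, 1) == 16

# Same effective batch at ~1/4 the per-step activation memory:
# "--train_batch_size": 4, "--gradient_accumulation_steps": 4
assert effective_batch_size(4, 4) == 16
```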

renatomserra commented 2 months ago

Thank you, I'll give that a try! Is there any way to check the progress of the training locally (within the instance), e.g. as a percentage?
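The console log prints the current step, and since `--max_train_steps` is fixed at 1000 in the config above, a percentage is just the step count over the total. A trivial sketch (the step value here is made up for illustration):

```python
def progress_pct(global_step: int, max_train_steps: int) -> float:
    """Training progress as a percentage of configured steps."""
    return 100.0 * global_step / max_train_steps

# e.g. if the log shows step 250 of the configured 1000:
print(f"{progress_pct(250, 1000):.1f}%")  # → 25.0%
```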

renatomserra commented 2 months ago
[Screenshot: 2024-09-29 15:55]

Still getting this with a train_batch_size of 4. Any ideas?

AmericanPresidentJimmyCarter commented 2 months ago

Try `"--base_model_precision": "fp8-quanto"` in your config.json.
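The reason fp8 quantization helps: FLUX.1-dev's transformer is roughly 12B parameters, so storing the base weights in 8 bits instead of 16 halves their footprint, leaving more room for activations and the LoRA optimizer state. A back-of-the-envelope estimate (the parameter count is approximate):

```python
PARAMS = 12e9  # approximate parameter count of FLUX.1-dev's transformer

def weight_gib(bytes_per_param: float) -> float:
    """Memory for the weights alone, in GiB."""
    return PARAMS * bytes_per_param / 1024**3

print(f"bf16: {weight_gib(2):.1f} GiB, fp8: {weight_gib(1):.1f} GiB")
# → bf16: 22.4 GiB, fp8: 11.2 GiB
```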

renatomserra commented 2 months ago

Yeah, that worked, thank you! Would I need this on an A100 with 80 GB?

AmericanPresidentJimmyCarter commented 2 months ago

No, unlikely. That card has lots of VRAM.

renatomserra commented 2 months ago

Thank you for helping out, I appreciate it.

So I changed to an 80 GB A100. Playing around, train_batch_size=1 yields the fastest training at 2.26 s/it, but Vast.ai shows only 25 GB of the 80 being used; higher train_batch_size values use more GPU memory but are much slower per iteration. Is there anything that would use more GPU RAM and accelerate the process?

Or am I looking at it wrong? It says 100% GPU in the bottom left, but only 27 GB/80 GB above.

[Screenshot: 2024-09-29 22:23]
AmericanPresidentJimmyCarter commented 2 months ago

Please refer to the guide:

The batch size is how many samples are run at the same time. Higher batch sizes tend to be more stable, higher quality, and learn faster, but are slower and take more VRAM. For the absolute fastest training or when running on low memory systems like 3090/4090, set this to be 1 or 2.

So the steps are slower, but the model can often learn more from each one, and/or you can use a higher LR.
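One common heuristic for the higher-LR part (a widely used rule of thumb, not something from the guide) is to scale the learning rate linearly with batch size:

```python
def scaled_lr(base_lr: float, batch_size: int,
              base_batch_size: int = 1) -> float:
    """Linear scaling rule: grow the LR in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

# e.g. if 1e-4 works at batch size 1, try ~4e-4 at batch size 4
print(scaled_lr(1e-4, 4))  # → 0.0004
```

Treat the result as a starting point rather than a guarantee; LoRA training on small datasets is often sensitive to LR, so validating at `--validation_steps` intervals before committing to a long run is prudent.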