SimpleTuner version: latest git main (includes FlashAttention3 and FP8 fixes)
v1.1.1 also works, but it does not include FA3 or TorchAO FP8
Your config.json needs the following values set:
quantize_via=accelerator
speeds up quantisation (only matters when quantisation is enabled; it is off by default)
max_grad_norm=0.01
optimizer=bnb-lion8bit
learning_rate=2e-4
train_batch_size=1
validation_torch_compile=true
lora_rank=16
lora_alpha=16
flux_lora_target=all+ffs
model_type=lora
lora_type=standard
base_model_precision=no_change
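Collected in one place, the config.json fragment might look like this (a sketch only; it lists just the keys above, and some SimpleTuner versions expect a `--`-prefixed spelling, so check yours):

```json
{
  "quantize_via": "accelerator",
  "max_grad_norm": 0.01,
  "optimizer": "bnb-lion8bit",
  "learning_rate": 2e-4,
  "train_batch_size": 1,
  "validation_torch_compile": true,
  "lora_rank": 16,
  "lora_alpha": 16,
  "flux_lora_target": "all+ffs",
  "model_type": "lora",
  "lora_type": "standard",
  "base_model_precision": "no_change"
}
```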
BF16 is currently the fastest training precision level in PyTorch 2.6 (Sep 29th build)
fp8-torchao is the fastest compiled training precision level but takes a long time to compile at startup
The following values get added into config.env:
export TRAINING_NUM_PROCESSES=8
# Uncomment this to use torch.compile for extra speedup if you intend to train for much longer than 2 minutes.
# Compiling takes a good 5-10 minutes depending on the system and the chosen flags, so enabling
# torch.compile does not make sense for shorter training runs.
#export TRAINING_DYNAMO_BACKEND=inductor
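The compile-or-not tradeoff above can be sanity-checked with a quick break-even estimate. All numbers below are illustrative assumptions (midpoint compile cost, a hypothetical 1.25x speedup), not measurements:

```python
# Rough break-even point for torch.compile: how many training steps
# before the one-time compile cost pays for itself.
def breakeven_steps(compile_overhead_sec, base_step_sec, speedup):
    compiled_step_sec = base_step_sec / speedup  # per-step time after compile
    saved_per_step = base_step_sec - compiled_step_sec
    return compile_overhead_sec / saved_per_step

# e.g. 7.5 min compile overhead, 1 s/step baseline, assumed 1.25x speedup:
print(breakeven_steps(450, 1.0, 1.25))  # ~2250 steps
```

Under those assumptions, runs shorter than a couple thousand steps never recoup the compile time, which matches the advice above.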
For multidatabackend.json, we just use a single square-cropped 512x512 dataset.
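A multidatabackend.json for that single dataset might look roughly like the sketch below. The ids, paths, and cache directories are placeholders, and the field names follow SimpleTuner's dataloader configuration at time of writing, so verify against the docs for your version:

```json
[
  {
    "id": "my-dataset",
    "type": "local",
    "instance_data_dir": "/path/to/images",
    "resolution": 512,
    "resolution_type": "pixel",
    "crop": true,
    "crop_style": "center",
    "crop_aspect": "square",
    "caption_strategy": "textfile",
    "cache_dir_vae": "cache/vae"
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text"
  }
]
```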
This configuration reaches about 1 iteration per second, roughly 2 minutes for 100 steps (including the ~20-30 second startup time)
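The quoted timing is easy to verify: 100 steps at 1 it/s is 100 seconds, plus startup. A trivial sketch, with the startup figure assumed at the midpoint of the quoted 20-30 second range:

```python
# Wall-clock estimate: fixed startup cost plus constant per-step time.
def estimated_wall_time_sec(steps, its_per_sec=1.0, startup_sec=25):
    return startup_sec + steps / its_per_sec

print(estimated_wall_time_sec(100))  # 125.0 -> about 2 minutes
```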
Hey Jim, thanks for the work on the guide.
I wanted to contribute some info here for anyone wanting to train a potato LoRA like Fal offers.