SimpleTuner version: latest git main (includes FlashAttention3 and FP8 fixes)
v1.1.1 also works, but it does not include FA3 or TorchAO FP8
Your config.json needs the following values set:
quantize_via=accelerator
speeds up quantisation (only matters when quantisation is enabled; it is off by default)
max_grad_norm=0.01
optimizer=bnb-lion8bit
learning_rate=2e-4
train_batch_size=1
validation_torch_compile=true
lora_rank=16
lora_alpha=16
flux_lora_target=all+ffs
model_type=lora
lora_type=standard
base_model_precision=no_change
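Collected in one place, the config.json fragment might look like this (a sketch only; it lists just the keys above, and some SimpleTuner versions expect a `--`-prefixed spelling, so check yours):

```json
{
  "quantize_via": "accelerator",
  "max_grad_norm": 0.01,
  "optimizer": "bnb-lion8bit",
  "learning_rate": 2e-4,
  "train_batch_size": 1,
  "validation_torch_compile": true,
  "lora_rank": 16,
  "lora_alpha": 16,
  "flux_lora_target": "all+ffs",
  "model_type": "lora",
  "lora_type": "standard",
  "base_model_precision": "no_change"
}
```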
BF16 is currently the fastest training precision level in PyTorch 2.6 (Sep 29th build)
fp8-torchao is the fastest compiled training precision level but takes a long time to compile at startup
The following values get added into config.env:
export TRAINING_NUM_PROCESSES=8
# Uncomment this to use torch.compile for extra speedup if you intend to train for much longer than 2 minutes.
# Compiling takes a good 5-10 minutes depending on the system and the chosen flags, so enabling
# torch.compile does not make sense for shorter training runs.
#export TRAINING_DYNAMO_BACKEND=inductor
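The compile-or-not tradeoff above can be sanity-checked with a quick break-even estimate. All numbers below are illustrative assumptions (midpoint compile cost, a hypothetical 1.25x speedup), not measurements:

```python
# Rough break-even point for torch.compile: how many training steps
# before the one-time compile cost pays for itself.
def breakeven_steps(compile_overhead_sec, base_step_sec, speedup):
    compiled_step_sec = base_step_sec / speedup  # per-step time after compile
    saved_per_step = base_step_sec - compiled_step_sec
    return compile_overhead_sec / saved_per_step

# e.g. 7.5 min compile overhead, 1 s/step baseline, assumed 1.25x speedup:
print(breakeven_steps(450, 1.0, 1.25))  # ~2250 steps
```

Under those assumptions, runs shorter than a couple thousand steps never recoup the compile time, which matches the advice above.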
For multidatabackend.json, we just use a single square-cropped 512x512 dataset.
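A multidatabackend.json for that single dataset might look roughly like the sketch below. The ids, paths, and cache directories are placeholders, and the field names follow SimpleTuner's dataloader configuration at time of writing, so verify against the docs for your version:

```json
[
  {
    "id": "my-dataset",
    "type": "local",
    "instance_data_dir": "/path/to/images",
    "resolution": 512,
    "resolution_type": "pixel",
    "crop": true,
    "crop_style": "center",
    "crop_aspect": "square",
    "caption_strategy": "textfile",
    "cache_dir_vae": "cache/vae"
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text"
  }
]
```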
This configuration reaches about 1 iteration per second, roughly 2 minutes for 100 steps (including the ~20-30 second startup time)
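The quoted timing is easy to verify: 100 steps at 1 it/s is 100 seconds, plus startup. A trivial sketch, with the startup figure assumed at the midpoint of the quoted 20-30 second range:

```python
# Wall-clock estimate: fixed startup cost plus constant per-step time.
def estimated_wall_time_sec(steps, its_per_sec=1.0, startup_sec=25):
    return startup_sec + steps / its_per_sec

print(estimated_wall_time_sec(100))  # 125.0 -> about 2 minutes
```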
Hey Jim, thanks for the work on the guide.
I wanted to contribute some info here for anyone wanting to train a potato LoRA like Fal offers.