a-l-e-x-d-s-9 closed this issue 3 months ago
Try

```
export USE_BITFIT=true
```

or comment out these lines, so that the trainer for a full model uses the BitFit technique for tuning, which freezes all of the weights and tunes only the model's biases. I was wondering what it'd be like with SD3, and yes, you can use a much higher LR and it will cook less. This is why I made it the default.
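For illustration, here's a minimal sketch of the BitFit selection rule in plain Python. The parameter names below are made up, not SD3's real module names; in a PyTorch trainer the equivalent move would be setting `param.requires_grad = name.endswith(".bias")` over `model.named_parameters()`:

```python
# Minimal sketch of BitFit-style parameter selection: only bias terms
# stay trainable; every weight tensor is frozen. The names below are
# illustrative placeholders, not the actual SD3 parameter names.
def bitfit_trainable(param_names):
    """Return the parameter names BitFit would leave unfrozen."""
    return [name for name in param_names if name.endswith(".bias")]

params = [
    "transformer.blocks.0.attn.to_q.weight",
    "transformer.blocks.0.attn.to_q.bias",
    "transformer.blocks.0.mlp.fc1.weight",
    "transformer.blocks.0.mlp.fc1.bias",
]
print(bitfit_trainable(params))
# → ['transformer.blocks.0.attn.to_q.bias', 'transformer.blocks.0.mlp.fc1.bias']
```

Because only the (comparatively tiny) bias vectors receive gradient updates, the model can tolerate a much larger learning rate before degrading.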
However, just this morning I changed the default example configs so that this isn't applied out of the box; it's left there commented out as an example of how a setting might be applied conditionally.
see some experiments here: https://wandb.ai/bghira/sd3-training?nw=nwuserbghira
But generally the full unfrozen weights and biases will cook no matter what LR you set: either it's as if it does nothing to the model at all, or suddenly it's bearing down like the boulder behind Indiana Jones; it picks up all the worst parts of the dataset and then fries itself.
You can tell it's frying because it goes into square-grid nonsense and then loses all depth, contrast, and prompt adherence (in that order).
I have been experimenting with different learning rates. With LR 1e-6 and a batch size of ~25, it doesn't seem to change much after 4k steps. With LR 1e-5 and batch size 27, there was a mild change after 600 steps. In the last run with LR 1e-4 and batch size 27, after 1200 steps the model seemed to improve on the style, but it's still not consistent and not very well learned; even 1800 steps don't look like enough. LR 1e-4 for a full finetune is a huge LR that would have nuked 1.5 and SDXL. Does it make sense that the model learns this slowly even with such a huge LR? Is it possible that the LR is being ignored, normalized, or automatically adjusted somehow?
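Purely as back-of-envelope arithmetic (this is not how the optimizer actually behaves, just a crude way to see that the three runs aren't equivalent amounts of training), the runs above can be compared by LR × steps × batch size:

```python
# Crude "exposure" comparison of the three runs described above:
# lr * steps * batch_size. Not a rigorous metric; it only shows the
# 1e-4 run applied roughly 20-30x more total update magnitude than
# the other two, so slow visible learning there is the surprise.
runs = {
    "lr 1e-6, 4000 steps, bs 25": 1e-6 * 4000 * 25,
    "lr 1e-5,  600 steps, bs 27": 1e-5 * 600 * 27,
    "lr 1e-4, 1200 steps, bs 27": 1e-4 * 1200 * 27,
}
for label, exposure in runs.items():
    print(f"{label}: {exposure:.3f}")
```

If the highest-exposure run still changes the model least per unit of this figure, something (scheduler warmup, gradient clipping, or an adaptive optimizer) may indeed be damping the effective step size.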
multidatabackend.json:
sdxl-env.sh: