derrian-distro / LoRA_Easy_Training_Scripts

A UI made in Pyside6 to make training LoRA/LoCon and other LoRA type models in sd-scripts easy
GNU General Public License v3.0

Scheduled LoRAs go to NaN past 2~4 LoRAs down in the queue #180

Closed: kukaiN closed this issue 7 months ago

kukaiN commented 7 months ago

Hi, first off, I love your work, it's super amazing; I've been using it for 4 months.

I often schedule 4~6 LoRAs (XL or 1.5 depending on the day, I don't mix them in the queue) and have them bake while I'm asleep. I've noticed that the last 2~3 LoRAs in the queue sometimes go to NaN and come out broken. I don't think this is a toml issue, because if I restart my PC, start a new instance of the training script, and load the toml associated with the NaN LoRA, it comes out fine. Is there a value that isn't reinitialized before moving on to the next item in the queue? I do train with different learning rates, so it could be a strong lr --> NaN situation, but I've also talked with others who called the queue "cursed", so I was wondering if you would take a look at it.
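
To illustrate what I mean by reinitialization: if each queued run were launched as its own process, nothing (optimizer state, scheduler state, cached tensors) could carry over between LoRAs. This is just a hypothetical sketch, not the actual queue code; the script path, flag, and config names are placeholders:

```python
# Hypothetical sketch of a fully isolated queue (placeholder paths/flags, not
# the real queue implementation): every toml gets a fresh Python process, so
# no training state can leak from one LoRA into the next.
import subprocess
import sys

queued_configs = ["lora_a.toml", "lora_b.toml", "lora_c.toml"]  # placeholders

for config in queued_configs:
    # Run sd-scripts in its own interpreter and wait for it to finish.
    result = subprocess.run(
        [sys.executable, "sd-scripts/train_network.py", f"--config_file={config}"],
        check=False,
    )
    if result.returncode != 0:
        print(f"{config} exited with code {result.returncode}, stopping the queue")
        break
```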

There's no error being printed to the console, so I don't have any insight from that side. However, I have wandb turned on (I use cosine with restarts with warmup), so I can see that the warmup (5%) is working correctly and the lr goes up, and then it suddenly crashes to NaN.
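
For what it's worth, this is roughly how I dig through a finished run afterwards; a small sketch using the wandb API (the run path and the metric key are placeholders, the exact key the trainer logs under may differ):

```python
# Sketch of locating where a finished run went bad via the wandb API
# (placeholder run path and metric key).
import math
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/abc123")  # placeholder run path

for row in run.scan_history(keys=["_step", "loss/current"]):
    loss = row.get("loss/current")
    if loss is not None and math.isnan(loss):
        print(f"loss first became NaN at step {row['_step']}")
        break
```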

Any insight would be amazing and thank you for your time.

derrian-distro commented 7 months ago

Well, if it goes to NaN, then it's probably that your lr is too high and the weights are growing too large. I've talked to many people about the queue system and nobody has mentioned it breaking to me, nor have I experienced it myself. That being said, I already did a complete rewrite of everything over on dev, so if there was an issue, it's likely already fixed.
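
For what it's worth, the usual guards against that are clipping gradients and bailing out as soon as the loss goes non-finite, so a bad run can't quietly save a broken LoRA. A generic PyTorch sketch, not the actual sd-scripts loop:

```python
# Generic PyTorch sketch (not the sd-scripts training loop): clip gradients so
# a high lr can't blow the weights up, and abort as soon as the loss stops
# being finite instead of silently continuing.
import torch

def train_step(model, batch, optimizer, scheduler, max_grad_norm=1.0):
    optimizer.zero_grad(set_to_none=True)
    loss = model(**batch)  # placeholder forward pass that returns a scalar loss
    if not torch.isfinite(loss):
        raise RuntimeError("loss became NaN/Inf, aborting this run")
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
    return loss.item()
```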