aria1th / Hypernetwork-MonkeyPatch-Extension

Extension that patches Hypernetwork structures and training

Suggestion: Ability to change beta 1, beta 2, and weight decay in an advanced training option #11

Closed: Heathen closed this issue 1 year ago

Heathen commented 1 year ago

Right now the only way to change those is by editing the .py file; it would be nice to have them in the UI.

aria1th commented 1 year ago

Can you explain it more specifically? (What are beta 1 and beta 2?) For weight decay, I guess it's AdamW's weight decay parameter? I'll do it soon; I've just been distracted by a Windows update blowing up my whole progress... Please suggest features freely, but more detailed suggestions are helpful.

Heathen commented 1 year ago

Sure, here's the documentation: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html

class torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, *, maximize=False, foreach=None, capturable=False)

The betas are the exponential decay rates for the momentum estimates, while weight decay pulls every weight in the neural network toward zero on every step.

I've been training with weight_decay=0.1 and it's a lot more efficient. Personally I only change weight decay, but a few people on the AUTOMATIC1111 repo are changing the betas as well.
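
For reference, here is a minimal sketch (not the extension's actual code; `hypernetwork_params` is just a placeholder parameter list) of how those three values get passed to `torch.optim.AdamW`:

```python
import torch

# Minimal sketch, assuming a plain PyTorch training loop.
# `hypernetwork_params` stands in for whatever parameters are being trained.
hypernetwork_params = [torch.nn.Parameter(torch.randn(768, 768))]

optimizer = torch.optim.AdamW(
    hypernetwork_params,
    lr=5e-4,              # learning rate
    betas=(0.9, 0.999),   # beta1 / beta2: decay rates of the moment estimates
    weight_decay=0.1,     # decoupled weight decay, pulls weights toward zero
)
```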

aria1th commented 1 year ago

I added the option: https://github.com/aria1th/Hypernetwork-MonkeyPatch-Extension/commit/08e2bf683d7c9cd26baf394fcdc917bf5155e2c3

Note that I'm not sure whether loading the optimizer state applies the hyperparameters as well, so keep that in mind. At least the hyperparameters are applied first and the optimizer state is loaded afterwards, so they could (probably) be overwritten.
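
If the restored optimizer state does turn out to overwrite them, one possible fix (a sketch only; `saved_optimizer_state`, `beta1`, `beta2` and `weight_decay` are placeholder names, not the extension's variables) is to re-apply the UI values to the param groups after loading:

```python
# Restore the saved optimizer state first...
optimizer.load_state_dict(saved_optimizer_state)

# ...then force the hyperparameters from the UI back into every param group,
# so they win over whatever was stored in the checkpoint.
for group in optimizer.param_groups:
    group["betas"] = (beta1, beta2)
    group["weight_decay"] = weight_decay
```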

Heathen commented 1 year ago

Tested and it is working. But you might also want to let people use values higher than 1 for weight decay; some people are using values as high as 200. Having the weights move toward zero isn't bad in a hypernetwork, since at zero it just becomes a pass-through for the original model.

Training with really high weight decay forces the hypernetwork to maintain the original model's composition and subject, while only learning to make changes based on the training images. It's extremely good for style change without content changes.

Realistically, the maximum value should be 1/learning rate, since the weight decay step multiplies each weight by (1 - LR*WD) on top of the gradient and momentum updates. As long as that factor doesn't reach zero, it is fine.

Easier to limit it to 500 or so.
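
As a quick numeric check of that bound (just a sketch using the values mentioned in this thread):

```python
# AdamW's decoupled decay multiplies each weight by (1 - lr * wd) per step,
# so the factor must stay positive, i.e. wd < 1 / lr.
lr, wd = 5e-4, 200.0
decay_factor = 1.0 - lr * wd   # 0.9 -> weights shrink by 10% every step
max_wd = 1.0 / lr              # 2000.0 -> anything above this flips the sign
print(decay_factor, max_wd)
```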

aria1th commented 1 year ago

Well, I haven't seen weight decay over 1 used even in residual networks, which are sometimes expected to converge to an identity layer on their own, so it might be unsafe... but I'll leave it to users.

Heathen commented 1 year ago

Someone was using 1e20 for weight decay and somehow it worked.

Anyway: LR 5e-4, WD 2.0, 57 gradients (full image set), 74 steps/epochs (the same thing in this case), about 40 minutes of training time. [result images attached]

Easy style transfer. :>

To retain even more of the original composition, you have to push WD even higher, to 10 or 20.

Heathen commented 1 year ago

Tested with even higher values, everything seems to be working well, thanks!