Can you explain it more specifically? (What are beta1 and beta2?) For weight decay, I guess it's AdamW's weight decay parameter? I'll do it soon; I'm just too distracted because a Windows update blew up my whole progress... Please suggest features freely, but more detailed suggestions are helpful.
Sure, here's the documentation: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
class torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, *, maximize=False, foreach=None, capturable=False)
The betas control the exponential decay rates of Adam's first and second moment (momentum) estimates, while weight decay pulls every weight in the neural network toward zero at each step.
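To make that concrete, here's a minimal sketch (not the extension's actual code) of constructing AdamW with those hyperparameters; the parameter tensor is just a stand-in for the hypernetwork weights:

```python
import torch

# Stand-in for the hypernetwork's trainable weights.
params = [torch.nn.Parameter(torch.zeros(10, 10))]

optimizer = torch.optim.AdamW(
    params,
    lr=5e-4,             # learning rate
    betas=(0.9, 0.999),  # decay rates for the 1st and 2nd moment estimates
    eps=1e-8,            # numerical stability term
    weight_decay=0.1,    # decoupled weight decay, pulls weights toward zero each step
)
```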
I've been training with weight_decay=0.1 and it's a lot more efficient. Personally I only change weight decay, but a few people over on the AUTOMATIC1111 repo are changing the betas as well.
I added the option: https://github.com/aria1th/Hypernetwork-MonkeyPatch-Extension/commit/08e2bf683d7c9cd26baf394fcdc917bf5155e2c3
Note that I'm not sure whether loading the optimizer state also applies the hyperparameters, so keep that in mind. Right now the hyperparameters are applied first and then the optimizer state is loaded, so they could (probably) be overwritten.
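Here's a hedged sketch of that ordering concern (standalone PyTorch, not the extension's code): load_state_dict() restores the saved param_group hyperparameters, so anything set before the load gets replaced, and re-applying the values afterwards avoids that:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(4))]

# Pretend this is the optimizer state saved alongside an earlier checkpoint.
saved = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.01).state_dict()

# New run: the UI hyperparameters are applied first...
optimizer = torch.optim.AdamW(params, lr=5e-4, weight_decay=2.0)
# ...then the saved state is loaded, which restores lr=1e-3 / weight_decay=0.01.
optimizer.load_state_dict(saved)
print(optimizer.param_groups[0]["lr"], optimizer.param_groups[0]["weight_decay"])  # 0.001 0.01

# Re-apply the desired values afterwards so the checkpoint can't override them.
for group in optimizer.param_groups:
    group["lr"] = 5e-4
    group["weight_decay"] = 2.0
```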
Tested and it is working. Also, you might want to let people use values higher than 1 for weight decay; some people are using values up to 200. Weights moving toward zero isn't a bad thing in a hypernetwork, since at zero it becomes a pass-through for the original model.
Training with a really high weight decay forces the hypernetwork to keep the original model's composition and subject, while only learning to make changes based on the training images. It's extremely good for style changes without content changes.
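As a small illustration of that pass-through point (assuming the usual residual form `out = x + mlp(x)` for these hypernetwork layers; this is not the extension's code), a layer whose weights have decayed to zero just returns its input:

```python
import torch

class ResidualHyperLayer(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, x):
        # With weight and bias decayed all the way to zero, this is x + 0 = x.
        return x + self.linear(x)

layer = ResidualHyperLayer(8)
torch.nn.init.zeros_(layer.linear.weight)
torch.nn.init.zeros_(layer.linear.bias)

x = torch.randn(2, 8)
print(torch.allclose(layer(x), x))  # True: the layer is a pure pass-through
```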
Realistically, the maximum value should be about 1/learning rate, since the decoupled weight decay step multiplies each weight by (1 - LR*WD) every step, on top of the usual Adam update with all its momentum shenanigans. As long as that factor doesn't reach zero, it is fine.
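A quick numeric sketch of that bound (decoupled decay only, ignoring the gradient step; the values are just examples):

```python
lr = 5e-4

for wd in (0.1, 2.0, 200.0, 1 / lr, 1e20):
    factor = 1 - lr * wd  # per-step multiplier applied to each weight
    status = "ok" if factor > 0 else "weights zeroed or sign-flipped"
    print(f"wd={wd:g}  per-step shrink factor={factor:g}  -> {status}")
```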
It's easier to just cap it at 500 or so.
Well, I haven't seen weight decay above 1 used in residual networks, which are sometimes tested for whether they converge to an identity layer, so it might be unsafe... but I'll leave it up to users.
Someone was using 1e20 for weight decay and somehow it worked.
Anyway, LR 5e-4, WD 2.0, 57 gradients (full image set), 74 steps/epochs (same thing in this case), about 40 minutes of training time
Easy style transfer. :>
To retain even more of the original composition, you have to push WD even higher, to 10 or 20.
Tested with even higher values, everything seems to be working well, thanks!
Right now the only way to change these is by editing the .py file; it would be nice to have them in the UI.