Closed by cnellington 1 year ago
We could do something like this:
```python
import numpy as np

def d_adaptation(x0, f, grad_f, D0, n_iter):
    """Rough sketch of a D-Adaptation-style loop (not the paper's exact update)."""
    x = x0
    D = D0
    s = np.zeros_like(x)  # running sum of squared gradients
    for t in range(n_iter):
        g = grad_f(x)
        s = s + g ** 2
        eta = np.sqrt(D / (s + 1e-8))  # per-coordinate step size
        x = x - eta * g
        D = max(D, np.sum(s) / (t + 1))  # grow the estimate of D
    return x
```
We will also need to define an objective function f, its gradient grad_f, an initial point x0, an initial lower bound D0 on the distance to the solution (the quantity D-Adaptation estimates), and the number of iterations n_iter. For example:
```python
def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    return 2 * x

x0 = np.ones(10)  # start away from the minimum; np.zeros(10) is already the minimizer
D0 = 1
n_iter = 1000

x = d_adaptation(x0, f, grad_f, D0, n_iter)
```
This would minimize f using D-Adaptation, starting from the initial point x0, with an initial lower bound of D0 on the distance to the solution, and running for 1000 iterations. The final point would be returned in the variable x.
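As a quick end-to-end sanity check, the pieces above can be run together (the definitions are repeated here so the snippet runs standalone, and a nonzero start is used since starting at zero would leave nothing to optimize):

```python
import numpy as np

def d_adaptation(x0, f, grad_f, D0, n_iter):
    # Same sketch as above: AdaGrad-style steps with a growing estimate D.
    x = x0
    D = D0
    s = np.zeros_like(x)
    for t in range(n_iter):
        g = grad_f(x)
        s = s + g ** 2
        eta = np.sqrt(D / (s + 1e-8))
        x = x - eta * g
        D = max(D, np.sum(s) / (t + 1))
    return x

f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x

x0 = np.ones(10)  # nonzero start, so the minimizer (zero) must actually be reached
x = d_adaptation(x0, f, grad_f, 1, 1000)
print(f(x) < f(x0))  # the objective should have decreased from f(x0) = 10
```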
Hi @kennethZhangML, thanks for taking a look. Our package is built on PyTorch so we'd like to use d-adaptation through a PyTorch optimizer class. It looks like this is already implemented in their codebase (linked above) as well as their newer version Prodigy. If it works nicely, we'll just want to update the dependencies and add some new kwargs to enable it as an option.
Tests perform worse after implementing this, sometimes failing to converge. It seems to be overestimating the learning rate. Closing.
https://openreview.net/forum?id=GXZ6cT5cvY https://github.com/facebookresearch/dadaptation