cnellington / Contextualized

An SKLearn-style toolbox for estimating and analyzing models, distributions, and functions with context-specific parameters.
http://contextualized.ml/
GNU General Public License v3.0

Add d-adaptation to ignore learning rate tuning #212

Closed cnellington closed 9 months ago

cnellington commented 1 year ago

https://openreview.net/forum?id=GXZ6cT5cvY https://github.com/facebookresearch/dadaptation

kennethZhangML commented 1 year ago

We could do something like this:

import numpy as np

def d_adaptation(x0, f, grad_f, D0, n_iter):
    # Gradient descent with a per-coordinate step size and an adaptively
    # grown distance estimate D. Note: f is not called directly here;
    # only its gradient grad_f is used.
    x = x0
    D = D0
    s = np.zeros_like(x)  # running sum of squared gradients
    for t in range(n_iter):
        g = grad_f(x)
        s = s + g ** 2
        eta = np.sqrt(D / (s + 1e-8))  # per-coordinate step size
        x = x - eta * g
        D = max(D, np.sum(s) / (t + 1))  # grow the distance estimate over time
    return x

We would also need to define an objective function f and its gradient grad_f, along with an initial point x0, an initial lower bound on the Lipschitz constant D0, and the number of iterations n_iter. For example:

def f(x):
    return np.sum(x ** 2)  # simple convex objective

def grad_f(x):
    return 2 * x

x0 = np.ones(10)  # start away from the minimum so there is something to optimize
D0 = 1.0          # initial lower bound estimate
n_iter = 1000

x = d_adaptation(x0, f, grad_f, D0, n_iter)

This would minimize the function f using the D-Adaptation-style update above, starting from the initial point x0, with D0 as the initial lower bound on the Lipschitz constant, and running for 1000 iterations. The final point would be returned in the variable x.

cnellington commented 1 year ago

Hi @kennethZhangML, thanks for taking a look. Our package is built on PyTorch, so we'd like to use d-adaptation through a PyTorch optimizer class. It looks like this is already implemented in their codebase (linked above), as is their newer method, Prodigy. If it works nicely, we'll just need to update the dependencies and add a few new kwargs to enable it as an option.
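For reference, a minimal sketch of what using their optimizer class might look like, assuming the dadaptation package exposes a DAdaptAdam optimizer with the usual PyTorch optimizer interface (class names and signatures should be checked against the linked repo); the model, data, and training loop here are placeholders, not Contextualized's actual training code:

import torch
from dadaptation import DAdaptAdam  # assumed import path; see facebookresearch/dadaptation

# Placeholder model and data purely for illustration.
model = torch.nn.Linear(10, 1)
X, y = torch.randn(64, 10), torch.randn(64, 1)

# D-Adaptation optimizers are typically run with lr=1.0 and adapt the
# effective step size internally, so no learning-rate tuning is needed.
optimizer = DAdaptAdam(model.parameters(), lr=1.0)
loss_fn = torch.nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

Swapping this in for an existing Adam optimizer would mostly be a matter of the dependency update and the kwargs mentioned above.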

cnellington commented 9 months ago

Tests perform worse after implementing this, randomly failing to converge; it seems to overestimate the learning rate. Closing.