andyljones / boardlaw

Scaling scaling laws with board games.
https://andyljones.com/boardlaw
MIT License

activelo initializations #11

Closed andyljones closed 3 years ago

andyljones commented 3 years ago

This issue was shifted over from the private activelo repo; no point in keeping it separate now that boardlaw's public.

Hey Mario! I've carved out the active-ELO-learning chunk of my project into this repo. There's a lot about it I'm not happy with, but the bit relevant to geotorch is in a fairly good state.

For end-users (ie, me) the package is meant to be used as per examples.suggestions.simulate: the user repeatedly plays some games among the agents, uses the activelo.Solver to fit a distribution to the posterior, then has activelo.suggest take that posterior and use it to suggest which matchup to play next.
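
For concreteness, here's a rough sketch of that loop. The names `activelo.Solver` and `activelo.suggest` come from the description above, but the exact signatures and the `play` helper here are my guesses rather than the real API; see examples.suggestions.simulate for the genuine article.

```python
import numpy as np
import activelo

# Hypothetical sketch only: signatures below are assumptions, not activelo's real API.
n_agents = 8
games = np.zeros((n_agents, n_agents))   # n_ij: games played between i and j
wins = np.zeros((n_agents, n_agents))    # w_ij: games i won against j

solver = activelo.Solver(n_agents)
for _ in range(100):
    soln = solver(games, wins)            # fit a normal approximation to the Elo posterior
    i, j = activelo.suggest(soln)         # pick the most informative matchup to play next
    won = play(i, j)                      # hypothetical helper: play one game, return 1 if i won
    games[i, j] += 1; games[j, i] += 1
    wins[i, j] += won; wins[j, i] += 1 - won
```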

For geotorch purposes, the whole suggestion thing is irrelevant except maybe as a way to generate test data. The bit relevant to geotorch is the solver, in activelo.solvers. It has some light tests at the bottom of the file, and some examples in examples.solvers.

There're two solver examples included in the repo, as listed in examples.solvers.saved_examples(). The first is an organic example, collected while working on my main AZ project. The second is an artificial one that seems to reliably misbehave, either through a negative-definite covariance or a line-search failure. The misbehaving example is a lot bigger than the ones I'm running into day-to-day, so I'm not too pressed to fix them up yet.

If you'd like more examples, try running examples.suggestions.simulate_log_ranks and then taking a slice of the trace.

Eyeballing the imports, what's required to get this working should be

pip install torch torchvision numpy scipy matplotlib geotorch

I'd also recommend pip install aljpy, just so you can use aljpy.extract().

Some bits of the repo also make fair use of dotdicts/arrdicts, which are a very-much-idiosyncratic datatype I'm extremely fond of for research work.
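
If it helps to picture them, the core idea is just a dict whose keys double as attributes, plus some recursive mapping over leaves. A toy sketch of the concept (not aljpy's actual implementation):

```python
class dotdict(dict):
    """Toy sketch: a dict whose keys can also be read as attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as e:
            raise AttributeError(key) from e

    def map(self, f):
        # Apply f to every leaf, recursing into nested dotdicts
        return dotdict({k: v.map(f) if isinstance(v, dotdict) else f(v)
                        for k, v in self.items()})

d = dotdict(mu=0., sigma=1.)
print(d.mu, d.map(lambda x: 2*x).sigma)  # 0.0 2.0
```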

I've basically thrown this together in an afternoon. I'm happy to clean it up and answer questions if you find anything confusing or hard to follow :)

Finally, as I mentioned in the email: if you look through this and think 'lol never mind', there'll be no hard feelings!

andyljones commented 3 years ago

In case it helps, the following are my brief markdown-LaTeX notes on the maths involved in the VB derivation. Install this Chrome extension to get them to render.

Each agent $i$ has a 'skill' $s_i \sim N(\mu_0, \sigma_0)$. In a matchup with another agent $j$, the probability of $i$ winning depends on the logit'd difference in skills: $$ p_{ij} = \frac{1}{1 + e^{-(s_i - s_j)}} $$ and the number of wins in $n_{ij}$ games between $i$ and $j$ is drawn from the binomial distribution $$ w_{ij} \sim \text{Binom}(n_{ij}, p_{ij}). $$ Some bits of the above equations are going to turn up a lot in what follows, so let's shorten $p_{ij}$'s definition into $p_{ij} = \phi(d_{ij})$, where $d_{ij} = s_i - s_j$.
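
As a sanity check of the setup, here's a short sketch of sampling from this model (plain numpy; the specific numbers are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, mu0, sigma0 = 5, 0., 1.

# Each agent's latent skill s_i ~ N(mu0, sigma0)
s = rng.normal(mu0, sigma0, n_agents)

# Win probabilities: p_ij = sigmoid(s_i - s_j)
d = s[:, None] - s[None, :]
p = 1/(1 + np.exp(-d))

# Play n_ij games per pair (no self-play) and draw the wins binomially
n = np.full((n_agents, n_agents), 16)
np.fill_diagonal(n, 0)
w = rng.binomial(n, p)
```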

Now I want to apply variational Bayes here, with a full-rank normal approximation $q(s) = N(s|\mu, \Sigma)$. Of the many approaches to variational Bayes, our specific approach is going to be gradient ascent on the evidence lower bound. That means repeatedly evaluating

$$ \nabla \text{ELBO} = \nabla (\mathbb{E}_q[\ln p(w, s)] + H[q]) $$

Of the two terms on the RHS there, $H[q]$ - the entropy of the normal distribution - is pretty easy! So let's have a look at the hard term, and in particular the thing we're taking the expectation of: $$ \ln p(w, s) = \left[ \sum w_{ij} \ln p_{ij} + (n_{ij} - w_{ij}) \ln (1 - p_{ij}) \right] - \left[ \frac{1}{2\sigma_0^2}\sum_i(s_i - \mu_0)^2 \right] + C $$ The first term comes out of the binomial distribution; the second term comes from the normal prior; and the $C$ captures some independent-of-$s$ terms we don't care about.
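
In code, that un-normalised log-joint is only a few lines (a sketch following the notation above, with the constant $C$ dropped):

```python
import torch
import torch.nn.functional as F

def log_joint(s, w, n, mu0=0., sigma0=1.):
    """ln p(w, s) up to the constant C, with p_ij = sigmoid(s_i - s_j)."""
    d = s[:, None] - s[None, :]
    # ln p_ij = logsigmoid(d_ij), ln (1 - p_ij) = logsigmoid(-d_ij)
    likelihood = (w*F.logsigmoid(d) + (n - w)*F.logsigmoid(-d)).sum()
    prior = -((s - mu0)**2).sum()/(2*sigma0**2)
    return likelihood + prior

# e.g. with the toy data sampled above:
# log_joint(torch.as_tensor(s), torch.as_tensor(w, dtype=torch.float),
#           torch.as_tensor(n, dtype=torch.float))
```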

How's this change under the expectation? Let's go one term at a time. First up is

$$E_q \ln p_{kj} = E [ \ln \phi(s_k - s_j) | N(s|\mu, \Sigma) ]$$

This looks like a misery to evaluate, and it is. But one fortunate thing is that we're only depending on $\mu, \Sigma$ constrained to $j, k$. More, the tricky bit - $\ln \phi(s_k - s_j)$ - depends only on $d = \frac{1}{2}(s_k - s_j)$. It's independent of the orthogonal direction $m = \frac{1}{2}(s_k + s_j)$! So we can change variables from integrating over $s_k, s_j$ to integrating over $d, m$, then integrate over $m$ on the inside. This leaves us with $$ E_q \ln p_{kj} = E[ \ln \phi(d) | N(d|R\mu, R\Sigma R^T) ] $$ where $R = [+\frac{1}{2}, -\frac{1}{2}]$ changes the vars from $(s_j, s_k)$ to $(d, m)$ and simultaneously marginalizes out/drops $m$.

By marginalizing out $m$, we're left with only the integral over $d$. And that only depends on two parameters $\mu_d, \sigma^2_d$, so we can pre-emptively cache the integral's value for the full range of likely values! Then evaluating the expectation becomes a simple interpolation between the cached values.
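
To illustrate the caching idea (this is not the repo's actual implementation): one way is to evaluate $E[\ln\phi(d)]$ by Gauss-Hermite quadrature on a grid of plausible $(\mu_d, \sigma_d)$ values, then interpolate at solve time:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Gauss-Hermite nodes/weights for integrals against exp(-x^2)
xs, ws = np.polynomial.hermite.hermgauss(40)

def expected_log_sigmoid(mu_d, sigma_d):
    # E[ln sigmoid(d)] with d ~ N(mu_d, sigma_d^2)
    d = mu_d + np.sqrt(2)*sigma_d*xs
    return (ws * -np.log1p(np.exp(-d))).sum()/np.sqrt(np.pi)

# Pre-compute the expectation on a grid of likely (mu_d, sigma_d) values
mus = np.linspace(-10, 10, 201)
sigmas = np.linspace(1e-3, 5, 101)
grid = np.array([[expected_log_sigmoid(m, s) for s in sigmas] for m in mus])
cache = RegularGridInterpolator((mus, sigmas), grid)

cache([[1.5, 0.3]])   # fast approximate E[ln phi(d)] for mu_d=1.5, sigma_d=0.3
```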

andyljones commented 3 years ago

I also tried the alternative PSD implementation you sent me, and found it a lot less stable on the problems I tried it with - principally the saved_example('data/2020-11-27 21-32-59 az-test symmetric.npz'). I noticed it inherits from Symmetric rather than SymF like the normal implementation does; is that intentional? Bit out of my depth here.

lezcano commented 3 years ago

Hi Andy!

Thanks for all this! I'll give it a try sometime tomorrow. So, just to check that I am not missing anything, the bit that uses GeoTorch is the class Solver in activelo/solvers.py, right? Then, I should be able to use it simply by calling saved_example in activelo/examples/solvers.py, and that should give me one of those fantastic plots that you posted on Twitter. This is great! Again, I'll play with it tomorrow and come back to you with what I find.

For what it's worth, there might be two reasons why the other PSD implementation works worse. For one, as I said, the current way the initialisation of parameters is handled in geotorch is a bit of a mess. It will get better once I roll out the improved version of the Parametrisation framework. For example, as it stands, when you set a parametrisation it does not keep the previous weight; it samples a new one. This is because sometimes sampling a matrix in a manifold is not as simple as just setting it to the identity matrix (think low-rank optimisation). In any case, I'll try to test the PSD class vs the PSDReductive class using the same initialisation and see whether it is still much less stable (it might very well be the case).

And yes, the implementation is correct in using Symmetric rather than SymF; they are just two different parametrisations of the PSD matrices, in the same way that f(x) = |x| and f(x) = x² are two different parametrisations of the non-negative real numbers :)

andyljones commented 3 years ago

> Thanks for all this! I'll give it a try sometime tomorrow. So, just to check that I am not missing anything, the bit that uses GeoTorch is the class Solver in activelo/solvers.py, right? Then, I should be able to use it simply by calling saved_example in activelo/examples/solvers.py, and that should give me one of those fantastic plots that you posted on Twitter. This is great! Again, I'll play with it tomorrow and come back to you with what I find.

examples/solvers.py rather than activelo/examples/solvers.py, but yep! Or, well, the descendant of that graph. saved_example('data/2020-11-27 21-32-59 az-test symmetric.npz') should give you

[image: the solver's diagnostic plots]

Left to right,

I expect you'll care more about the first two than the last two. If you capture the output into a variable, you'll have a data structure with all the information from the plots:

>>> soln = saved_example('data/2020-11-27 21-32-59 az-test symmetric.npz')
>>> soln
arrdict:
μ        ndarray((67,), float32)
Σ        ndarray((67, 67), float32)
μd       ndarray((67, 67), float32)
σd       ndarray((67, 67), float32)
trace    arrdict:
         l           ndarray((76,), float64)
         gradnorm    ndarray((76,), float32)
         relnorm     ndarray((76,), float32)
         Σ           ndarray((76, 67, 67), float32)

and inspect soln.

Thanks for explaining the initializations thing - I realised something was up after noticing that my 'warm starts' of successive optimizations... did not go well. Starting afresh every time and switching LBFGS to strong_wolfe line search made for two substantial improvements in stability.
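
For anyone following along, the line-search switch is just a constructor argument on torch's LBFGS. A minimal sketch, with a toy objective standing in for the negative ELBO:

```python
import torch

# Toy stand-in for the negative ELBO, just to show the optimizer setup
def objective(x):
    return (x - 1.).pow(2).sum()

x = torch.zeros(10, requires_grad=True)
opt = torch.optim.LBFGS([x], line_search_fn='strong_wolfe')

def closure():
    # LBFGS re-evaluates the objective during its line search, hence the closure
    opt.zero_grad()
    loss = objective(x)
    loss.backward()
    return loss

for _ in range(10):
    opt.step(closure)
```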

andyljones commented 3 years ago

NB: Just realised I use a lot of unicode chars in this. If you're using vscode, I have an extension to tab-complete latex characters. And Jupyter has it built in!

If you don't use vscode and your dev env doesn't have anything equivalent, I might just have to find-replace them all.

lezcano commented 3 years ago

Yeah, I realised. I use vim, so for now I'll just copy and paste them if I need to mess around with them :)

lezcano commented 3 years ago

For what it's worth, I have not forgotten about this, I just haven't had the time to get to it yet. I will try to run it and play with it sometime this week :)

andyljones commented 3 years ago

Don't sweat it - you're out-and-out volunteering your time here, there's no obligation to fulfil :)

lezcano commented 3 years ago

Hi again Andy!

So sorry for not responding for a WHOLE MONTH. Christmas, almost finishing the thesis, etc etc. But in any case, I hope you are well and I hope you've had a good break!

So, there's good news on the GeoTorch side. I have implemented a proper initialisation system. I am sure you remember that it was not very clear how the initialisation worked before with GeoTorch. Well, now you can do as follows:

self.register_parameter("Sigma", nn.Parameter(torch.empty(n,n)))
geotorch.positive_definite(self, "Sigma")
# At this point, self.Sigma is initialised to a random PSD matrix following a Wishart distribution with diagonal covariance. See
# https://geotorch.readthedocs.io/en/latest/psd/psd.html#geotorch.PSD.sample

# We can also initialise it to be the identity matrix by simply assigning to it
self.Sigma = torch.eye(n)

And that's all there is to it. You can use any way of sampling a matrix: just assign the sampled matrix to the parameter and it should work, provided the sampled matrix is PSD, of course.
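
Pulled together into a self-contained module, the snippet above would look something like this (a sketch assuming current geotorch master, with the identity initialisation done by assignment as described; the module name is just for illustration):

```python
import torch
import torch.nn as nn
import geotorch

class Posterior(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.register_parameter("Sigma", nn.Parameter(torch.empty(n, n)))
        geotorch.positive_definite(self, "Sigma")
        # Override the default random (Wishart) initialisation with the identity
        self.Sigma = torch.eye(n)

module = Posterior(67)
print(module.Sigma.shape)  # torch.Size([67, 67])
```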

Also, the current initialisation is "more correct" (read: "I fixed something and now it should work better") than the one before, so if you just rerun the code you should hopefully see some improvement (at least theoretically; perhaps it does not change anything in practice...).

All this is already in the master branch, so you can play with it. At the moment it's not really reflected in the documentation; I have to spend some time on that, which I think I'll do sometime this week. I also hope to add some examples and notebooks explaining how everything works behind the scenes, which, as you very rightly mentioned, is not very clear at the moment.

In any case, if you get around to trying all these things, please tell me how they feel and whether they seem reasonable!

andyljones commented 3 years ago

Great, thanks for the update! I'm down a different branch of work right now, but should work my way back to the top in the next few days. Will tell you how it turns out.

andyljones commented 3 years ago

Hi again Mario!

So sorry for not responding for a WHOLE MONTH. No good excuses on my part I'm afraid, it just took a long time for my attention to circle back round to activelo. But today it did, because I realised that a lot of the noise in my Elo estimates was coming from activelo's optimization process rather than the data itself. I think the fundamental problem is that the solution to these problems lies in a very wide, shallow basin, and without some very strict termination criteria that verge on numerical instability, the optimizer isn't going to bother finding the actual minimum rather than something that's almost the minimum.

Fortunately I'm a-okay with slightly suboptimal solutions. What I wasn't okay with was getting a wildly different suboptimal solution on each run, causing my estimated elos to jump around. So I tried out your initializations, and they worked a charm! Twice as fast on my toy problem, and a damn sight more stable:

[image: optimisation traces]

Could do without the plateaus, but I suspect that can be solved with a bit of termination criteria tuning. Anyway, thanks for solving a problem for me once again! 🎉

lezcano commented 3 years ago

Always happy to help :)