@kachayev What are your thoughts on using such an evaluation to replace the binary search paradigm?
@vwxyzjn
"Quality" here stands for quality estimate done by TrueSkill, right?
Let me dig a little bit into the details of the implementation. I also want to run a quick simulation with a small-ish number of agents in the population. The key concern in any matchmaking process is to avoid overfitting to subpopulations (which is unlikely to happen with such a small number of players), but that needs to be verified. It's also easy to dodge by adding a Bernoulli random variable that occasionally lets the agent play a uniformly random game (the question is how best to parametrize this). Another concern, from the self-play perspective: empirically, the agent needs to play both easy and hard opponents (hard opponents give you learning signal; easy opponents act as a regularizer against high variance). I need to play around with the system to see if sampling based on quality gives you both. I'll get back as soon as my thoughts have materialized!
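For concreteness, a minimal sketch of what such mixing could look like, assuming the `trueskill` package; `pick_opponent` and the `eps` value are illustrative, not this PR's implementation:

```python
import random

import trueskill

def pick_opponent(agent_rating, pool, eps=0.1):
    """Pick an opponent name from `pool` (a dict: name -> trueskill.Rating).

    With probability `eps`, play a uniformly random game (a regularizer
    against overfitting to subpopulations); otherwise pick the opponent
    with the highest TrueSkill match quality, i.e. the least predictable
    outcome. The 0.1 default is an illustrative guess.
    """
    if random.random() < eps:
        return random.choice(list(pool))
    return max(pool, key=lambda name: trueskill.quality_1vs1(agent_rating, pool[name]))
```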
@kachayev I am going to merge this now to run some experiments, but happy to revisit later :D
Not sure this is the most convenient avenue for the discussion (let me know if you want to move this to Discord).
I finally got time to play around with this proposal. I think it's a good idea! Playing the games with the least predictable outcomes should reduce the overall uncertainty of the skill estimates. Initially I thought this was used for matchmaking during training, but it seems it's only used for evaluation. Empirically, the same approach would be problematic for picking opponents during training, but it is fine for measurement.
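A sketch of that evaluation-scheduling idea, again assuming the `trueskill` package (the function name is mine, not the PR's API): `quality_1vs1` is highest when the outcome is most uncertain, so that pair should be the most informative to play next.

```python
import itertools

import trueskill

def next_eval_match(ratings):
    """Return the pair of player names with the least predictable outcome.

    `ratings` maps player name -> trueskill.Rating. quality_1vs1 estimates
    the draw probability, which peaks when the two ratings are closest and
    most uncertain, so playing this pair shrinks the sigmas fastest.
    """
    return max(
        itertools.combinations(ratings, 2),
        key=lambda pair: trueskill.quality_1vs1(ratings[pair[0]], ratings[pair[1]]),
    )
```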
There's still an interesting concern here, or even a topic for research/investigation. TrueSkill relies on a prior estimate of the draw probability. In some games there is little chance of a draw even under a random policy (e.g. Chess, or Go), but in other games it actually takes a fairly sophisticated policy to avoid "drawing" all the time, especially given the practical limitation of a timestep budget, which typically ends the episode in a draw. Working with the MicroRTS environment we get around this (partially) because we have scripted bots that are quite good from the get-go. Another option would be an SL policy (imitation learning) to bootstrap the initial population, so that the TrueSkill estimates start with reasonably low sigma (and a reasonably low probability of drawing).
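To show where that draw prior enters, a minimal illustration with the `trueskill` package (the default `draw_probability` is 0.10; the 0.5 below is an illustrative guess for a timestep-budgeted game, not a recommendation):

```python
import trueskill

# Environment-level prior: how often we expect games to end drawn.
env = trueskill.TrueSkill(draw_probability=0.5)
a, b = env.create_rating(), env.create_rating()

# Reporting a draw keeps both means close but still shrinks the sigmas.
a, b = trueskill.rate_1vs1(a, b, drawn=True, env=env)
```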
Questions now are:
Another question: do we need to re-evaluate the skill of existing players when injecting new players into the population? Empirically, for large populations the answer is "yes". It would be interesting to run experiments here with a small population and only a few newcomers. Since we're estimating a latent variable, we don't have ground truth to compare against, but we can run a simplified version with only a few match-ups selected for each new player versus a full round robin, and measure how different the final results are.
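A toy version of that experiment, assuming the `trueskill` package; the logistic outcome model, the population size, and all constants are illustrative (a real population has no ground-truth skills to compare against, which is exactly why we synthesize them here):

```python
import itertools
import math
import random

import trueskill

random.seed(0)
GAMES_PER_PAIR = 10

# Synthetic ground-truth skills -- something a real population wouldn't have.
true_skill = {f"p{i}": random.gauss(25, 8) for i in range(8)}

def play(a, b):
    """Return True if `a` beats `b`, under an illustrative logistic model."""
    diff = true_skill[a] - true_skill[b]
    return random.random() < 1 / (1 + math.exp(-diff / 4))

def update(ratings, a, b):
    """Play one game and apply the TrueSkill 1v1 update (winner passed first)."""
    if play(a, b):
        ratings[a], ratings[b] = trueskill.rate_1vs1(ratings[a], ratings[b])
    else:
        ratings[b], ratings[a] = trueskill.rate_1vs1(ratings[b], ratings[a])

# Baseline: full round robin over all pairs.
round_robin = {p: trueskill.Rating() for p in true_skill}
for a, b in itertools.combinations(round_robin, 2):
    for _ in range(GAMES_PER_PAIR):
        update(round_robin, a, b)

# Simplified schedule: each player meets only 3 randomly chosen opponents.
sparse = {p: trueskill.Rating() for p in true_skill}
for a in sparse:
    for b in random.sample([p for p in sparse if p != a], 3):
        for _ in range(GAMES_PER_PAIR):
            update(sparse, a, b)

def ranking(scores):
    return sorted(scores, key=scores.get, reverse=True)

print("true       :", ranking(true_skill))
print("round robin:", ranking({p: r.mu for p, r in round_robin.items()}))
print("sparse     :", ranking({p: r.mu for p, r in sparse.items()}))
```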
This PR proposes a TrueSkill evaluation scheme that
Empirically, I have found that it converges sooner and plays more diverse matches.
Compared to the binary search approach, this PR's scheme is more computationally expensive when the league is large. However, when the reference agents are fixed, this approach seems to work great empirically.
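For contrast, my reading of the binary search paradigm is roughly the following sketch; the `play_match` helper and the skill-sorted ladder are assumptions for illustration, not this repo's API, and deciding each comparison from a single noisy match glosses over outcome noise:

```python
def binary_search_placement(new_agent, ladder, play_match):
    """Place `new_agent` on a skill-sorted ladder of reference agents.

    `ladder` is sorted weakest -> strongest; `play_match(a, b)` returns
    True if `a` beats `b`. Uses about log2(len(ladder)) matches instead
    of evaluating against every reference agent.
    """
    lo, hi = 0, len(ladder)
    while lo < hi:
        mid = (lo + hi) // 2
        if play_match(new_agent, ladder[mid]):
            lo = mid + 1  # stronger than the midpoint; search the upper half
        else:
            hi = mid
    return lo  # insertion index for the new agent
```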