vwxyzjn opened this issue 2 years ago
Hey! More thoughts on the topic as I'm working with the evaluation code. It seems like we rely on an assumption that is not as obvious as I thought, and it probably needs verification: both binary search and quality-based sorting assume that A > B and B > C implies A > C, which might not always be the case. There might be a niche strategy out there that doesn't give you much of a winrate overall but can still outplay a given agent consistently. In the case of TrueSkill, the estimate of the mean seems like it is going to be okay, but the estimate of the variance might overfit to draws. Assume the league is structured as follows:
Our prior puts the training agent in the middle of the distribution, among the "draw" agents. The quality-based estimate here should be heavily skewed towards playing against "draw" agents, because the prior belief of losing to "weird" agents is already low enough (middle agents outplay lower agents, and playing against middle agents and winning should settle the mean MMR higher). As an implication, most games go to the middle part of the leaderboard, and the algorithm decreases the variance based on those games, when, in fact, the epistemic uncertainty about the mean estimate is still high.
I can try to build a probabilistic simulation to showcase the scenario. Overall, what do you think about the transitivity assumption? Would it be a problem for this specific environment? Should we validate it somehow?
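To make it concrete, a toy version of that simulation could look like the sketch below (using the `trueskill` package; the league composition, fixed reference ratings, and true win rates are invented purely for illustration):

```python
import random
import trueskill

# Hypothetical league (names, fixed reference ratings, and true win rates are
# made up). "weird_agent" has a low rating overall but consistently beats the
# training agent, breaking the transitivity assumption.
references = {
    "passive_ai":   trueskill.Rating(mu=10),
    "random_ai":    trueskill.Rating(mu=15),
    "weird_agent":  trueskill.Rating(mu=12),
    "draw_agent_1": trueskill.Rating(mu=25),
    "draw_agent_2": trueskill.Rating(mu=26),
}
# True win probability of the training agent against each reference agent.
true_winrate = {
    "passive_ai": 0.95, "random_ai": 0.90, "weird_agent": 0.05,
    "draw_agent_1": 0.55, "draw_agent_2": 0.50,
}

training = trueskill.Rating()  # prior: middle of the distribution
match_counts = {name: 0 for name in references}

for _ in range(100):
    # Quality-based matchmaking: always pick the opponent with the best match quality.
    opponent = max(references, key=lambda n: trueskill.quality_1vs1(training, references[n]))
    match_counts[opponent] += 1
    won = random.random() < true_winrate[opponent]
    if won:
        training, _ = trueskill.rate_1vs1(training, references[opponent])
    else:
        _, training = trueskill.rate_1vs1(references[opponent], training)
    # Reference ratings stay fixed (evaluation mode); only the training agent moves.

print("training agent rating:", training)
print("matches per opponent:", match_counts)
```

In a setup like this, the training agent's sigma shrinks quickly even though almost all of its games are against the two "draw" agents, and the consistent losses to `weird_agent` are essentially never observed.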
As a safety measure, I think playing random games as part of evaluation should be able to alleviate these concerns. Or, even more targeted, random games against the agents with the highest variance. WDYT?
This is a very practical consideration. As a preliminary fix, I randomly pick one of the three opponents with the highest quality scores, which IMO helps to a certain extent.
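For reference, that fix is roughly along these lines (a sketch, not the exact code in the repo; the function and argument names are hypothetical):

```python
import random
import trueskill

def pick_opponent(training_rating, reference_ratings, k=3):
    """Pick an opponent uniformly at random among the k highest-quality matchups.

    reference_ratings: dict mapping opponent name -> trueskill.Rating.
    """
    ranked = sorted(
        reference_ratings,
        key=lambda name: trueskill.quality_1vs1(training_rating, reference_ratings[name]),
        reverse=True,
    )
    return random.choice(ranked[:k])
```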
While the weird-agent situation could definitely occur, it is unlikely to be the case with the current league composition: the bots with low true skill are things like the passive AI and the random AI. That said, I can see this becoming a bigger problem once we move to larger leagues.
> I think playing random games as part of evaluation should be able to alleviate these concerns. Or, even more targeted, random games against the agents with the highest variance.
I think this is a great idea. The only issue is time: matches like this could take a long time and do not reduce the agent's sigma.
There's another problem with the way TrueSkill is used for evaluation right now: the result depends on the order of matches, especially when we only have a handful of updates (compared to the 50+ games per player that are typically used to analyze convergence properties). The TrueSkill 2 paper has an example of how this can go wrong even with just 8 teams in a tournament. TrueSkill 2 solves this by introducing a batch update stage using the EP (expectation propagation) algorithm, and as far as I'm aware there's no open-source implementation of EP inference except the one in Infer.NET (I will be glad to be wrong on this).
I'm not sure what would be a good solution here; maybe running 1-vs-1 inference over a shuffled batch of game results (similar to how we do ML training, just with minibatch size = 2). This won't be a problem from a computation perspective, as we only need to calculate Gaussian parameters over and over, and we only have a few opponents. But I don't know if we can get theoretical guarantees that the result does not depend on the order of evaluations. I need to run a deeper investigation, and maybe play around with the evaluations that we already have.
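Roughly what I have in mind (a sketch with the `trueskill` package; the helper names are made up): replay the same batch of results in several shuffled orders, which both shows how strong the order dependence is and gives a simple aggregate estimate.

```python
import random
import statistics
import trueskill

def replay(results, reference_ratings, seed):
    """Run sequential 1-vs-1 updates over one shuffled ordering of the results.

    results: list of (opponent_name, training_agent_won) tuples.
    reference_ratings: dict mapping opponent name -> trueskill.Rating (kept fixed).
    """
    rng = random.Random(seed)
    shuffled = list(results)
    rng.shuffle(shuffled)
    training = trueskill.Rating()
    for name, won in shuffled:
        ref = reference_ratings[name]
        if won:
            training, _ = trueskill.rate_1vs1(training, ref)
        else:
            _, training = trueskill.rate_1vs1(ref, training)
    return training

def shuffled_estimate(results, reference_ratings, n_orders=20):
    """Aggregate over shuffled orderings; the spread of mu across orderings is a
    rough measure of how order-dependent the final estimate is."""
    ratings = [replay(results, reference_ratings, seed) for seed in range(n_orders)]
    mus = [r.mu for r in ratings]
    return statistics.mean(mus), statistics.pstdev(mus)
```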
Continuing the thread from #43 here because #43 is closed.
@kachayev mentioned
I am okay with both. GitHub issues are archived and easy to view for future users, which is nice, whereas Discord is easier for quick chats, so both have pros and cons.
Yes, currently it's only used for evaluation: the TrueSkill of the reference agents is fixed and we only update the TrueSkill of the training agent.
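Concretely, the evaluation update is roughly like this (a sketch assuming the standard `trueskill` package, not the exact code in the repo):

```python
import trueskill

def eval_update(training_rating, reference_rating, training_won, drawn=False):
    """Update only the training agent's rating; the reference rating stays fixed."""
    if drawn:
        new_training, _ = trueskill.rate_1vs1(training_rating, reference_rating, drawn=True)
    elif training_won:
        new_training, _ = trueskill.rate_1vs1(training_rating, reference_rating)
    else:
        _, new_training = trueskill.rate_1vs1(reference_rating, training_rating)
    return new_training, reference_rating  # reference rating returned unchanged
```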
These questions are a bit outside of my realm, haha, but I think they are interesting questions worth investigating.
The short answer is no, at least in this paradigm. I am following OpenAI Five's and AlphaStar's evaluation approach of fixing the TrueSkill of the reference agents.
See