vwxyzjn opened this issue 2 years ago
Hey! More thoughts on the topic as I'm working with the evaluation code. It seems like we rely on an assumption that is not as obvious as I thought, and it probably needs verification: both binary search and quality-based sorting assume that A > B and B > C implies A > C, which might not always be the case. There might be a niche strategy out there that doesn't give you much of a winrate overall but can still outplay a given agent consistently. In the case of TrueSkill, the estimate of the mean seems like it is going to be okay, but the estimate of the variance might overfit to draws. Assume the league is structured as follows:
Our prior puts the training agent in the middle of the distribution, among the "draw" agents. The quality-based estimate here should be heavily skewed towards playing against "draw" agents, because the prior belief of losing to "weird" agents is already low enough (middle agents outplay lower agents, and playing against middle agents and winning should settle the mean MMR higher). As an implication, most games go to the middle part of the leaderboard, and the algorithm decreases the variance based on those games, when, in fact, the epistemic uncertainty about the mean estimate is still high.
I can try to build a probabilistic simulation to showcase the scenario. Overall, what do you think about the transitivity assumption? Would it be a problem for this specific environment? Should we validate it somehow?
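To make it concrete, a toy version of that simulation could look like the sketch below (using the `trueskill` package; the league composition, fixed reference ratings, and true win rates are invented purely for illustration):

```python
import random
import trueskill

# Hypothetical league (names, fixed reference ratings, and true win rates are
# made up). "weird_agent" has a low rating overall but consistently beats the
# training agent, breaking the transitivity assumption.
references = {
    "passive_ai":   trueskill.Rating(mu=10),
    "random_ai":    trueskill.Rating(mu=15),
    "weird_agent":  trueskill.Rating(mu=12),
    "draw_agent_1": trueskill.Rating(mu=25),
    "draw_agent_2": trueskill.Rating(mu=26),
}
# True win probability of the training agent against each reference agent.
true_winrate = {
    "passive_ai": 0.95, "random_ai": 0.90, "weird_agent": 0.05,
    "draw_agent_1": 0.55, "draw_agent_2": 0.50,
}

training = trueskill.Rating()  # prior: middle of the distribution
match_counts = {name: 0 for name in references}

for _ in range(100):
    # Quality-based matchmaking: always pick the opponent with the best match quality.
    opponent = max(references, key=lambda n: trueskill.quality_1vs1(training, references[n]))
    match_counts[opponent] += 1
    won = random.random() < true_winrate[opponent]
    if won:
        training, _ = trueskill.rate_1vs1(training, references[opponent])
    else:
        _, training = trueskill.rate_1vs1(references[opponent], training)
    # Reference ratings stay fixed (evaluation mode); only the training agent moves.

print("training agent rating:", training)
print("matches per opponent:", match_counts)
```

In a setup like this, the training agent's sigma shrinks quickly even though almost all of its games are against the two "draw" agents, and the consistent losses to `weird_agent` are essentially never observed.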
As a safety measure, I think playing random games as part of evaluation should be able to alleviate these concerns. Or, even more targeted, random games against the agents with the highest variance. WDYT?
This is a very practical consideration. As a preliminary fix, I randomly pick one of the three opponents with the highest quality scores, which IMO helps to a certain extent.
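For reference, that fix is roughly along these lines (a sketch, not the exact code in the repo; the function and argument names are hypothetical):

```python
import random
import trueskill

def pick_opponent(training_rating, reference_ratings, k=3):
    """Pick an opponent uniformly at random among the k highest-quality matchups.

    reference_ratings: dict mapping opponent name -> trueskill.Rating.
    """
    ranked = sorted(
        reference_ratings,
        key=lambda name: trueskill.quality_1vs1(training_rating, reference_ratings[name]),
        reverse=True,
    )
    return random.choice(ranked[:k])
```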
While the weird-agent situation could definitely occur, it is unlikely to be the case with the current league composition: the bots with low true skill are things like the passive AI and the random AI. That said, I can see this becoming a bigger problem once we move to larger leagues.
> I think playing random games as part of evaluation should be able to alleviate these concerns. Or, even more targeted, random games against the agents with the highest variance.
I think this is a great idea. The only issue is time: matches like this could take a long time and do not reduce the agent's sigma.
There's another problem with the way TrueSkill is used for evaluation right now: the result depends on the order of matches, especially when we only have a handful of updates (compared to the 50+ games per player that are typically used to analyze convergence properties). The TrueSkill 2 paper has an example of how this can go wrong even with just 8 teams in a tournament. TrueSkill 2 solves this by introducing a batch update stage using the EP (expectation propagation) algorithm, and as far as I'm aware there's no open-source implementation of EP inference except the one in Infer.NET (I will be glad to be wrong on this).
I'm not sure what would be a good solution here; maybe running 1-vs-1 inference over a shuffled batch of game results (similar to how we do ML training, just with minibatch size = 2). This won't be a problem from a computation perspective, as we only need to calculate Gaussian parameters over and over, and we only have a few opponents. But I don't know if we can get theoretical guarantees that the result does not depend on the order of evaluations. I need to run a deeper investigation, and maybe play around with the evaluations that we already have.
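Roughly what I have in mind (a sketch with the `trueskill` package; the helper names are made up): replay the same batch of results in several shuffled orders, which both shows how strong the order dependence is and gives a simple aggregate estimate.

```python
import random
import statistics
import trueskill

def replay(results, reference_ratings, seed):
    """Run sequential 1-vs-1 updates over one shuffled ordering of the results.

    results: list of (opponent_name, training_agent_won) tuples.
    reference_ratings: dict mapping opponent name -> trueskill.Rating (kept fixed).
    """
    rng = random.Random(seed)
    shuffled = list(results)
    rng.shuffle(shuffled)
    training = trueskill.Rating()
    for name, won in shuffled:
        ref = reference_ratings[name]
        if won:
            training, _ = trueskill.rate_1vs1(training, ref)
        else:
            _, training = trueskill.rate_1vs1(ref, training)
    return training

def shuffled_estimate(results, reference_ratings, n_orders=20):
    """Aggregate over shuffled orderings; the spread of mu across orderings is a
    rough measure of how order-dependent the final estimate is."""
    ratings = [replay(results, reference_ratings, seed) for seed in range(n_orders)]
    mus = [r.mu for r in ratings]
    return statistics.mean(mus), statistics.pstdev(mus)
```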
Continuing the thread from #43 here because #43 is closed.
@kachayev mentioned
I am okay with both. GitHub issues are archived and easy to view for future users, which is nice, whereas Discord is easier for quick chats, so both have pros and cons.
Yes, currently it's only used for evaluation: the TrueSkill of the reference agents is fixed and we only update the TrueSkill of the training agent.
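Concretely, the evaluation update is roughly like this (a sketch assuming the standard `trueskill` package, not the exact code in the repo):

```python
import trueskill

def eval_update(training_rating, reference_rating, training_won, drawn=False):
    """Update only the training agent's rating; the reference rating stays fixed."""
    if drawn:
        new_training, _ = trueskill.rate_1vs1(training_rating, reference_rating, drawn=True)
    elif training_won:
        new_training, _ = trueskill.rate_1vs1(training_rating, reference_rating)
    else:
        _, new_training = trueskill.rate_1vs1(reference_rating, training_rating)
    return new_training, reference_rating  # reference rating returned unchanged
```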
These questions are a bit outside of my realm, haha, but I think they are interesting questions worth investigating.
The short answer is no, at least in this paradigm. I am following OpenAI Five's and AlphaStar's evaluation approach of fixing the TrueSkill of the reference agents.
See