bernd-wechner / CoGs

Competitive Gamers Leaderboard Server (CoGs LS)

TrueSkill: Implement time-varying Tau (modelling skill decay properly) #13

Open bernd-wechner opened 5 years ago

bernd-wechner commented 5 years ago

There's a fundamental design issue with TrueSkill that emerged from my study of it.

Every time a new game session is logged, TrueSkill adds its dynamics factor Tau to the uncertainty (standard deviation) of each player's skill estimate; strictly, Tau² is added to the variance.

This is intended to model the rising uncertainty in your skill over the time since you played last. This uncertainty contains two major known components:

  1. A decline in skill over time without practice. There is in fact literature on this (I have a paper on it); it is studied and understood and could in principle be modelled, but the research is immature, not specific to games, and the return on modelling it is likely to be negligible, if present at all.

  2. A possible rise in skill over time, due to unrecorded practice (times you played this game but did not log results) or cross-learning (similar games you played and did log, whose experience positively impacts your skill at the game concerned). None of this can be modelled, as it relates to unknown activity between recorded sessions.

As a consequence there is some rise in uncertainty (or loss of confidence) in our assessment of your skill over time, for lack of incoming evidence to strengthen it.

And TrueSkill models this by adding a fixed value Tau to the uncertainty on each run.
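For reference, that fixed-Tau step amounts to inflating the variance by Tau² before each update. A minimal sketch in plain Python (the helper name is mine, not the library's):

```python
import math

def inflate_sigma(sigma: float, tau: float) -> float:
    """TrueSkill's dynamics step: add tau^2 to the variance (i.e. add tau to
    sigma in quadrature) before processing a new session's results."""
    return math.sqrt(sigma ** 2 + tau ** 2)

# With the stock defaults (mu=25, sigma=25/3, tau=sigma/100) this inflation is
# tiny and, crucially, the same whether the player last played yesterday or
# five years ago.
```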

But we know that this is time varying: our uncertainty grows with time. So a better model would add an amount of uncertainty that grows with the time elapsed.

One way to model this is to redefine Tau as a per diem value (a fancy Latin term for per day). If we assume their Xbox models are based on an average gap between plays of a certain duration, say a day or a week, we could redefine the default Tau as an equivalent per diem value to get the ball rolling.
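A rough sketch of that conversion, assuming (purely for illustration) that the stock Tau was calibrated for an average gap of a week between plays:

```python
# Stock TrueSkill default Tau (sigma / 100, with sigma = 25/3).
DEFAULT_TAU = 25.0 / 300

# Assumption for illustration only: the stock Tau corresponds to roughly one
# play per week, so spread it evenly over that gap.
ASSUMED_GAP_DAYS = 7
TAU_PER_DIEM = DEFAULT_TAU / ASSUMED_GAP_DAYS

def effective_tau(days_since_last_play: float) -> float:
    """Tau to apply at the next update, scaled linearly by the elapsed days."""
    return TAU_PER_DIEM * days_since_last_play
```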

This impacts leaderboard presentation also. Currently Tau is added whenever results are calculated (a session is logged), and that is easy to model: we know the last time each player logged a session for this game, so we can get the delta diem and multiply that by the per diem Tau to get a Tau to add to the variance.

The gotcha is that for "current" leaderboards, the delta diem is now (the time of viewing) minus the last session date. So the rating, which is currently simply a function of mu and sigma, must become a function of mu, sigma, tau and the delta diem. In short, as time passes a player will slip down the leaderboard.
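A minimal sketch of that display-time adjustment, assuming the common conservative rating convention mu - k*sigma (whatever k the project actually uses) and the hypothetical TAU_PER_DIEM above:

```python
import math
from datetime import datetime

def current_rating(mu: float, sigma: float, last_session: datetime,
                   now: datetime, tau_per_diem: float, k: float = 3.0) -> float:
    """Leaderboard rating as viewed 'now': inflate sigma by the Tau accrued
    since the last recorded session, then apply the usual mu - k*sigma."""
    delta_diem = (now - last_session).total_seconds() / 86400.0
    accrued_tau = tau_per_diem * delta_diem
    sigma_now = math.sqrt(sigma ** 2 + accrued_tau ** 2)
    return mu - k * sigma_now
```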

This is actually a good thing and I suspect improves TrueSkill dramatically (it remains a suspicion until tested). It solves a longstanding problem that TrueSkill has. Namely, someone can waltz in, play a 6-player game against the top 5 players on a leaderboard, win, and now dominate the leaderboard, and years later they are still there, with zero further evidence. Meanwhile, as the other five players continue to play, they now and again lose (as happens in 6-player games) and their ratings more or less stabilise over time, so the reward they receive for recording more sessions is that they lose rating strength in this way.

By moving Tau to per diem, old players who simply aren't maintaining their skills slide slowly down the leaderboard.

This does also introduce a new problem, namely one of unchecked decline, or unchecked growth in uncertainty. There is a case, for example, not to let uncertainty ever grow beyond its initial default value (that which we grant newcomers), and in fact, with one game or n games recorded, to have the confidence bottom out at some minimum value higher than that default.

There may also be an argument to approach that limit not linearly with time, but more as nature tends to dictate, with an exponential drop (worth reviewing that paper I have to see if this proves to be the case in studied examples). An exponential drop sees the value fall fast initially and flatten out, asymptotic to that minimum or limit.

There's reason to suspect that in reality the decay is more like the CDF of a Laplace distribution:

https://en.wikipedia.org/wiki/Laplace_distribution

than exponential, namely skill is fairly stable for a while, then declines with a known rate towards an asymptotic minimum.

We could model Tau not as a per diem, but as a function of time that follows this CDF pattern, with the minimum being the default plus some credit for the n games logged (the evidence at hand), and the initial value being the uncertainty at the last recorded session; we then look the delta diem up on that curve and get a value between them.

This would produce a nice decline in confidence that far better models expectations, and it remains, in a sense, within the bounds of what TrueSkill can usefully accommodate.
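A sketch of that idea, with illustrative location and scale parameters for the Laplace CDF and the cap (limit) defaulting to the newcomer Sigma; any credit for the n games logged could nudge that limit down a little:

```python
import math

def laplace_cdf(x: float, loc: float, scale: float) -> float:
    """CDF of the Laplace distribution (rises from 0 towards 1 as x grows)."""
    if x < loc:
        return 0.5 * math.exp((x - loc) / scale)
    return 1.0 - 0.5 * math.exp(-(x - loc) / scale)

def decayed_sigma(sigma_last: float, days_since_last_play: float,
                  sigma_limit: float = 25.0 / 3,  # newcomer default as the cap
                  onset_days: float = 90.0,       # assumed: decay barely felt before ~3 months
                  scale_days: float = 45.0) -> float:
    """Uncertainty transitioning from the last recorded sigma towards the limit
    along a Laplace-CDF curve: flat at first, then rising, then flattening."""
    f0 = laplace_cdf(0.0, loc=onset_days, scale=scale_days)
    f = laplace_cdf(days_since_last_play, loc=onset_days, scale=scale_days)
    f = (f - f0) / (1.0 - f0)  # renormalise so the curve starts exactly at sigma_last
    return sigma_last + (sigma_limit - sigma_last) * f
```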

bernd-wechner commented 4 years ago

This page caught my eye:

http://www.beigebag.com/case_sigmoid_half.htm

or

https://web.archive.org/web/20200203173206/http://www.beigebag.com/case_sigmoid_half.htm

I think we can model this as a Sigmoid, which has essentially the same S-shape as the Laplace CDF. I guess it's nice to have another name for it in the transition space, as we're considering a transition from one Sigma to another Sigma over time (this is what Tau modifies). Imagine Sigma_1 is the Sigma last known after a recorded game session, and Sigma_2 is the steady-state uncertainty we have knowing someone once played at a certain level a long time ago, with no idea whether they've played since.

Sigma_2 is arguably Sigma_default, the standard Sigma we apply to any new player. And so we have a transition from Sigma_current to Sigma_default.

Without much evidence, just gut feel, we would expect exponential decay (nature graces us with that almost everywhere), but the Sigmoid transition allows us to tune the exponent, and ideally we'd tune it based on some real-world data or study. Lacking that, we still want to model Sigma growth, which translates to ranking decline over time, because this is what encourages replay of games ... without replay our confidence in your skill goes down and you fall off the leaderboard ...
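A minimal sketch of that transition as a logistic Sigmoid, with the midpoint and steepness (the tunable exponent) chosen purely for illustration:

```python
import math

def sigma_over_time(sigma_current: float, sigma_default: float, days_idle: float,
                    midpoint_days: float = 180.0,  # assumed: transition half done at ~6 months
                    steepness: float = 0.03) -> float:
    """Logistic transition from the last known Sigma towards the newcomer
    default as idle time grows; 'steepness' is the tunable exponent."""
    f = 1.0 / (1.0 + math.exp(-steepness * (days_idle - midpoint_days)))
    f0 = 1.0 / (1.0 + math.exp(steepness * midpoint_days))  # value at days_idle == 0
    f = (f - f0) / (1.0 - f0)  # renormalise so we start exactly at sigma_current
    return sigma_current + (sigma_default - sigma_current) * f
```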

zass30 commented 2 years ago

> Namely someone can waltz in, play a 6 player game against the top 5 players in a leaderboard, win and now dominate the leaderboard, and years later they are still there

I've been considering the same problem of "leaderboard camping" for my implementation. I'd like to add a few thoughts on decay.

I think decay may be modeled with three parameters:

1) Time: There should be a decay function of time. The time parameter may be more complex than just the time since the last game was played. In your example above, that person who beat the top 5 players should not be able to reduce their decay to zero by playing just one game against a completely new player. The parameter may be a data structure that looks at the number of games played in the last week, or month, for example.

2) Rank: High-ranking players should decay faster than low-ranked players. In any game with a top-N leaderboard, those top N spots are very competitive and coveted. These players must play often to maintain their position. On the other hand, new players and low-ranked players do not need an aggressive decay parameter. There is no need to reduce the ranking of a low-level player with decay.

3) Variety: This measures variety of opponents. Someone who is just playing against low-level players, or against just one person repeatedly, should have more decay than someone who plays against a wide variety of opponents. This measure is meant to address "farming", where people just "farm" lower-ranked players for points. A rough sketch combining all three parameters follows below.
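Purely as an illustration of how the three parameters might combine (every weight, constant and name below is made up, not a proposal):

```python
def decay_multiplier(days_idle: float, rank: int, board_size: int,
                     distinct_opponents: int, games_played: int) -> float:
    """Combine time, rank and variety into one decay multiplier
    (larger means faster decay). All constants are illustrative."""
    # 1) Time: decay accrues with idle time (per-diem flavour, in months here).
    time_factor = days_idle / 30.0
    # 2) Rank: the top of the board decays fastest, the bottom barely at all.
    rank_factor = 1.0 - (rank - 1) / max(board_size - 1, 1)
    # 3) Variety: a narrow opponent pool ("farming") increases decay.
    variety = min(1.0, distinct_opponents / max(games_played, 1))
    variety_factor = 2.0 - variety  # 1.0 (varied) up to ~2.0 (farming one player)
    return time_factor * rank_factor * variety_factor
```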

bernd-wechner commented 2 years ago

Thanks for the thoughts.

I haven't had a chance to address this in any form on this particular project alas (snowed under with all sorts of competing commitments), but an injection of thought, attention and ideas is always stimulating ....

Given TrueSkill's approach, attempting to model skill estimation on the basis of evidence using a Bayesian approach, and that it already incorporates a factor (Tau) to model skill decay (albeit modelling it very poorly), I am inclined to approach the problem in a compatible way: namely from the perspective of modelling skill decay based on evidence (for which I do have some, albeit scant, literature to read and review).

To wit, I think the most attractive approach is to convert Tau from a constant property to a function of some measures.

The only easy measures available in a database of logged results (as this project stores) are:

  1. The calendar time since this game was last played (according to records)
  2. The calendar time since any game was last played (according to records)
  3. Your current rating at this game
  4. How your current rating compares to others

It strikes me that one could define metadata for games that makes other measures possible, and such metadata may already be available on BGG, for example. Of particular use would be properties that characterise games (like the mechanisms BGG records as community-provided metadata). With these available, two further measures become possible:

  5. The properties of the game this concerns
  6. The calendar time since any similar game was last played (where similarity is defined by the game properties of 5)

The problem with anything using measure 6 is that it adds a whole pile of complex subjectivity to the equation, and utilising measure 2 is tantamount to considering all games similar.

But we currently apply TrueSkill to every game independently, and while it is true that decay is impacted by play of related games, so too is skill acquisition. That is, we could start you at a higher rating the first time you play a game similar to one you rate well in. But that is not being done and it rapidly becomes scope creep on this issue (modelling the reality that your skill at any two different games is not independent, but related to the similarity between them, is another issue, and worth filing!).

So if we resist scope creep, this issue would perforce limit itself to using measure 1 and maybe 3/4/5. That is, tau = Tau(t, g, r, l), where t is the time since this game was last played, g is the game (from which its properties can be read), r is your rating at that game, and l is the leaderboard (all players' ranked ratings at that game).
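In code terms, something like the following signature (a sketch only; Game and Leaderboard stand in for whatever models the project actually has, and the placeholder body just does per-diem scaling):

```python
def Tau(t_days: float, g: "Game", r: float, l: "Leaderboard",
        tau_per_diem: float = 25.0 / 300 / 7) -> float:
    """Time-varying Tau. t_days: days since the player last played this game;
    g: the game (whose properties could modulate the rate); r: the player's
    current rating; l: the leaderboard of ranked ratings for the game.
    Placeholder body: plain per-diem scaling, ignoring g, r and l for now."""
    return tau_per_diem * t_days
```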

This would allow modelling some of what you've listed.

Some games may decay faster, some slower; in fact this is related to the Beta property TrueSkill already uses, and it might be that Beta is the parameter of interest. Beta of course is not currently modelled per game; we're using a standard default across all games.

And it would be possible to decay based on ranking (so the higher you rank, the faster you decay, and the lower you rank, the slower). This can be modelled with a simple function that enforces the condition that, for a given game G, the decay does not change the ranking among non-players. That is, if the game is no longer played by anyone, then the skills of all players decay over time, but they remain ranked as they are (the decay does not impact their relative rankings).
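One simple function with that order-preserving property (a sketch, with the floor and half-life chosen arbitrarily): shrink every Mu towards a common floor by a factor that depends only on idle time, so higher ratings lose more in absolute terms but the ordering among non-players can never change.

```python
def decayed_mu(mu: float, days_idle: float,
               mu_floor: float = 0.0,         # common floor every rating decays towards
               half_life_days: float = 365.0) -> float:
    """Exponential shrinkage towards a shared floor. For any fixed days_idle the
    map is strictly increasing in mu, so it never reorders two idle players:
    the higher stays higher, just by a shrinking margin."""
    factor = 0.5 ** (days_idle / half_life_days)
    return mu_floor + (mu - mu_floor) * factor
```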

To complicate matters, there are two parameters in TrueSkill that can be modified to model decay: the estimated skill (Mu) and the (lack of) confidence in that estimate (Sigma). It would be fair to expect that Mu goes down and Sigma goes up during non-play; the questions are by how much and by what functions.

To wit, for a basic model, I had intended to review some literature I collected on skill decay in other contexts.

zass30 commented 2 years ago

Thanks for the thoughtful response.

> The only easy measures available in a database of logged results (as this project stores) are:
>
> 1. The calendar time since this game was last played (according to records)
> 2. The calendar time since any game was last played (according to records)

I don't think it's too hard to do better than this. As I said in my first comment, if we only look at calendar time since the game was last played, this allows for extreme cases like someone not playing for years, playing one game against a novice, and then being back on top of the leaderboard. One could log the number of games played in various rolling windows: for example, games played in the last 24 hours, last week, last 30 days, and last year. Keeping track of each of these windows only requires an additional pointer to the end of the respective window. One could then use these additional parameters to compute a more robust decay system. For example, one might say that at the highest level there needs to be a minimum number of games played in the daily, weekly, and monthly windows for the decay to go away.
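A sketch of that bookkeeping, assuming the session timestamps are available as a sorted list (the window set is illustrative):

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Illustrative rolling windows, in days.
WINDOWS = {"day": 1, "week": 7, "month": 30, "year": 365}

def games_in_windows(session_times: list[datetime], now: datetime) -> dict[str, int]:
    """Count games per rolling window. session_times must be sorted ascending;
    bisect finds the start of each window, which is the 'pointer' idea above."""
    counts = {}
    for name, days in WINDOWS.items():
        cutoff = now - timedelta(days=days)
        counts[name] = len(session_times) - bisect_left(session_times, cutoff)
    return counts
```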

Your idea of Tau is very similar to the one applied in TrueSkillThroughTime (https://github.com/glandfried/TrueSkillThroughTime). This is a very interesting extension of TrueSkill that uses all priors to compute the next posterior. In other words, every time there is a new game you have to recompute the entire history of games from scratch. This is fine for small histories, but for an online game with tens of millions of games it's impractical. For game systems with smaller numbers (less than a few million games), such as BGG, it's probably quite good.