Players have an Elo-based ranking

jeffreywescott commented 8 years ago

@shereefb commented on Mon Aug 01 2016

Context

The game uses XP for two distinct purposes: first, to show how much someone has played the game and contributed to the projects they were on; second, to reflect player rating.
While XP is sufficient for the first purpose, it's imprecise and misleading for the second, for the following reasons:
1. Players join the game with different backgrounds/levels of experience. This is 'hacked' in the case of advanced players, and not represented in the case of new players.
2. Two 'beginner' players contributing equally to an easy project gain the same 'rating' as two 'intermediate' players contributing equally to a medium-difficulty project.
3. A 'beginner' player contributing equally on a team of beginners gains more 'rating' than an intermediate player contributing less on a team of experts.

To solve for this, the amount of XP a player gains on a project should be a function of their rank compared to the rank of the players they play with, in addition to their relative contribution and hours.

To do this, we need a stat that embodies rating and is separate from XP.

Having a separate stat that represents rating allows us to:

Form teams based on people's skill level (rating) not based on how much they've played the game.
Better calculate XP distribution and relative contribution by using players' rating to adjust expected relative contribution.
Weigh feedback more heavily for higher rated players (e.g. project completion and quality)
Introduce players of different experience levels to the game at any time and have the team formation algorithm adjust for them quicker.

Proposed Solution

The Elo Rating System is widely used (chess, FIFA, online gaming, etc.) as a system for rating two-team competitive games.
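
For reference, the standard Elo update for a single two-player contest can be sketched as below. The function names and the K value of 32 are illustrative assumptions, not values chosen in this proposal.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (A, B) ratings after one contest.

    score_a is 1.0 for a win by A, 0.5 for a tie, 0.0 for a loss.
    k = 32 is an illustrative default, not a value from the proposal.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


# Two equally rated players: the winner gains exactly what the loser gives up.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

A tie between equally rated players leaves both ratings unchanged, which matches the first example in the proposal.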

Every project retrospective can be seen as multiple two-player 'contests', with each player competing to contribute the most per hour spent on the project.

For example, assume a project where the following players contribute the following:

player	hours	relative contribution
shereef	20	20%
tanner	40	40%
jeffrey	40	40%

Each player's relative contribution per hour is calculated by dividing their rc by hours:

player	hours	rc	rc/h
shereef	20	20%	1
tanner	40	40%	1
jeffrey	40	40%	1

The retrospective is now broken down into three two-player contests:

contest	rc/h : rc/h	winner
shereef v.s. tanner	1:1	tie
shereef v.s. jeffrey	1:1	tie
jeffrey v.s. tanner	1:1	tie

As a result of this example, none of the player ratings would change, but all of their XP would increase, with Jeffrey and Tanner's XP increasing by double what Shereef's XP increased by.

Taking another example, where the players have different skill levels and different contributions:

player	hours	rc	rc/h
shereef	20	40%	2
tanner	40	40%	1
jeffrey	40	20%	0.5

Breaking down the retrospective to three contests:

contest	rc/h : rc/h	winner
shereef v.s. tanner	2:1	shereef
shereef v.s. jeffrey	2:0.5	shereef
jeffrey v.s. tanner	0.5:1	tanner

Shereef wins two contests, and Tanner wins one.
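
The breakdown above can be sketched in a few lines. This is a minimal illustration using the numbers from the second example; the variable and function names are hypothetical, and ties are detected by exact equality only (no tolerance band).

```python
from itertools import combinations

# Second example from the text: (hours, relative contribution in percent).
players = {
    "shereef": (20, 40),
    "tanner": (40, 40),
    "jeffrey": (40, 20),
}

# Relative contribution per hour (percentage points per hour).
rc_per_hour = {name: rc / hours for name, (hours, rc) in players.items()}


def contest_winner(a: str, b: str):
    """Return the winner's name, or None for an exact tie."""
    if rc_per_hour[a] == rc_per_hour[b]:
        return None
    return a if rc_per_hour[a] > rc_per_hour[b] else b


# Every unordered pair of teammates is one contest.
contests = {(a, b): contest_winner(a, b) for a, b in combinations(players, 2)}
for (a, b), winner in contests.items():
    print(f"{a} vs. {b}: {rc_per_hour[a]:g}:{rc_per_hour[b]:g} -> {winner or 'tie'}")
```

A team of n players produces n(n-1)/2 contests, so a team of five yields ten rating updates per retrospective.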

Considerations

1. There should be a range within which two players' rc/h is considered a tie. For example, if Tanner and Jeffrey have rc/h of 0.9 and 0.92 respectively, it doesn't make sense to consider this a "win" for Jeffrey.
2. Related to 1, if someone's rc/h is three times another player's, they should get more of a rating bump than if it were just 1.5 times another player's. In other words, perhaps we should adjust the Elo ranking to consider 'degrees' of winning and losing. The way Elo rating is used in Go could offer a good starting point for tackling this. Also see margin of victory adjustments.
3. K factor needs to be adjusted intelligently. It needs to start off large and decrease over time (total hours played? total retros? XP?), allowing player rankings to swing more wildly earlier and settle later in the game. K is varied depending on the rating of the players, because of the low confidence in (lower) ratings (high fluctuation in the outcome) but high confidence in pro ratings (stable, consistent play). In Go, K is 116 at rating 100 and 10 at rating 2700.
4. In chess (or other ranked games) the games are played sequentially, and each ranking is adjusted before the next contest starts. In our case, multiple games are played in parallel. Here we have a tough choice, either:

   a. Order games sequentially, and adjust player rankings after each contest.
   b. Adjust player rankings "in parallel".

Choosing a. gives us a more accurate distribution of rank, but disadvantages players based on how the games were ordered sequentially. Choosing b. is more fair, but does not give us as accurate a distribution.

For example, in the second example above, if Jeffrey loses to Tanner after Tanner loses to Shereef, then Jeffrey's final rating will be lower than if he loses to Tanner before Tanner loses to Shereef.
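
The ordering effect described above can be checked with a small sketch. It assumes standard Elo with win/loss scoring, every player starting at 1000, and K = 32 (all illustrative choices), and it leaves out the Shereef vs. Jeffrey contest to keep the comparison readable.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def play(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one win/loss contest in place (sequential update)."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


START = {"shereef": 1000.0, "tanner": 1000.0, "jeffrey": 1000.0}

# Order A: Tanner loses to Shereef first, then Jeffrey loses to Tanner.
order_a = dict(START)
play(order_a, "shereef", "tanner")
play(order_a, "tanner", "jeffrey")

# Order B: Jeffrey loses to Tanner first, then Tanner loses to Shereef.
order_b = dict(START)
play(order_b, "tanner", "jeffrey")
play(order_b, "shereef", "tanner")

# Parallel: every delta is computed from the pre-retrospective ratings.
parallel = dict(START)
for winner, loser in [("shereef", "tanner"), ("tanner", "jeffrey")]:
    delta = 32.0 * (1.0 - expected_score(START[winner], START[loser]))
    parallel[winner] += delta
    parallel[loser] -= delta

print(order_a["jeffrey"], order_b["jeffrey"], parallel["jeffrey"])
```

Under these assumptions Jeffrey ends slightly lower in order A than in order B, because in order A he loses to a Tanner whose rating has already dropped. The parallel update gives Jeffrey the same rating as order B, since his only contest here is scored against the starting ratings.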

If we end up using a margin-based Elo (similar to Go), it might make more sense to run the games "in parallel".
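
One way to express 'degrees' of winning is to replace the win/tie/loss score with a fractional one. The mapping below (a player's share of the combined rc/h) is a hypothetical choice for illustration only; the thread does not specify which margin formula should be used.

```python
def margin_score(rc_h_a: float, rc_h_b: float) -> float:
    """Hypothetical fractional result for A: A's share of the combined rc/h.

    Returns 0.5 for equal rates and approaches 1.0 as A out-contributes B.
    """
    total = rc_h_a + rc_h_b
    if total == 0:
        return 0.5  # neither player contributed; treat as a tie
    return rc_h_a / total


# Contests from the second example.
print(round(margin_score(2, 1), 2))    # shereef vs. tanner
print(round(margin_score(2, 0.5), 2))  # shereef vs. jeffrey
print(round(margin_score(0.5, 1), 2))  # jeffrey vs. tanner
```

With this mapping a 2:0.5 result scores higher for Shereef than a 2:1 result, and near-equal rates land close to 0.5 without needing a separate tie threshold.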

@shereefb commented on Mon Aug 01 2016

@prattsj , @jeffreywescott , @bundacia heads up.

One of my main takeaways from introducing stats to learners is how "God stats" like XP that attempt to roll up many sub-stats may be less useful than I previously thought they would be. XP is trying to do too much, and we're using it for at least two different purposes.

I've been thinking since Friday about using a rating/ranking algorithm along with XP, and wanted to capture my thoughts in a game design issue.

Wanted to give you guys an early heads up about this so that you can weigh in early and frequently as this comes down the pipeline.

Next step for me is to discuss with @tannerwelsh and, if he's up for it, start to run some Elo ranking simulations with the current data set we have and see whether or not it gives us a more accurate/higher resolution picture of our learners as they stand (by comparing to XP, and to Jared's, Jrob's, and Mihai's rankings).

There are a lot of tweaks to how Elo can be used (K-factor, parallel vs. sequential, discrete vs. range, etc.), so I suspect it will take a bit of playing around with the data before we have a sense of whether or not this is a useful stat.

I can't imagine this hitting engineering backlog in the next two or three weeks but I wanted to get in the habit of including you folks as early as possible in potential paradigm-changing issues. If you have the bandwidth would love your feedback/thoughts/ideas...etc.

@prattsj commented on Mon Aug 01 2016

Thanks, @shereefb. Feels really good to get a heads up on and have access to the convo about something this significant and complex so early. Will probably put off my own personal dive into the details here until after this week given our tight timeline, but I'm looking forward to coming up to speed. Super interesting stuff.

@shereefb commented on Tue Aug 02 2016

Running a quick and dirty script to calculate ELO ratings. Ordered games by cycle, and ran sequentially, initial K-factor of 80 for first 10 games, and dropping to 16 afterwards:

Name	Rating	XP	Elo Pod Rank	XP Pod Rank
Jared Grippe	1271	217	1	2
Mihai Banulescu	1232	154	2	3
John Roberts	1227	227	3	1
Nico	1040	57	4	12
Rachel	1034	71	5	8
Devon Wesley	1022	80	6	4
Phillip Lorenzo	1015	54	7	13
EthanJStark	981	76	8	6
Ej	974	23	9	20
Aileen Santos	971	73	10	7
Majid Rahimi	951	78	11	5
James D Stewart	946	62	12	10
Shaka Lee	928	61	13	11
John Hopkins	921	66	14	9
anasauce	902	30	15	18
Yaseen Hussain	890	49	16	15
Harman Singh	874	52	17	14
Syd Rothman	855	37	18	17
Moniarchy	828	40	19	16
Thomas W. Smith	825	29	20	19

https://gist.github.com/shereefb/5b7e707b439b66aa5079a8326fc1052b
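
The K-factor schedule described above (80 for the first 10 games, 16 afterwards) could look roughly like this. It is a sketch, not the gist's code: it assumes the game count is tracked per player and that each player moves by their own K, neither of which is stated in this comment.

```python
def k_factor(games_played: int, initial_k: float = 80.0,
             settled_k: float = 16.0, provisional_games: int = 10) -> float:
    """K for a player's next game, given how many games they have already played."""
    return initial_k if games_played < provisional_games else settled_k


def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(ratings: dict, games: dict, a: str, b: str, score_a: float) -> None:
    """Update a and b in place; each player moves by their own K (an assumption)."""
    exp_a = expected_score(ratings[a], ratings[b])
    new_a = ratings[a] + k_factor(games[a]) * (score_a - exp_a)
    new_b = ratings[b] + k_factor(games[b]) * ((1.0 - score_a) - (1.0 - exp_a))
    ratings[a], ratings[b] = new_a, new_b
    games[a] += 1
    games[b] += 1


# A newcomer (0 games) beats an established player (12 games), both rated 1000.
ratings = {"newcomer": 1000.0, "veteran": 1000.0}
games = {"newcomer": 0, "veteran": 12}
update(ratings, games, "newcomer", "veteran", 1.0)
print(ratings)  # {'newcomer': 1040.0, 'veteran': 992.0}
```

With per-player K the update is no longer zero-sum: the newcomer's rating swings by 40 points while the veteran's moves by only 8.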

On face value this is a better ranking than XP based on these initial observations:

EJ is no longer last. She only contributed to 1 project in the first cycle, and contributed as much as James did. If she was sick (instead of dropped out) for the next two weeks she would be ranked bottom based on XP which doesn't make sense.
Anasauce moved up three from third to last. She missed the first 2.5 days of LG so didn't gain as much XP, but her ranking shouldn't suffer from that. She held her own with Nico on instinctive-nyala and contributed 1.5 times Thomas on unusual-woodpecker.
Yaseen and Harman move down the ranking. They are high on XP but they were with each other on 2 of 3 cycles.
Jared, Mihai and Jrob rose to the top without considering any "hacked" XP value from the past.

Other observations

Nico, Phillip, Majid and John H. had the most dramatic shift in Pod rank when moving to Elo (of those who didn't miss any time)
Nico and Phillip jumped a bunch. They were both on teams of 5 and 4 respectively. Large teams means less XP to go around???
Moniarchy got stuck on really strong teams: Rachel and Jared twice! and Jared and Nico. This gave her a shit ranking since she has zero "wins". Elo did worse than XP here because of less diverse team formation (new shuffle in team formation should help with this, as would a margin-based Elo algorithm).
Rankings should get much better with each cycle (especially as teams are more diverse)
If rc/hour was within 10% of each other I considered it a tie. Changing this to 5% or 20% alters the rankings dramatically!
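
A tie band like the one described in the last observation can be sketched as follows. "Within 10% of each other" is read here as "the gap is at most 10% of the larger rc/hour", which is an assumption; the script in the gist may define it differently.

```python
def contest_score(rc_h_a: float, rc_h_b: float, tolerance: float = 0.10) -> float:
    """Score for A: 1.0 for a win, 0.5 for a tie, 0.0 for a loss.

    Assumes a tie means the gap is at most `tolerance` times the larger rate.
    """
    larger = max(rc_h_a, rc_h_b)
    if larger == 0 or abs(rc_h_a - rc_h_b) <= tolerance * larger:
        return 0.5
    return 1.0 if rc_h_a > rc_h_b else 0.0


print(contest_score(0.9, 0.92))                  # 0.5: inside the 10% band
print(contest_score(1.0, 0.93))                  # 0.5: a 7% gap ties at 10%
print(contest_score(1.0, 0.93, tolerance=0.05))  # 1.0: the same gap is a win at 5%
```

The last two calls show why the rankings are sensitive to the threshold: the same pair of rates flips between a tie and a win depending on the band.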

@shereefb commented on Tue Aug 02 2016

Running a margin-based ELO shows VERY different results

Name	Rating
Jared Grippe	1156
John Roberts	1128
Mihai Banulescu	1098
Devon Wesley	997
Rachel	990
John Hopkins	987
Nico	986
Ej	985
EthanJStark	985
James D Stewart	982
Phillip Lorenzo	980
Majid Rahimi	979
Aileen Santos	975
Shaka Lee	965
anasauce	956
Harman Singh	951
Yaseen Hussain	932
Syd Rothman	931
Thomas W. Smith	913
Moniarchy	913

game history

Player 1	Player 2	Result	P1 New Rating	P2 New Rating
Devon Wesley(1000)	Jared Grippe(1000)	0.19	975	1024
Devon Wesley(975)	Shaka Lee(1000)	0.53	979	995
Jared Grippe(1024)	Shaka Lee(995)	0.82	1046	972
Jared Grippe(1046)	Phillip Lorenzo(1000)	0.8	1064	981
Jared Grippe(1064)	Thomas W. Smith(1000)	0.83	1083	980
Phillip Lorenzo(981)	Thomas W. Smith(980)	0.55	985	975
Ej(1000)	James D Stewart(1000)	0.5	1000	1000
Ej(1000)	Jared Grippe(1083)	0.2	985	1097
James D Stewart(1000)	Jared Grippe(1097)	0.2	986	1110
Aileen Santos(1000)	John Roberts(1000)	0.25	979	1020
Aileen Santos(979)	Majid Rahimi(1000)	0.5	981	997
John Roberts(1020)	Majid Rahimi(997)	0.75	1037	979
Harman Singh(1000)	John Roberts(1037)	0.18	978	1058
Harman Singh(978)	Yaseen Hussain(1000)	0.47	978	999
John Roberts(1058)	Yaseen Hussain(999)	0.81	1075	981
Jared Grippe(1110)	Moniarchy(1000)	0.81	1122	987
Jared Grippe(1122)	Nico(1000)	0.81	1133	988
Moniarchy(987)	Nico(988)	0.51	988	986
anasauce(1000)	John Hopkins(1000)	0.34	986	1013
anasauce(986)	Mihai Banulescu(1000)	0.25	967	1018
anasauce(967)	Rachel(1000)	0.31	955	1011
John Hopkins(1013)	Mihai Banulescu(1018)	0.39	1004	1026
John Hopkins(1004)	Rachel(1011)	0.46	1001	1013
Mihai Banulescu(1026)	Rachel(1013)	0.57	1030	1008
EthanJStark(1000)	John Roberts(1075)	0.25	988	1086
EthanJStark(988)	Syd Rothman(1000)	0.61	998	989
John Roberts(1086)	Syd Rothman(989)	0.83	1101	973
Jared Grippe(1133)	Nico(986)	0.84	1144	974
Jared Grippe(1144)	Phillip Lorenzo(985)	0.85	1146	974
Jared Grippe(1146)	Syd Rothman(973)	0.9	1148	959
Jared Grippe(1148)	Yaseen Hussain(981)	0.91	1150	966
Nico(974)	Phillip Lorenzo(974)	0.51	974	973
Nico(974)	Syd Rothman(959)	0.63	982	950
Nico(982)	Yaseen Hussain(966)	0.65	991	956
Phillip Lorenzo(973)	Syd Rothman(950)	0.62	979	943
Phillip Lorenzo(979)	Yaseen Hussain(956)	0.64	987	947
Syd Rothman(943)	Yaseen Hussain(947)	0.52	944	945
Devon Wesley(979)	James D Stewart(986)	0.65	991	973
Devon Wesley(991)	Jared Grippe(1150)	0.23	986	1150
Devon Wesley(986)	Moniarchy(988)	0.78	1008	965
Devon Wesley(1008)	Rachel(1008)	0.56	1013	1002
James D Stewart(973)	Jared Grippe(1150)	0.14	962	1152
James D Stewart(962)	Moniarchy(965)	0.65	974	952
James D Stewart(974)	Rachel(1002)	0.41	970	1005
Jared Grippe(1152)	Moniarchy(952)	0.92	1154	939
Jared Grippe(1154)	Rachel(1005)	0.81	1155	996
Moniarchy(939)	Rachel(996)	0.27	927	1007
EthanJStark(998)	Jared Grippe(1155)	0.24	994	1155
EthanJStark(994)	Majid Rahimi(979)	0.52	994	978
Jared Grippe(1155)	Majid Rahimi(978)	0.77	1155	975
Aileen Santos(981)	Mihai Banulescu(1030)	0.31	971	1039
Aileen Santos(971)	Shaka Lee(972)	0.51	971	971
Mihai Banulescu(1039)	Shaka Lee(971)	0.7	1047	962
Harman Singh(978)	John Hopkins(1001)	0.37	970	1008
Harman Singh(970)	John Roberts(1101)	0.24	963	1107
John Hopkins(1008)	John Roberts(1107)	0.35	1007	1107
anasauce(955)	Mihai Banulescu(1047)	0.23	943	1058
anasauce(943)	Thomas W. Smith(975)	0.63	956	961
Mihai Banulescu(1058)	Thomas W. Smith(961)	0.85	1075	943
Harman Singh(963)	John Roberts(1107)	0.15	950	1119
Harman Singh(950)	Yaseen Hussain(945)	0.52	951	943
John Roberts(1119)	Yaseen Hussain(943)	0.86	1121	932
John Roberts(1121)	Syd Rothman(944)	0.89	1123	931
anasauce(956)	Jared Grippe(1155)	0.22	954	1155
anasauce(954)	Nico(991)	0.47	955	989
anasauce(955)	Phillip Lorenzo(987)	0.48	956	985
Jared Grippe(1155)	Nico(989)	0.76	1155	986
Jared Grippe(1155)	Phillip Lorenzo(985)	0.77	1155	981
Nico(986)	Phillip Lorenzo(981)	0.51	986	980
James D Stewart(970)	Mihai Banulescu(1075)	0.31	966	1078
James D Stewart(966)	Thomas W. Smith(943)	0.74	982	926
Mihai Banulescu(1078)	Thomas W. Smith(926)	0.86	1090	913
Devon Wesley(1013)	EthanJStark(994)	0.51	1011	995
Devon Wesley(1011)	Majid Rahimi(975)	0.48	1005	980
Devon Wesley(1005)	Mihai Banulescu(1090)	0.29	997	1097
EthanJStark(995)	Majid Rahimi(980)	0.48	991	983
EthanJStark(991)	Mihai Banulescu(1097)	0.28	985	1098
Majid Rahimi(983)	Mihai Banulescu(1098)	0.3	979	1098
Aileen Santos(971)	Jared Grippe(1155)	0.24	969	1155
Aileen Santos(969)	John Hopkins(1007)	0.53	976	999
Aileen Santos(976)	Shaka Lee(962)	0.51	975	962
Jared Grippe(1155)	John Hopkins(999)	0.78	1156	993
Jared Grippe(1156)	Shaka Lee(962)	0.77	1156	960
John Hopkins(993)	Shaka Lee(960)	0.48	987	965
John Roberts(1123)	Moniarchy(927)	0.91	1125	914
John Roberts(1125)	Rachel(1007)	0.86	1128	990
Moniarchy(914)	Rachel(990)	0.38	913	990
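
As a rough reading aid for the game history: the first row is consistent with a standard Elo update in which the Result column is used as Player 1's fractional score. The sketch below assumes K = 80 for both players and truncation to whole points; both are guesses, and the actual script may differ. Because the Result column is rounded to two decimals, other rows may be off by a point when recomputed this way.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def fractional_update(r1: float, r2: float, result: float,
                      k1: float = 80.0, k2: float = 80.0):
    """Elo update where `result` is Player 1's fractional score in [0, 1]."""
    exp1 = expected_score(r1, r2)
    new1 = r1 + k1 * (result - exp1)
    new2 = r2 + k2 * ((1.0 - result) - (1.0 - exp1))
    return int(new1), int(new2)  # truncation is an assumption


# First row: Devon Wesley(1000) vs. Jared Grippe(1000), result 0.19.
print(fractional_update(1000, 1000, 0.19))  # (975, 1024)
```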

@tannerwelsh commented on Tue Aug 02 2016

Really interesting stuff here @shereefb, thanks for putting it together!

Side-note: please don't say "relative contribution" when you really mean "contribution" :)

@jeffreywescott commented on Tue Aug 02 2016

This seems far superior to how we've been using XP. OND, etc.

@shereefb commented on Tue Aug 02 2016

@tannerwelsh what's the difference between relative contribution and contribution? I've been using them interchangeably. What am I missing?

@shereefb commented on Tue Aug 02 2016

Guessing at Jared's, Jrobs, and Mihai's initial rating (setting them at 1500,1500,1400) gets us slightly better results. As players lose less rating points because we 'thought' really advanced players were level 1000 to start with.

K factor = 200 for first 20 games then moves to 16

Player	Elo Rating
John Roberts	1319
Jared Grippe	1287
Mihai Banulescu	1222
Devon Wesley	1079
Majid Rahimi	1076
John Hopkins	1066
Aileen Santos	1066
EthanJStark	1065
Shaka Lee	1064
Nico	1061
James D Stewart	1061
Rachel	1056
Phillip Lorenzo	1052
anasauce	1026
Ej	1021
Harman Singh	1005
Yaseen Hussain	976
Syd Rothman	964
Moniarchy	946
Thomas W. Smith	928

tannerwelsh commented 8 years ago

First attempt at integration: https://github.com/LearnersGuild/game-prototype/pull/40

Not yet reflected in Playbook, so this is not ready for review

tannerwelsh commented 8 years ago

Moved to review for @shereefb and @LearnersGuild/los

tannerwelsh commented 8 years ago

(see stat description in https://github.com/LearnersGuild/playbook/pull/48)

jeffreywescott commented 8 years ago

@shereefb -- it seems #13 has a hard dependency on this, so this will need to be tagged as RFI at the same time as #13, no?

shereefb commented 8 years ago

@jeffreywescott yep. #13 definitely depends on this.

shereefb commented 8 years ago

/cc @LearnersGuild/software this is RFI

We don't yet have the exact K-factor, but are tracking that issue here: #55

bundacia commented 8 years ago

@shereefb awesome. With @jeffreywescott out I'm not sure exactly how we want to get these things onto the implementation board. My preference would be for game mechanics to create a new ticket there and include the specs in the description (likely cut & pasted from the game mechanics issue). The description of this issue contains a long comment thread, so it's not exactly clear what the end result is. WDYT?

jeffreywescott commented 8 years ago

I suck at vacation.

I am working a bit on Monday and could move an issue or two then.

bundacia commented 8 years ago

@jeffreywescott we need to figure this out either way. Go back to vacation!

tannerwelsh commented 8 years ago

@bundacia would you be willing to fill the Prod. Dev. Flow role while Jeffrey's on vacay? I.e. you'd just do the moving of the cards from RFI?

jeffreywescott commented 8 years ago

I intend to do this work while on vacay, FWIW, but not every day, more like every week.

If things need a quicker turnaround, then it would be best for someone else to take on.

LearnersGuild / game-prototype

Players have an Elo-based ranking #9