@tannerwelsh commented on Wed Sep 21 2016
Strategic Goals this issue impacts
Learner stats match our subjective reality, elicit honest feedback, and promote self-directed, collaborative growth
Game is balanced
Game feedback loops are all closed
Benefits
What are the benefits of this change, and whom do they impact?
Incentivizes honest reviews - downward pressure on completeness/quality (better trust in system, benefits everyone)
Helps all learners develop a sense of their own estimation tendencies (feedback++)
Description
Changes
Introduce two new review stats: Review Accuracy and Review Bias for each review dimension: quality and completeness.
Note that both rely upon the notion of internal and external reviews. An internal review is a project review completed by a member of the project, whereas external reviews are those completed by players who were not on the project team.
For the sake of simplicity and incremental changes, the first iteration of review stats will mimic how estimation accuracy/bias are calculated: compare the learner's score to the average of their teammates' scores.
Review Accuracy
Review accuracy is a measure of how close a player's internal reviews are to the mean of their teammates' reviews. There is a review accuracy stat for both quality and completeness.
It is calculated by running the following algorithm:
for each internal review of player:
    find mean of all internal completeness||quality reviews for project (excluding player's own review)
    set result to avgTeamReview
    subtract avgTeamReview from player completeness||quality review
    set result to projectReviewBias
    subtract absolute value of projectReviewBias from 100
    set result to projectReviewAccuracy
find mean of all projectReviewAccuracy scores
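For illustration, here is a minimal sketch of the accuracy calculation. TypeScript and the helper names (`projectReviewDelta`, `reviewAccuracy`) are assumptions for this sketch, not existing game code:

```typescript
// Signed difference between a player's internal review score and the mean of
// their teammates' internal scores, for one project and one dimension
// (completeness or quality). Scores are assumed to be percentages (0-100).
function projectReviewDelta(playerScore: number, teammateScores: number[]): number {
  const avgTeamReview =
    teammateScores.reduce((sum, s) => sum + s, 0) / teammateScores.length;
  return playerScore - avgTeamReview;
}

// Review Accuracy: mean of (100 - |delta|) across all of the player's projects.
function reviewAccuracy(deltas: number[]): number {
  const perProject = deltas.map(d => 100 - Math.abs(d));
  return perProject.reduce((sum, a) => sum + a, 0) / perProject.length;
}
```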
Review Bias
Review bias is a measure of whether a player tends towards under- or over-estimating the completeness and quality of their projects when compared to the mean of their teammates' reviews. It can be positive or negative. There is a review bias stat for both quality and completeness.
It is calculated by running the following algorithm:
for each internal review of player:
    find mean of all internal completeness||quality reviews for project (excluding player's own review)
    set result to avgTeamReview
    subtract avgTeamReview from player completeness||quality review
    set result to projectReviewBias
find mean of all projectReviewBias scores
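Continuing the sketch above, bias is just the signed mean of the same per-project deltas (again a hypothetical sketch, not the game's implementation):

```typescript
// Review Bias: signed mean of the per-project deltas. A positive value means
// the player tends to rate their own projects higher than their teammates do.
function reviewBias(deltas: number[]): number {
  return deltas.reduce((sum, d) => sum + d, 0) / deltas.length;
}
```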
Tasks / Sub-issues
[ ] Add stats to playbook
[ ] Compute stats in game
[ ] Surface stats via /report/playerStats CSV export in game
Context
Right now there is no "downward pressure" on project reviews. Players are incentivized to inflate their own reviews, but there is no counterbalancing measure. As a learner, I also do not know if my reviews tend to over- or under-estimate, when compared with my teammates.
Note: this current model assumes a "wisdom of the crowds" approach, whereby the mean review score is taken as the "most correct" review. There are still many problems with this approach, but it is good enough for an MVP to help balance the game.
@tannerwelsh commented on Fri Oct 21 2016
ready for review @LearnersGuild/software
@bundacia commented on Mon Oct 24 2016
1. This is the biggest thing that jumps out at me here, and I'm not sure I agree that it's ok as an MVP. I think the bias stat makes a lot of sense, but having a reviewAccuracy stat that is really just a measure of how much I agree with everyone else seems risky. Maybe too risky for an MVP. It can be gamed by collusion. It could encourage external reviewers to discuss their opinions before submitting reviews, which could cause players to be less honest in favor of going with the crowd for fear their own stats would suffer. Are we making forward progress if we're fixing one gameable stat with another?
2. Should we split out my accuracy in rating my own work from my accuracy in rating other projects? If I rate 4-5 other projects every week, those ratings will contribute a lot more to my reviewAccuracy than my own self-reviews. That being the case, I could still inflate my own scores and "hide" it by reviewing lots of other projects as accurately as I can.
3. Same thing with the bias stat. It seems very likely that my bias could be in one direction for my own work, and in another direction for other teams' work.
@tannerwelsh commented on Tue Oct 25 2016
Good points @bundacia. I think these really should focus on how well internal reviews compare to external reviews, excluding external-to-external comparisons.
Will update description...
@tannerwelsh commented on Tue Nov 01 2016
Moving back to mechanics after feedback from @shereefb
@tannerwelsh commented on Wed Nov 02 2016
Ok, after discussing with @shereefb we decided to take a smaller, more incremental step. Updated the description so that we are only comparing between internal reviews. Pretty much the same calculation as with Estimation Accuracy & Bias.
The reason for this is threefold:
Removes the dependency on external reviews, so we can calculate these scores when all project retros are complete. Otherwise we're back in the 💩 we were in when XP depended on external project reviews.
Makes the stat easier to understand for learners: learners' reviews are only evaluated against their teammates', not other external players'.
Keeps some pressure for integrity/honesty
The drawbacks are obvious: as @bundacia pointed out, it leaves room for collusion. As do many other stats. At this point, I'm of the opinion that we should not try to bake anti-collusion measures into our stats if that means the stats become excessively complex.
In other words, I'm using a strategy of "simpler/more understandable stats over defending against collusion". Hmm, maybe that should be a circle strategy... [goes to Glassfrog & Asana]
seems like "Review Accuracy" isn't telling me anything that Review Bias doesn't tell me. Review bias tells me how far off I am and in what direction, Review Accuracy just tells me how far off I am. Do we really need both? Seems like every stat we add cost us a little bit in how much the learner needs to understand, how much stace we need to display the stats in any UI, etc. Just want to make sure we're really getting value out of both of these.
@bundacia commented on Thu Nov 03 2016
My prediction is that this stat will incentivize people to talk with their team to get a sense of how complete/correct the project is before filling out the retro. We'll pretty quickly see everyone's "Accuracy" get pretty close to perfect. I'm not really sure how this achieves this benefit:
> Incentivizes honest reviews - downward pressure on completeness/quality (better trust in system, benefits everyone)
Does this really create downward pressure? We're only comparing your review with other people with the exact same potential for upward bias.
It seems like we've made a decision that we cannot afford to wait until all reviews are completed to calculate stats, and I think that's the wrong way to go. I know we want quick feedback loops, but I don't think learners' learning is going to be greatly impaired by having to wait a few extra hours (or less?) for their stats. Especially if the price of a quicker response is a bunch of stats that are less valuable. Can't we just wait for all the reviews to come in and then compute stats that compare internal reviews with external reviews, so that each player can see their bias when rating their own work and their bias when rating other players' work?
@tannerwelsh commented on Fri Nov 04 2016
Thanks for the extra feedback @bundacia. Replies:
> Do we really need both?
Yes, I think we do. If we just had bias, it could obscure useful feedback. For example, if I'm off by -25% half the time and +25% the other half, then my bias is 0%. Without accuracy it would seem that I'm all good, hiding the other side of the situation: that I'm consistently inaccurate, just in different directions.
> My prediction is that this stat will incentivize people to talk with their team to get a sense of how complete/correct the project is before filling out the retro.
Probably true. As mentioned, we're not attempting to solve the temptation towards collusion in this issue. That's a much larger issue, with no easy fixes.
There's another way to think of this though: the reward:cost ratio for honesty needs to be larger than the reward:cost ratio for collusion/dishonesty. In my mind, reviewing my own project honestly has a higher reward:cost ratio than not because:
It doesn't impact my red zone or how I get assigned to teams (i.e. it's consequence-free)
Cognitive dissonance sucks (I don't want to lie if I don't have to)
Commenting on the completeness & quality has a lower social cost (I'm not evaluating a person, but a thing)
I get feedback on my evaluation skills, which can lead to learning
> It seems like we've made a decision that we cannot afford to wait until all reviews are completed to calculate stats and I think that's the wrong way to go. I know we want quick feedback loops, but I don't think learners' learning is going to be greatly impaired by having to wait a few extra hours (or less?) for their stats.
It's less a matter of keeping feedback loops short (although I want to do that too) than it is avoiding over-complexifying the system. Right now our system is complex, overly interdependent, and incredibly hard to reason about.
Eventually, yes I think that external reviews will certainly need to play a role. But we are not there yet. Our system can't sustain that level of complexity and randomness at the moment, and it creates a sub-par experience for learners.
@bundacia commented on Fri Nov 04 2016
> ... if we just had bias, it could obscure useful feedback. For example, if I'm off by -25% half the time and +25% the other half, then my bias is 0%. Without accuracy it would seem that I'm all good, hiding the other side of the situation: that I'm consistently inaccurate, just in different directions.
Maybe the way the accuracy score gets calculated is wrong in the description then. Right now it says:
> subtract absolute value of projectReviewBias from 100
So if your bias is 0% (because you're +25% half the time and -25% half the time), your accuracy will be 100%.
> It's less a matter of keeping feedback loops short (although I want to do that too) than it is avoiding over-complexifying the system.
I'm not sure I agree that calculating stats based on all of the responses instead of just some of them is more complex. It's certainly not more complex from a code standpoint (my guess is it would be simpler). And it seems like "these stats are calculated based on all of the reviews" is simpler to understand than "these stats are calculated as soon as 'enough' reviews have been submitted". Maybe there's some type of complexity I'm not aware of in play here.
@jrob8577 commented on Fri Nov 04 2016
Just wanted to weigh in on the feedback loop - don't we use that to determine if folks are able to vote for different levels of projects (specifically the apprentice level/OSS projects, and eventually the LOS work)? Given that it's needed as an input to voting, if we don't have the correct stats by voting time, it feels like we should dispense with the voting constraints in other areas of the game?
@tannerwelsh commented on Tue Nov 08 2016
re: @bundacia
The accuracy calculation looks off, but it actually works out, because the subtraction happens at the project level, not the aggregate level. So you find the absolute value of the bias for each project, subtract it from 100 for each project to get project-level accuracy, and then take the average. Since each project's accuracy will likely be below 100, the average will almost always be below 100. The only way to get perfect accuracy is to have 0% bias every single time.
> I'm not sure I agree that calculating stats based on all of the responses instead of just some of them is more complex.
In a world where all stat-generation events were independent and asynchronous, you're right. However, the complexity comes into play when you consider that there is a distinct difference between stats that are generated internally to a project vs. ones that are external.
In other words, if we say that a project can be "ready for stat generation" after every team member has completed their retro and internal review, then we have a very clear and transparent requirement for stat generation, and clear events to point to. This makes it simple for teams to control when their project is complete: everything is internal.
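For illustration, the readiness check then stays trivial, something like this sketch (hypothetical types and field names, not our actual data model):

```typescript
// Hypothetical data shapes -- not the game's actual model.
interface TeamMember {
  retroComplete: boolean;
  internalReviewComplete: boolean;
}

interface Project {
  team: TeamMember[];
}

// A project is "ready for stat generation" once every team member has
// submitted both their retro and their internal project review.
function isStatReady(project: Project): boolean {
  return project.team.every(m => m.retroComplete && m.internalReviewComplete);
}
```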
If, however, the requirement for stat generation for a project includes external factors, then we enter into a very difficult territory. We have to decide: how many external reviews are required for a project to be "stat ready"? What happens if those reviews don't get submitted?
Now we have to require that every project receive X external reviews before it can be considered stat ready. Everyone on the project is left hanging until some number of randos complete their reviews, and there is no way for them to take control of the situation other than nagging people on chat to review their project so that they can earn their stats.
@tannerwelsh commented on Tue Nov 08 2016
Realizing with the feedback from @bundacia and @jrob8577 that we need a better model of the timeline and state dependencies of our game: how it is now, and how we'd like it to be. Gonna see what I can come up with.
@bundacia commented on Tue Nov 08 2016
@tanner, that makes sense about the dependency on external reviewers. I guess I was still thinking in terms of cycles, where we could just say that you have until the cycle ends to review any project that was completed that cycle. Then we compute all stats at cycle end. Is there some reason that won't work? Are we planning on being even more decoupled from cycles than that?
If it won't work, I guess we could calculate stats once the internal reviews are submitted, then recalculate the affected stats each time a new external review is submitted. An external review is just an event that affects your stats. That way we could use them in calculating stats without having to worry about them being ready at any particular time.
I guess this all depends on how the timeline and state dependencies will work.
I still think we're on different pages with the accuracy stat though. Is there a case where my bias stat could be +13% and my accuracy would not be 87%?
@tannerwelsh commented on Tue Nov 08 2016
> If it won't work, I guess we could calculate stats once the internal reviews are submitted, then recalculate the affected stats each time a new external review is submitted.
That's how I'd like it to work.
> I still think we're on different pages with the accuracy stat though. Is there a case where my bias stat could be +13% and my accuracy would not be 87%?
Say you have three projects:
bias: +13% | accuracy: 87% (100 - abs(13))
bias: +39% | accuracy: 61% (100 - abs(39))
bias: -13% | accuracy: 87% (100 - abs(-13))
Your aggregate/average bias across all three is +13%, but the average of all accuracies is 78.33%.
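For a quick sanity check, plugging those three per-project deltas into the sketch functions from the description (again hypothetical code, not part of the game) reproduces both numbers:

```typescript
const deltas = [13, 39, -13]; // per-project biases from the example above

console.log(reviewBias(deltas));     // (13 + 39 - 13) / 3 = 13    -> +13% aggregate bias
console.log(reviewAccuracy(deltas)); // (87 + 61 + 87) / 3 = 78.33...% aggregate accuracy
```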
@bundacia commented on Wed Nov 09 2016
@tannerwelsh: Ah! I get it now! Thanks for hanging in there with me.
@tannerwelsh commented on Thu Nov 10 2016
RFI @jeffreywescott