codewars / codewars.com

Issue tracker for Codewars
https://www.codewars.com
BSD 2-Clause "Simplified" License

A detailed criticism of the satisfaction system #1166

Open Voileexperiments opened 6 years ago

Voileexperiments commented 6 years ago

This is a response to #1165, and a detailed tangent (maybe?) to #997 that I feel would benefit from standing on its own.


I've tried to talk about this many times before, but I don't think jhoffner has understood my point yet:

Setting the expectation to be "everyone will vote very satisfied unless otherwise" is not a realistic measure at all. The way the current beta process and satisfaction metric are set up, a kata starts from 0 points deducted for 100% satisfied, and every "somewhat satisfied" and "not satisfied" vote deducts 1 and 2 points respectively. This is not how rating systems work literally anywhere else on the internet:

https://xkcd.com/1098/

From the distributions of 5-star ratings I've seen in various places, when you have something good the distribution isn't everyone voting 5: usually it's an exponential curve that peaks at 5 and tails off toward 1. (Of course, there's another such exponential curve pointing in the other direction. For good stuff that part is negligible.)

So even if a kata is, as jhoffner usually emphasizes, "high quality", it's not gonna get a pure 100%, i.e. 5/5; it'd be more like 4.2/5. And this will get even lower as you throw more people into the rating system (I'm talking about at least a thousand votes, so blue/purple katas are out of the question). Just look at how many white/yellow katas get approved and then converge to something like 80%. If you look at every single most-completed kata, they've all converged to ~85%. (And the entry exam kata, Multiply, has just 75%.) Under the dumb 90% requirement they wouldn't even have been approved at all.

And I (and maybe some other people too) have lots of problems with it, because while nobody wants poor quality katas to get approved, we still want katas to be approved. What is high quality is subjective, but what is not hot garbage is mostly objective. We don't want a bunch of non-hot-garbage beta katas staying beta just because they're sitting at something like 4.3/5, while the dumb approval requirement demands a 90% satisfaction rating (i.e. 4.6/5).

(This also doesn't mention another elephant in the room: the beta process usually gets its first participants from the power users, so the initial statistics kinda match what jhoffner expects, but not for the reason he expects. It doesn't work in general. Basically, the beta process somehow works by some amount of luck.)


In the end, we probably should separate "katas that are hot garbage" from "katas that are not hot garbage but not stellar". We can all agree that we should not approve hot garbage katas and we should approve stellar katas, but we still have tons of katas in between (mostly in the white/yellow range), and the various discussions in the past don't address these katas at all. The satisfaction system doesn't help either; instead it just leaves these katas dangling on the boundary between approving and not approving (which is arguably worse: now it depends on whether a "somewhat satisfied" vote randomly shows up, or doesn't). We need a system that focuses on and addresses these katas currently standing on the verge.


jhoffner commented 6 years ago

I hear you and I don't argue that the system needs to be improved, however we need to go based off of data, determine what the key issues are, and provide solutions.

So, what is the problem, defined concisely and backed up with data, and what is a proposed solution? That would be the ideal format to begin this discussion.

For now I will try to cover what I can from what you wrote.

it is set up as 0 points for 100% satisfied

It is currently 0% for 0 votes. If the first vote was a not-satisfied, it would be 0% of 1, if somewhat it would be 50% of 1, if very satisfied it would be 100% of 1.

while every somewhat satisfied and not satisfied deducts 1 and 2 points

There are no points and nothing is being deducted. We count the number of positive votes and divide them by the total number of votes. If 4 people voted "very satisfied" and one voted "somewhat", it would be 4.5 / 5 = 90%. If 4 people voted "very satisfied" and one voted "very unsatisfied" it would be 4 / 5 = 80%.
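The scoring rule described above can be sketched in a few lines. This is just my reading of the comment (very satisfied = 1, somewhat = 0.5, not satisfied = 0, divided by total votes), not Codewars' actual implementation:

```python
def satisfaction(very: int, somewhat: int, not_satisfied: int) -> float:
    """Score as described: positive vote weight divided by total votes.

    Assumed weights: very satisfied = 1, somewhat = 0.5, not satisfied = 0.
    """
    total = very + somewhat + not_satisfied
    if total == 0:
        return 0.0  # "0% for 0 votes", per the comment above
    return (very + 0.5 * somewhat) / total

print(satisfaction(4, 1, 0))  # 4.5 / 5 = 0.9 -> 90%
print(satisfaction(4, 0, 1))  # 4 / 5 = 0.8 -> 80%
```

Both example results match the two cases given in the comment.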

And I (and maybe some other people too) have lots of problems about it because while nobody wants poor quality katas to get approved, we still want katas to be approved.

I just did a check, out of ~2k kata that are currently in the "needs testing" state, only 9% of them currently have reached the minimum vote count and have no open issues, meaning the only thing left holding them back is their low satisfaction score. I would love to see examples, actual kata, that have been held back by the satisfaction score benchmark so that we can start to break down the issues of what happened. As the data stands now, 9% doesn't scream complete bottleneck and of that 9%, it still has to be determined how many of them are actually good kata that got grouped into the poor ones.

Of course, that 9% is just those held back by satisfaction alone. There may be other kata that have open issues but also stand little chance of ever being redeemed due to a low satisfaction score.

If you look at every single kata that's most completed, they've all converged to ~85%. (And the entry exam kata, Multiply, has just 75%.) By the dumb 90% requirement they wouldn't even have been approved at all.

There are a number of factors going on here which makes these kata completely irrelevant for comparison.

"Multiply" is even worse of an example, as it is only meant for signing up. If it was a kata being introduced today, it should hopefully be much lower than a 75% as it isn't an actual good kata for once you are passed signing up. By normal CW standards, it's steaming hot crap (but works well for the landing page format).

In the end, we probably should separate "katas that are hot garbage" from "katas that are not hot garbage but not stellar".

I think lowering the bar a little bit makes sense. 90% may be too high for white kata. It is pretty hard to get back to 90% once you get 1 negative vote, and it's a very small margin of error.
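To put a number on how hard recovery is, here's a quick calculation under the weighting described earlier in the thread (an assumption on my part: somewhat = 0.5, not satisfied = 0): how many consecutive "very satisfied" votes it takes after a single negative vote to reach a given threshold.

```python
def votes_to_reach(threshold: float, penalty_score: float) -> int:
    """Clean "very satisfied" votes needed after one vote worth
    `penalty_score` (0 for not-satisfied, 0.5 for somewhat) to reach
    `threshold`. With n total votes the score is (n - 1 + penalty_score) / n.
    """
    n = 1
    while (n - 1 + penalty_score) / n < threshold:
        n += 1
    return n - 1  # votes needed on top of the one negative

print(votes_to_reach(0.90, 0.5))  # 4: one somewhat + 4 very = 4.5/5 = 90%
print(votes_to_reach(0.90, 0.0))  # 9: one not-satisfied + 9 very = 9/10 = 90%
```

So a single "not satisfied" on a fresh kata needs nine flawless votes in a row just to scrape back to the 90% line, which illustrates the small margin of error.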


A list of some issues with the current system

😢 Once a kata has a very poor initial reception, it's basically impossible to get its rating back up.

Movie ratings make sense because once the movie is released it's done. Kata can be iterated on and changed. I think the fix here is to simply restart the beta process for the kata. If it has a poor rating, then it probably should be rethought and a v2.0 should be released. How far the system goes to facilitate this process vs. just having someone unpublish and copy their challenge to a new one is debatable, but doing it manually is a workaround that works for now.

😢 The rating system does not take into account bad actors

Bad actors being those who tend to downvote the same user, who are new to the site, never complete a kata they downvote, etc.

😢 Rating characteristics changes post-approval, due to a larger audience now rating the kata.

The system may need to start distinguishing pre/post-approval ratings and tracking this information. Combined with other characteristics (such as rank level, power-user status, etc.), the rating algorithm could then be updated to consider these factors.

😢 The margin of error for reaching the required satisfaction score, especially for white kata, is too great.

I would like to run some numbers first, maybe look at some kata that are good but didn't make the cut, to try to find a good range, instead of just arbitrarily selecting one (as was done initially, due to lack of data). With that said, moving it down to at least 85%, maybe 80%, probably makes sense.


Side note, it would be very interesting to see everyone with an opinion on this to create their own kata with their own requirements & expected algorithm for calculating the score - as an experiment in collaboration.

docgunthrop commented 6 years ago

It is currently 0% for 0 votes. If the first vote was a not-satisfied, it would be 0% of 1, if somewhat it would be 50% of 1, if very satisfied it would be 100% of 1.

This is certainly not the case. As I mentioned in #1165 , some random troll voted on all my published katas in one sweep with a "somewhat satisfied" rating. For my kata at https://www.codewars.com/kata/spidey-swings-across-town/javascript it's now currently rated 0% of 1 with a "somewhat satisfied" vote.

It was mentioned (though I hadn't personally confirmed it) that a user needs to at least make a requisite number of attempts on a kata (which the user did not do) before they can even give a satisfaction rating. If this isn't the case, then perhaps it should be?

Voileexperiments commented 6 years ago

@jhoffner

It is currently 0% for 0 votes. If the first vote was a not-satisfied, it would be 0% of 1, if somewhat it would be 50% of 1, if very satisfied it would be 100% of 1. There are no points and nothing is being deducted. We count the number of positive votes and divide them by the total number of votes. If 4 people voted "very satisfied" and one voted "somewhat", it would be 4.5 / 5 = 90%. If 4 people voted "very satisfied" and one voted "very unsatisfied" it would be 4 / 5 = 80%.

Yes, but my point is that that's how the system practically works. Re-labeling/re-leveling doesn't change how the underlying system behaves. That's like saying turning a 5/10-star system into a like-dislike system will drastically change the distribution.

As the data stands now, 9% doesn't scream complete bottleneck and of that 9%, it still has to be determined how many of them are actually good kata that got grouped into the poor ones.

We still haven't retired any of the worst/duplicate beta katas that should've been retired a long time ago, so the denominator is much, much larger than it should be (not to mention lots of bogus issues; I've probably resolved the most issues on CW, except maybe compared to people like g964).


But yes, IMO 90% is not the best cutoff (it translates to "please don't get a single somewhat satisfied out of a 12-solver streak").

Voileexperiments commented 6 years ago

Also, @docgunthrop, don't forget that unlike approved katas, you can vote a satisfaction rating on a beta kata once you've merely attempted the actual tests or forfeited. It's not unknown for people to get too frustrated with a very difficult kata and then leave a bad vote after struggling a long time or forfeiting to see the solution.

I personally can't see what allowing beta katas to be voted on before a solve achieves - you still have ranking feedback as an approval criterion anyway, and letting people vote before solving a kata is just not a good idea. Everything is stupidly hard and impossible if you don't know how to solve it, after all.

docgunthrop commented 6 years ago

We're on the same page, @Voileexperiments . Allowing satisfaction and ranking feedback after completion of a kata seems to make more sense. As it currently stands, it's terribly easy for any newbie to give negative feedback out of frustration from not being able to solve a kata, as well as for any troll to indiscriminately give negative ratings to katas just because they can.

If a user has an issue with a kata, they should use the Discourse section to ask questions or seek answers; at least there they have a chance to communicate with the author and other users, whereas a hit-and-run downvote accomplishes nothing. And though I've seen users complain about the most inane things in the Discourse sections of many katas, at least that doesn't stall progress on approval for a decent one (as issues can be resolved).

anter69 commented 6 years ago

My 2 cents:

These suggestions mainly concern harder katas (6 and below) that accumulate votes rather slowly, but could/should be used in general.

@jhoffner: probably you could run some numbers to see what would happen if you remove votes from users that raised an issue and the issue was resolved since

anter69 commented 6 years ago

Related issue: #1138

ghost commented 6 years ago

I just did a check, out of ~2k kata that are currently in the "needs testing" state, only 9% of them currently have reached the minimum vote count and have no open issues, meaning the only thing left holding them back is their low satisfaction score.

9% doesn't scream complete bottleneck

@jhoffner I don't think this is a good estimate of whether the satisfaction score is causing a bottleneck. Neither the numerator nor the denominator is a useful estimate for @Voileexperiments's stated concern, which I would reword as: "it's hard for good kata to get out of beta, and the satisfaction rating isn't helping". Instead of (total kata held back due to satisfaction rating) / (total needs-testing), you need (total good kata held back due to satisfaction) / (total good kata that need testing). Since the total number of kata held back due to satisfaction is a significant number (~200, vs. 5), you're way out on a limb using that statistic to estimate the bottleneck for good kata, and you just end up making assumptions, without data, about whether there would be a higher, lower, or the same proportion of good kata in the numerator vs. the denominator.
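To make the base-rate point concrete, here's a toy calculation with hypothetical numbers (NOT real Codewars data; only the 2000 and ~9% figures come from jhoffner's comment) showing how a low overall rate can coexist with a high rate among good kata:

```python
# Hypothetical split, for illustration only.
needs_testing = 2000
held_back = 180        # ~9% overall, per jhoffner's check
good_total = 300       # suppose only 300 of the 2000 are good kata
good_held_back = 120   # and suppose most of the blocked kata are good ones

overall_rate = held_back / needs_testing  # 0.09 -> "no bottleneck"?
good_rate = good_held_back / good_total   # 0.40 -> a severe bottleneck
print(f"overall: {overall_rate:.0%}, among good kata: {good_rate:.0%}")
# prints: overall: 9%, among good kata: 40%
```

The overall 9% tells you nothing about the 40% until you can classify which kata in both the numerator and the denominator are good, which is exactly the measurement problem described below.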

ghost commented 6 years ago

@jhoffner, further, I know you want data to support the concern, but good data here are very expensive. It's a problem of measurement. Again, to evaluate @Voileexperiments's stated concern, "it's hard for good kata to get out of beta and the satisfaction rating isn't helping", you need to identify good beta kata. To do that, you need a valid measure of goodness, which you don't have. I'm not even sure you have a consensus construct of goodness. But it seems clear that the users who contribute most to Codewars believe there is a problem with the beta process (though they may not all agree on what it is). You're either going to have to accept a cheaper form of data (issue posts and user opinion) or create a measurement that can produce the kind of data you're looking for.

ghost commented 6 years ago

fwiw, here's my full statement of the problem and a proposed solution

You're not measuring either satisfaction or quality

Having spent some time working in psychometrics, I'm not surprised that you've run into a problem porting a system that measures quality (no issues, minor issues, major issues) to a system that measures satisfaction. These are entirely different constructs, though they can be associated. In addition, you have a punitive satisfaction measurement system. As a rule, satisfaction measurements that are used to produce something other than general knowledge of satisfaction (i.e., that have side effects) are imprecise and inaccurate measures of the actual construct of user satisfaction. The problem is amplified if the side effects are punitive. Codewars appears to fall in line with most cases. Based on discussions here, it doesn't seem to be able to discriminate well (not a lot of spread in the measurement), and people vote based on what they hope the result of their vote will be (kata author will gain or lose honor, kata will or won't get out of beta, etc. etc), rather than based on whether they are satisfied. Because of these problems, some people are concerned that people are downvoting too much or for improper reasons, and some people are concerned that people are not downvoting enough. This is entirely predictable.

How to fix the satisfaction measurement

Satisfaction is a useful construct to measure, and it sounds like it's something @jhoffner really values. It seems like you would like to use your measurement to allow users to find katas they're more likely to enjoy. To do this well, the vote needs to be anonymous (check), reward the user for voting (check), not be required (check), and not have side effects (...). If you really want to know how satisfied users are, you should ONLY use that measurement to help the average user find kata s/he's more likely to enjoy. Do not give the author a token or real reward for satisfied votes, certainly don't remove a token or real reward for unsatisfied votes, and don't use satisfaction as a threshold for beta approval. You can adjust for the resulting change in the token economy by increasing what you give the author for kata completion by a user.

Quality vs. Satisfaction for beta approval

I believe you are better off measuring quality, rather than satisfaction, as a requirement for kata approval. Beta kata are, by definition, not in a fixed state and are seen and experienced by a very specific kind of user with different expectations and biases than the larger user base. So not only do you compromise the relationship between your measurement and the construct of satisfaction by using it for beta approval, you are using a smaller group of systematically different users to measure a different object, now breaking as many of the rules for good measurement practice as you can :). This is not to say that there aren't successful, good practice methods for measuring satisfaction for a beta product, but I hope you can see how they are not particularly viable here, and how their results are not used to evaluate the later complete product.

Measuring quality

Clearly, though, you want some standards for a beta kata, but again, I think the standards you need are about quality. You already have two measures of quality in place (issues, and moderator review), they just need some improvement, and some work by the community to come to a consensus. As far as the measurement methods are concerned (public comment and expert review), you're right on target given the number of users who typically see a beta kata before it is approved. So, good news for @jhoffner, it can be done without new functionality! Ideally, here's how you would do it (it's not as hard as it sounds).

  1. Identify the attributes of kata quality, as a community. The best practices stub might be a place to start, but it is missing several things that the community clearly values, e.g., test coverage.
  2. Identify the standards for those attributes. E.g., for the test coverage attribute, "includes random tests" might be a required standard. For clarity and style, "uses markdown" might be a standard. Since we're using public comment and, ultimately, expert review, they do not all need to be objective. But they should include guidelines (e.g., if an attribute is novelty, and a standard is "not a duplicate", guidelines would define a duplicate -- is it that a copy-and-pasted solution passes both kata? that some other kata uses a similar concept? somewhere in between?).
  3. Decide how to use the standards. Here is where you can decide to add functionality if you want. A simple version might be to just list the required standards on the wiki and have moderators document pass or fail for each requirement in a comment. A more complex version that would require added functionality might include transforming the standards to a numeric rating and storing those ratings in the database for users or admins to use.
  4. Communicate the standards, both to moderators and to authors
  5. Implement
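The "more complex version" in step 3 could look roughly like the sketch below. Every attribute and standard named here is a placeholder of my own; the real list would come out of the community process in steps 1 and 2:

```python
# Placeholder standards list -- the real attributes/standards would come
# from the community consensus process, not from this sketch.
STANDARDS = {
    "test_coverage": ["includes random tests", "includes edge-case tests"],
    "clarity": ["description uses markdown", "examples are provided"],
    "novelty": ["not a duplicate of an approved kata"],
}

def score_review(checks: dict) -> float:
    """Turn a moderator's pass/fail checklist into one numeric rating.

    `checks` maps attribute -> {standard: passed}; anything a moderator
    didn't mark is treated as a fail.
    """
    results = [checks.get(attr, {}).get(std, False)
               for attr, standards in STANDARDS.items()
               for std in standards]
    return sum(results) / len(results)

review = {
    "test_coverage": {"includes random tests": True,
                      "includes edge-case tests": True},
    "clarity": {"description uses markdown": True,
                "examples are provided": False},
    "novelty": {"not a duplicate of an approved kata": True},
}
print(score_review(review))  # 4 of 5 standards met -> 0.8
```

The stored ratings would then be queryable per attribute, which is the advantage over a free-text moderator comment.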

If this is something the community thinks would be beneficial, and if @jhoffner would consider using it, I'd be happy to help.

Blind4Basics commented 6 years ago

Hi,

Additional note: votes are NOT really anonymous since one can see who completed the kata. Except if the completions are too near in time, it's easy to deduce who voted what. And I don't see a way to avoid that for now.

@mentalplex : real qualitative answer and concern. 'Hope we'll find a way through all of this. :) 👍

jhoffner commented 6 years ago

Thanks @mentalplex for the thoughtful analysis.

The recent updates to the beta process help serve as a stop-gap solution for now, but I think a complete rethink of the process is what will eventually happen. This was mentioned some months back, and I still think the best approach might be to combine satisfaction and issue reporting into one system that acts as a survey.

This is a rough sketch, but something like:

I think that would take most of the ambiguity out of the process.

Some things that would need to be considered before this could be implemented:

Obviously this approach would be a huge project.

ghost commented 6 years ago

@jhoffner I appreciate the thorough reply. You've obviously been putting a good deal of thought into the best way to assess beta kata.

I think you can keep things fairly simple if you start the approach with a clear statement of the goals for reworking beta (and perhaps all) kata assessment. There's a lot that could be done here, for both codewars and Qualified, but you can end up wasting a lot of time re-inventing the wheel.

jhoffner commented 6 years ago

Good point @mentalplex.

My goals for a re-design are:

Ultimate Goal

General Goals

Some specific goals

This is a working start, I'll probably revise this.

jhoffner commented 6 years ago

Also, I've been thinking through description/instruction best practices, mainly for Qualified but I would like to get some for Codewars as well. Codewars is going to have more of a fun tone than Qualified's content but I think there should still be a lot of overlap.

My initial starting point here is to get that more formalized and up on the wiki. I've created a new issue if anyone wants to collaborate on what should be in there.

dolamroth commented 3 years ago

What about going with several scales at the same time? Like,

Choosing appropriate tags while in beta would also be nice.