
Code4rena Governance and Discussion

There are little incentives to produce quality descriptions #30

Closed: dmitriia closed this issue 1 year ago

dmitriia commented 2 years ago

Issue

The most elaborate descriptions are usually the ones chosen as the front-line submission and posted to the reports, but their authors aren't currently compensated for this in any way, since all duplicates carry fixed, equal weights. At first glance, why should they be? The process looks somewhat random: the best description gets the publicity while the others don't, and the core task is finding the vulnerabilities, not reporting them.

But repeated contest after contest, this skews wardens' motivation. Producing quality reports takes time and effort: writing a good PoC involves a reasonable amount of coding and can take a whole day, during which another issue could have been found. Wardens who do not produce detailed descriptions win out over time by adding extra issues, compared with those who invest time in their write-ups. That effort is not compensated and becomes a public good, adding value to the whole project by raising the quality of the reports.

Imagine that all wardens concluded that 'spamming hints' is the optimal behavior and adhered to it, i.e. there was no one left to free-ride on. The quality of reports would degrade; they would become unreadable to anyone who hadn't spent several hours investigating the code. Some valid issues would not be understood by the projects and would be declined, because without proper detail it's not always clear what is going on, even to the authors.

Proposal

What can be done? Currently the front-line issue is chosen by the judge, and this works well; the problem is that the weights are equal for all duplicates.

The ideal situation would be grading on a curve within each issue, as is currently done for QA reports: the front-line issue automatically receives a rating of 100, while the others are given a 0-100 score by the judge.

The current curve would work fine here, since it's quite rare for an issue to be found by more than 20 wardens. It would handle the case of several quality descriptions of the same issue well: descriptions rated 90 and 100, for example, would be compensated nearly the same. It would also handle the core problem of one good description alongside a number of short or inarticulate ones that still catch the main point: ratings of 10-20 and 100 would be paid quite differently. At the same time, when there is no stark contrast between the findings, it would not take much of the judge's time to assign, say, 80 or 90 across the group, nearly replicating the current situation.
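To make the arithmetic concrete, here is a minimal sketch of how judge-assigned 0-100 grades could split a single finding's pot among its duplicates. The pot size and the simple proportional weighting are illustrative assumptions, not C4's actual award formula:

```python
# Illustrative only: split one finding's reward pot among duplicates in
# proportion to judge-assigned 0-100 grades (the primary is graded 100).
def split_pot(pot: float, grades: dict[str, int]) -> dict[str, float]:
    total = sum(grades.values())
    return {warden: pot * grade / total for warden, grade in grades.items()}

# One quality description (the primary), one near-equal write-up, two terse duplicates.
print(split_pot(1000.0, {"primary": 100, "dup_a": 90, "dup_b": 20, "dup_c": 10}))
# primary and dup_a are paid nearly the same; dup_b and dup_c far less.
```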

What do you think?

JustDravee commented 2 years ago

Encouraging good reports is good for the community too. I know that findings containing a PoC and really explaining the ins and outs are much easier to understand (and, consequently, to learn from) than a finding that says something correct but is hardly understandable for beginners.

Competing on quality as much as on quantity would also raise the platform's reputation, I'd guess, and that's not to be underestimated.

The curve might make it unfair though; I'm not certain that a score range of 80 to 90 wouldn't end up treated exactly like a range of 0 to 100. Doesn't the curve just compare wardens to each other? I might be wrong here. But a warden earning twice as much as another for a high-risk finding of similar quality might be unfair. If the quality is very different, then yes, absolutely pay double ^^.

Just my 0.02 USDC here. Loved reading this issue-proposal btw 👍

djb15 commented 2 years ago

I generally agree with this proposal; it sounds like a sensible way to improve the quality of reports and dissuade wardens from submitting poor-quality ones. In my opinion the latter is what we should be trying to reduce as much as possible... if low-quality reports are rewarded less than high-quality reports, then the wardens who spend time crafting really great reports are already rewarded more by comparison.

Rewarding on a curve sounds great in theory, but I suspect judges will end up picking bands over time anyway rather than spending lots of time ranking all the reports. So my suggestion would be to have 4 bands for report quality: excellent, good, average and poor. Suggested scores for each band (a quick sketch of how they could feed into the split follows the list):

Excellent: 100
Good: 80
Average: 50
Poor: 20
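A minimal sketch of how these bands could plug into the same proportional split; the band scores come from the list above, while the pot size and example wardens are placeholders:

```python
# Illustrative only: the judge picks a band per duplicate; payouts still
# follow the same proportional split a 0-100 curve would give.
BAND_SCORES = {"excellent": 100, "good": 80, "average": 50, "poor": 20}

def split_by_band(pot: float, bands: dict[str, str]) -> dict[str, float]:
    scores = {warden: BAND_SCORES[band] for warden, band in bands.items()}
    total = sum(scores.values())
    return {warden: pot * score / total for warden, score in scores.items()}

print(split_by_band(1000.0, {"alice": "excellent", "bob": "good", "carol": "poor"}))
```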

Ultimately, finding a bug is the most important thing, but even a unique bug with a poorly written report should be rewarded appropriately on the scale. Also, there might be a tendency to reward long, detailed reports over concise but descriptive ones, depending on the judge's personal preferences; longest doesn't always mean best. Provided the description of what makes a "quality report" is made clear for judges, I'd hope this should be ok.

EDIT: I see there was a lot of discussion in the #wardens Discord channel that overlaps with this reward-bucket suggestion.

gititGoro commented 2 years ago

I'm very much in favour of the bands alteration to the proposal. For QA, I have a spreadsheet with a series of criteria such as documentation quality, importance of the finding, and so on. Each category has a 1-5 rating, and the numbers have clear meanings. For instance: 1 = poor effort but explains the problem, 2 = explains the problem and links to code, 3 = links and provides a PoC, etc.

Then I have a formula that weights the columns according to importance and produces a sum. The highest-scoring warden is then set to 100 and everyone else to a percentage of that, and you get your curve.
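Something like the following minimal sketch; the criteria names and weights are placeholders rather than the ones actually used in the spreadsheet:

```python
# Illustrative only: weight per-criterion 1-5 ratings, sum them, then
# normalize so the top warden scores 100 and the rest fall on a curve.
WEIGHTS = {"explains_problem": 1.0, "links_to_code": 1.5, "has_poc": 2.0}  # placeholder weights

def curve_scores(ratings: dict[str, dict[str, int]]) -> dict[str, float]:
    raw = {w: sum(WEIGHTS[c] * r for c, r in crits.items()) for w, crits in ratings.items()}
    top = max(raw.values())
    return {w: 100 * score / top for w, score in raw.items()}

print(curve_scores({
    "alice": {"explains_problem": 5, "links_to_code": 5, "has_poc": 4},
    "bob":   {"explains_problem": 3, "links_to_code": 2, "has_poc": 1},
}))
```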

This works well for QA because there are numerous issues reported, providing a wide variance in scores.

For med and high bugs, there is only one issue reported, and I've noticed that very often many wardens' reports say almost exactly the same thing. So separating them into unique scores out of 100 is not going to work; there will be a lot of clumping.

Right now the clumping is forced to be binary: the best report (original) and everyone else (duplicates).

Bands add granularity to this distinction without forcing judges to assign arbitrary, artificially precise numbers.

VAD37 commented 2 years ago

Both the M/H grading-curve and bucket solutions from @dmitriia and @djb15 are impractical. They sound like good ideas, but how are we going to grade the quality of reports? Which one is better, and by how much?

Is a first-time warden's two-hour report, with a lengthy PoC and a full test script, worth the same as a veteran's five-minute report repeating the same exploit for the 20th time? Is spamming multiple angles of a single exploit worth the same as a concise report?

It's simply hard to convert quality into points. Just a reminder that we threw away the OWASP method because, I'm pretty sure, no one was really using it.

Due to the recent influx of new wardens with no guidance or experience, it is understandable that people want to draw a line and set a standard. But pushing to filter out low quality and reward excellence is not the best way to go. We have no barrier to entry, so new people will just spam more reports, lowering judging quality through work overload.

Overall, @gititGoro's approach is what I think is fairest right now.

VAD37 commented 2 years ago

From my personal experience, the best way to improve quality is to put the focus on best practices and make feedback as accessible as possible. I could only learn what makes a good report after seeing the results and the published report for the first time, a month after my first contest. The feedback/learning cycle is simply too long. Meanwhile, I stumbled in the dark for the whole month, churning out reports in different styles to see what works. I'm quite sure I am also part of the "low-quality" problem.

Better examples of what is expected from wardens, or some of the best practices we like, would help fast-track a lot of people toward making better reports.

djb15 commented 2 years ago

The whole idea of having buckets/bands is that a judge wouldn't have to decide exactly "how much" better or worse a report is than another; rather, is said report similar in quality to another report, and are there any reports that are markedly better or worse?

Due to the recent influx of new wardens with no guidance or experience, it is understandable that people want to draw a line and set a standard. But pushing to filter out low quality and reward excellence is not the best way to go.

I think this proposal is more around keeping the best wardens engaged and motivated, so absolutely I think rewarding excellence is the right thing to do. But you're right, we shouldn't reward excellence without giving new wardens the resources and support to improve and become part of the top 10%...otherwise it's just an elitist club!

Provided the description of what makes a "quality report" is made clear for judges, I'd hope this should be ok

Quoting myself, I agree we should probably extend this to a clear guide for wardens. Currently wardens are advised to read previous reports to see what makes a good-quality report, but maybe it would be good to have a section in the documentation with sample reports at different quality levels? That would make it clear what the community/judges consider a high/low-quality report (or, if the band suggestion were used, a sample report for each band).

sockdrawermoney commented 2 years ago

I've been in conversation on this topic with Scott Lewis (who designed all the rest of Code4rena's mechanisms) and have been bringing everyone's concerns/ideas/suggestions into those discussions.

Here's what we landed on.

  1. As @dmitriia suggests here, all issues will be graded on the 0 to 100 scale, just as QA and gas reports are, now including medium- and high-severity findings. The 0 to 100 scale gives judges the flexibility to have their own style. (Some prefer buckets, others prefer granularity; the 100-point scale and curve allows both.)
  2. In addition to awarding on a curve, only 'passing' grades will be eligible to be included in awards (sketched below). (Don't worry: before implementing this, we will be working on a rubric outlining the threshold required for a passing grade.)
  3. For judges, we'll be adding 0 to 100 labels in GitHub to make this grading task straightforward. (FYI, we are also in the midst of creating a Chrome extension that will let us do all judging in GitHub and get rid of the judge spreadsheet that duplicates work, complicates the process, and creates more surface area for errors.)
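A minimal sketch of point 2, the passing-grade filter, layered on top of the proportional split; the passing threshold here is a placeholder, since the actual rubric is still to be written:

```python
# Illustrative only: duplicates below a passing grade are excluded from
# the awards entirely; the rest share the finding's pot on the 0-100 curve.
PASSING_GRADE = 30  # placeholder; the real threshold would come from the rubric

def award_eligible(pot: float, grades: dict[str, int]) -> dict[str, float]:
    eligible = {warden: grade for warden, grade in grades.items() if grade >= PASSING_GRADE}
    total = sum(eligible.values())
    return {warden: pot * grade / total for warden, grade in eligible.items()}

print(award_eligible(1000.0, {"primary": 100, "dup_a": 70, "dup_b": 10}))
# dup_b receives nothing; primary and dup_a split the pot.
```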

CloudEllie commented 2 years ago

From @djb15:

Currently wardens are advised to read previous reports to see what makes a good-quality report, but maybe it would be good to have a section in the documentation with sample reports at different quality levels? That would make it clear what the community/judges consider a high/low-quality report (or, if the band suggestion were used, a sample report for each band).

This strikes me as a really smart idea, for multiple reasons. C4 staff can support with this, but I wonder if it would work better as a community-led project? Certified+ wardens are already involved in post-judging QA, so they would be well equipped to identify good examples of reports.

0xSorryNotSorry commented 1 year ago

It's been nine months since the first recommendation was posted, and some points no longer align with the original issue. Demand to participate in C4 contests is huge, which means a valid finding is often found by numerous wardens (sometimes 50+). This leads to the undesirable condition of paying $0.01 to every warden who spotted the bug.

My recommendations kick in at this point. I didn't open another issue since this one covers the relevant points as well.

Proposal

I strongly believe there should be a payout threshold for every individual valid finding. If a finding has 30+ spotters and the individual payouts would fall below that threshold, the pot should not be distributed to all of them; in other words, C4 might consider not distributing $0.01 to each warden. The share allocated to the valid finding is already known, so it could instead be distributed to the top warden, or the top X wardens, who handled the issue neatly with clear PoCs and demonstrations.

TL;DR: the valuable time spent on a well-worked valid submission and a less worked-out submission should not receive the same payout. I know the primary takes 30% more, but I'm talking about not diluting the payout for a particular finding. As C4 is not a bug bounty platform, our priority should be providing the best-chosen reports to the sponsors, as stated in the first post of this issue. Carrying out this proposal can lead to:

  1. Improved submissions for well-known issues
  2. Improved submissions for bulk-spotted issues
  3. Fewer low-quality reports

So how high should the payout threshold be to incentivize the wardens?
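A minimal sketch of the threshold rule described above; the threshold amount and the 'top X' count are placeholders, not proposed values:

```python
# Illustrative only: if splitting the finding's pot among all spotters would
# pay each less than a minimum amount, redirect the whole pot to the top-X
# best-written submissions instead.
MIN_PAYOUT = 5.0   # placeholder threshold, in USD
TOP_X = 3          # placeholder: how many top submissions get paid

def threshold_split(pot: float, wardens_by_quality: list[str]) -> dict[str, float]:
    """wardens_by_quality is ordered best-written report first."""
    if pot / len(wardens_by_quality) >= MIN_PAYOUT:
        return {w: pot / len(wardens_by_quality) for w in wardens_by_quality}
    winners = wardens_by_quality[:TOP_X]
    return {w: pot / len(winners) for w in winners}

# 50 spotters of the same bug: only the three best-written reports are paid.
print(threshold_split(100.0, [f"warden_{i}" for i in range(50)]))
```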

kartoonjoy commented 1 year ago

Per the Autumn 2023 C4 Supreme Court verdicts, the verdict on this issue is:

No change of rules is necessary. Partial scoring addresses incentives adequately.

Link to verdict: https://docs.google.com/document/d/1Y2wJVt0d2URv8Pptmo7JqNd0DuPk_qF9EPJAj3iSQiE/edit#heading=h.7jclxpy0if5s