DadeKuma commented 1 year ago

TL;DR Update - 30/Oct/2023

Given the length of this thread, and how the process evolved in the meantime since the original thread was started, these are the current expectations from the judges:

Final Rank

>= 80% Rank A
>= 60% Rank B
< 60% Rank C

Score

For each valid instance:

H +20
M +10
L +5
NC +2
GAS-L +5
GAS +1

Penalties

Subjective issues that are considered invalid should not be penalized, but they can be ignored. If not sure, an issue should be ignored, and not penalized.

For each invalid instance:

H -20
M -10
L -5
NC -2
GAS-L -5
GAS -1

Disputed Issues

This section should not be awarded or penalized, but it can be used while judging. It is recommended to double-check a disputed finding before penalizing the referred one, as the former may be invalid.

Post-Judging

It's expected that the judge will share a judging sheet with detailed information about the winner's score (e.g. this report) The judge should also send each bot crew their respective detailed judging sheet when asked on DMs.

Original Thread

It is really important to have a coherent standard between bot races, and this proposal is an attempt to define some guidelines for the judges.

Reports are starting to be very long, and so it is starting to be very hard to judge if the main repository contains lots of lines: sometimes it's simply not possible to review every issue for every bot, as it would result in millions (!) of lines to be read by the judge in less than 24h.

I propose the following phases that should be handled by the judge in every bot race.

Phases:

Risk alignment (first pass)
Issue sampling (second pass)
Report structure analysis
Scoring

1. Risk alignment

The main goal of this phase is grouping/clustering the risk of each issue by looking at its title: this is to ensure that the next phase is fair.

If a bot crew submits an issue with a wrong risk rating, it should be upgraded/downgraded accordingly. If this happens multiple times on the same issue, the judge MAY issue a warning. If this happens again after the warning, the issue is treated as invalid.
If two bot crews submit the same issue with a different risk rating, the judge should apply the same risk rating to every duplicated issue.

GAS issues should be marked as [GAS-L, GAS-R, GAS-NC] by the bot crews with the following criteria, based on the optimization:

GAS-L: storage or >1k gas
GAS-R: calldata/memory or 100-1k gas
GAS-NC: everything else

2. Issue sampling

We should decide on a specific % of issues that should be sampled in each category, with a minimum of m and a maximum of M issues to be reviewed.

These numbers (m and M) should be constant for every race, and they should be chosen so that it is possible to judge any contest in less than 24h.

The judge will review a max of M * categories and a minimum of min(largest_report_issues, m * categories) issues, where categories = [H, M, L, R, NC] = 5 and largest_report_issues = total number of issues of the largest report, per report.

For each sampled issue, the judge should leave a comment choosing from:

A (valid)
B (valid, but with false positives with range (1 - 20)%)
C (invalid, plain wrong, or false positives > 20%)

The judge should pay attention to the number of instances if this issue has duplicates. If this number is less than 50% of the number of the best duplicate, it should be penalized by reducing the rank, once (e.g. A -> B, or B -> C).

This is to avoid bots that submit just a single instance of an issue to avoid false positive penalties.

3. Report structure analysis

This step ensures that the report is well-written and readable. A report with bad formatting is not very usable, so reports should also focus on structure quality.

This is a subjective judgment, but it is split into categories to ensure fairness.

The judge should review the following, giving a score between A and C:

Structure

Good examples: follows the table of content best practices, the report flow is easy to follow

Bad examples: issues are put in random risk order, lots of links/references, no formatting

Content Quality

Good examples: Good grammar, good descriptions, conciseness

Bad examples: Wall of text, fails to correctly explain the issue's impact, zero references for non-obvious issues

Specialization

Good examples: able to understand the repository context, able to describe an impact for the specific protocol

Bad examples: generic descriptions of issues, generic issues that are technically valid but are impossible to occur due to how the project work

4. Scoring

The main goal for the bot race is to generate a report for out-scoping common issues: quantity and quality should be the main criteria, while risk rating should be secondary.

Main reason is to avoid having a report winner that finds multiple high issues but almost zero low/NC.

For these reasons, risk rating should be flat instead of percentage based.

Main scoring formula:

    score = 0
    for each risk_category:
        score += issue_rating_avg * risk_category_qty * risk_category_multiplier
    score += report_structure_score

issue_rating_avg formula:

    score = 0
    for each sampled_issue:
        if judging A:
            local_score = 1
        if judging B:
            local_score = 0.5
        if judging C:
            local_score = 0
        score += local_score
    score /= sampled_issue_qty

risk_category_multiplier formula:

    if risk H:
        score = 9
    if risk M:
        score = 7
    if risk L:
        score = 5
    if risk R:
        score = 3
    if risk NC:
        score = 1

report_structure_score formula:

    score = 0
    for each report_criteria:
        if judging A:
            local_score = 7
        if judging B:
            local_score = 4
        if judging C:
            local_score = 0
        score += local_score

The bot score is then applied to a curve, to give the final score that will be used to calculate the rewards.

Winner: highest score A: at least 80% of the winner's score B: at least 60% of the winner's score C: at least 40% of the winner's score D: less than 40% of the winner's score

Rate C and D get zero rewards.

A bot is disqualified (can't compete in races until it is qualified again) if any of these conditions occur:

Single D grade
Does not participate in 2 races in a row

trust1995 commented 1 year ago

Overall I like this proposal a lot and would look to use a tool which helps streamline this process. I have a problem with (1) - this step undermines the sampling at (2) and to do properly, will take a prohibitively large amount of time. It is simply infeasable to perform over all submissions of all reports. Instead, it should be baked into sampling, where judges need to deeply look at the issue anyway. If it's risk is misaligned, it will penalize the entire category.

Applying a curve seems like a very good idea. We'd need to carefully consider the D->C % threshold, as I'm concerned a very strong #1 report it could DQ an unintended number of racers.

IllIllI000 commented 1 year ago

Thanks for writing this up. I like it.

I agree that 1 is not going to work when there are a lot of issues. For the sampled ones, if the issue didn't get an 'A', the judge should be required to give specific reasons why, and I think that'll take care of things over time.

One thing I'd suggest is to not allow the judge to choose which issues to sample. There could be an algorithm where for each submission, the judge just has to put in the agreed upon seed for an RNG (e.g. timestamp of first submission markdown file) along with "#H,#M,#L,#R,#N,#GAS-L,#GAS-R,#GAS-NC", and a a script spits out which issues need to be looked at, so everyone is at the mercy of randomness, which makes it a little more fair and lets the judge focus on the judging of the issues. If it's mechanical like this, we can go back through the prior races and see if we agree with the output, based on Alex's scorings, before applying it live.

We currently are assigning H,M,L,NC ourselves, but not R. If we go with sampling and grading based on the categories, we'd all have to tag each finding using all of the tags.

I'm not sure that D needs to be separate from C since one does not get bumped out unless the person in the qualifier got an A or B: https://discord.com/channels/810916927919620096/1093914558776758403/1106173876527960064

trust1995 commented 1 year ago

Having a script with an unpredictable seed which is verifiable post-contest is a great idea, actually talked about it privately with some folks and this makes sense.
I thinking merging NC and R also makes sense as the gap between them is subtle and just adds another layer of sifting for everyone.
I'm not aware of the qualifier ratings, how does that work though? The judge doesn't handle the qualifier
Verifying the algorithm with Alex's scoresheet as benchmark is probably not the right way to do it. The result won't tell us anything as there's built-in precision loss with random sampling. Also if the results are different, that won't mean that one set of results is better than the other. What does make sense is that we feed some parameters on previous contests, see what the score would have looked like, and vote for the best parameter set. Naturally there's still subjective judging to do in regards to classifying quality and valid/invalid.

IllIllI000 commented 1 year ago

I'm ambivalent. It definitely is a factor for smaller contests when there are fewer findings
I believe the judge does the qualifier too, but I don't have any information on what has gone on so far, except that only one team has been promoted so far
valid/invalid was what I was saying we'd use Alex's scoring sheet for, since he scored every submission for the early contests

DadeKuma commented 1 year ago

Overall I like this proposal a lot and would look to use a tool which helps streamline this process. I have a problem with (1) - this step undermines the sampling at (2) and to do properly, will take a prohibitively large amount of time. It is simply infeasable to perform over all submissions of all reports. Instead, it should be baked into sampling, where judges need to deeply look at the issue anyway. If it's risk is misaligned, it will penalize the entire category.

Applying a curve seems like a very good idea. We'd need to carefully consider the D->C % threshold, as I'm concerned a very strong #1 report it could DQ an unintended number of racers.

In hindsight, it may be infeasible with large codebases. However, the main problem is this scenario:

an issue is sampled from a report, but it's upgraded/downgraded by the judge
it has duplicates in other reports
other report samples don't include this issue, so they are judged with their original risk rating

This seems very unfair, especially for H/M issues as the value is higher. How would you solve this problem?

Another solution could be to standardize the issue titles, so grouping is way easier and can be partially/fully automated.

IllIllI000 commented 1 year ago

maybe count downgrades as invalid and don't do upgrades, forcing the work on the bot owner for future races to get it right.

GalloDaSballo commented 1 year ago

Pre-Preface Why are we here?

This initiative was done to reduce the tedium of judging QAs and Gas reports for contests

While appreciating the effort done by bot makers

IMO the ideal outcome of Bot Races is that "Spammy QAs" are done by Bots, and all "Creative QAs" are done by Wardens

Preface - Keep It Simple

I appreciate that you want to define standards for classification but I believe the suggested solution is overly complicated and adds to much cognitive load to a judge

I would suggest a simplified system, which has served me well over the past 2 years

The idea is to do the least amount of work to make sure that the ranking is properly done (greedy algorithm)

And it's based on math

Criteria for Judging

All reports are judged based on their findings, plus or minus a discretionary bonus

Discretionary Bonus

The discretionary bonus is necessary in case of ties, and to add the ability to rate "intangibles", for example a POC for a High severity or similar

The idea of enforcing a rule without knowing its implications is destined to fail, so I would not put a value to such a discretionary bonus and would leave it vague on purpose.

Rating by Severities

Each finding may have a different value, the idea of enforcing such values in rules is also limiting

That said, in practice, very rarely bot reports differ in the contents of a specific finding (e.g. address(0) checks), and the idea of enforcing or counting instances at this time is effectively impossible

For these reasons, I have used a reliable rating that has served me well:

L - Low Severity / Good Finding / Good Gotcha / Interesting => 5 points
R - Refactoring / Noticeable => 2 points
NC - Non-Critical / Information / Minor => 1 point
I - Ignored / Lame / Padding => 0 points

I would then add detractions or bonuses, in a discretionary way.

Lack of points for Meds and Highs

While I have scored HMs in Bot Reports, I don't think it's appropriate to "limit them" by using specific rules, it may be best, at this time, to use a more discretionary rating based on the importance and quality of the finding.

e.g. ERC4626 being rebased is basically a $0 finding, while others may be more valuable

Ultimately if the Staff starts awarding HMs by putting them in the main pot, this whole issue will be solved in a more appropriate way rather than getting into the absurd scenario of having to compare a report comprised of High only and one comprised of hundreds of Lows

In the meantime, ossifying a ruleset with a moving target sounds to me like a even bigger setup for disappointment.

Anyhow, after agreeing on SOME way to score different findings relative to each other (a Legend if you will), the Judge can then proceed to apply the Legend to Pre-Filtering

Step 1 - Pre-filtering

Filtering is about determining if a report will make it, I doubt anyone can do this off the top of their head the first time, they can try, but they should be open to the idea that they will have to judge all issues.

Pre-filtering algorithm

Each bots should have their Highest Impact Findings Judged first, these will score the highest, a lack of High Impact findings signifies a High Likelihood that the Bot will not make it.

A simple algorithm to judge in Pre-Filter Phase is: FACE_VALUE_SCORE = SCORE(Top Findings) + AT_FACE_VALUE(Rest of Findings) - Obvious Mistakes + BONUS* *Where BONUS can be negative

Some bots will not pass this basic scrutiny, this process SHOULD offer them the highest chance to make it, and should make it obvious that they won't (since we are awarding the other findings as valid, hence that's a optimistic score)

Step 2- Grinding it out

Phase 2 consists of Judging all bots until enough of a difference was found to determine relative rankings

Technically speaking, rankings can be defined as the % score in relation to a Top Score (typically the Best bot, sometimes a value that resembles it, adjusted by the Judge at their discretion)

In order to Grind It Out, I would recommend the Judge to pick a few of the Top Candidates and Fully Judge them

Alternatively they could judge each bot a little bit a time, while maintaining the heuristic for FACE_VALUE for the remainder of the findings, eventually the math checks out.

Step 3 - Tie breakers

The issue that we face at this time is when a Tie Breaker happens, which may be immutable to the Judges own mistakes, lack of tooling, lack of time, etc..

For those scenarios, the rules should be kept lax, as to allow them to be handled in a way that is fair to the participants

An example was my request to have 3 winners as the top 3 bots were separated by 1 point each, which I believe was the fairer decision

In Conclusion

I think Judges' Criteria should be maintained more subjective than what was proposed

However, Judging Process should follow the recommendation I made, as a way to prove that the Judge themselves have followed a clear, demonstrable procedure, and in doing so were fair to the Bot Makers

The Future

If we agree on this, then we can find ways to make Judges job faster, while keeping in mind that Judges are necessary for perhaps 20% of the process, and the remaining 80% is just there because the process is not automated yet

IllIllI000 commented 1 year ago

You've always done an amazing job at both the Gas and QA, and now bot race scoring; you look at and score every issue (at least up to the point where it will make a difference in the final outcome). In my experience though, most other judges don't put in this amount of effort, and just go with a gut reaction or minimal sampling, without digging into any of the details. For example, I put a ton of time into filtering out invalid findings and things mentioned in the known issues section, but I'm not sure that everyone does that, and I don't think most judges notice the difference either. If all judges were forced to be as rigorous as you are, I don't think we'd have any issue with that, but because there is variation among judges, how would you handle cases where the judge appears to be being relying on gut more than metrics? I believe the rules DadeKuma came up with above are an attempt to force judges to have a threshold minimum amount of effort we all agree is fair, given time constraints.

GalloDaSballo commented 1 year ago

You've always done an amazing job at both the Gas and QA, and now bot race scoring; you look at and score every issue (at least up to the point where it will make a difference in the final outcome). In my experience though, most other judges don't put in this amount of effort, and just go with a gut reaction or minimal sampling, without digging into any of the details. For example, I put a ton of time into filtering out invalid findings and things mentioned in the known issues section, but I'm not sure that everyone does that, and I don't think most judges notice the difference either. If all judges were forced to be as rigorous as you are, I don't think we'd have any issue with that, but because there is variation among judges, how would you handle cases where the judge appears to be being relying on gut more than metrics?

There are multiple layers that I have to address for your question:

Can we agree on the Process?
If yes, how do we make it happen?

Can we agree on the Process?

We can agree on this only if we start by agreeing that "fully judging the findings until lack of reasonable doubt" is the only fair approach.

If we don't agree on this, then nothing after would matter

If we do agree though, then we can solve this as a problem of workload / tech.

If yes, how do we make it happen?

Ultimately there are ways to diff-judge bots that don't require judging all of them, especially when most bots are very close to each other. If we agree on the end-goal, then we can work on tools (which I have built for myself btw, so I know what I'm talking about), that can be used to avoid the grind.

On Judge Pay / Workload

Judges are considered Highly Impartial Experts, pay is commensurate to that

That's great and I am grateful for the opportunity But 80% of the work (maybe even 99%) in judging Bot Races could be delegated to an assistant, without any loss of value

The Judge would spend (at most) an hour judging the "salient findings", the Assistant would "grind it out, for experience" and then the Judge would review the process at the end.

Replace "assistant" with "auto-diff tool" and the process can be performed very rapidly, as long as we agree that that's the process that should be followed.

On Gut Instinct

I believe that even if someone else applied different weights or categories, perhaps a Critical vs High vs Med vs Low type classification, it would still end up being mostly fair and mostly positive in the long run

I would rather see the process follow and it breaking than discuss the worst case scenario of not following it

On The Long Term

Long term, bots should get rid of the "crap" that is sent in QAs, so the true worst case scenario for Bot Races is that each Bot contributes their own unique High Severity and they are all scored differently, making it impossible to have a list of Known Issues

The above process achieves that, and tries to mitigate the issue of Judges having to crunch

If we frame it this way, I think we can make it happen

IllIllI000 commented 1 year ago

We can agree on this only if we start by agreeing that "fully judging the findings until lack of reasonable doubt" is the only fair approach. ... If we do agree though, then we can solve this as a problem of workload / tech.

100% agree. QA/Gas scoring in the normal contests hasn't moved towards this, so I assumed it wasn't a feasible goal. I'm totally on board if you can make it happen

GalloDaSballo commented 1 year ago

We can agree on this only if we start by agreeing that "fully judging the findings until lack of reasonable doubt" is the only fair approach. ... If we do agree though, then we can solve this as a problem of workload / tech.

100% agree. QA/Gas scoring in the normal contests hasn't moved towards this, so I assumed it wasn't a feasible goal. I'm totally on board if you can make it happen

I would say that from my experience most new judges have used a verifiable method, it's just that the overhead of typing it, etc.. is crazy (copy pasting headers, etc.. becomes heavy if you do it 1k times a week)

ChaseTheLight01 commented 1 year ago

If all repetitive and time consuming tasks such as copying headers can be identified and requirements to alleviate them are drawn out. I am more than happy to volunteer to build a toolkit to make judging easier and more efficient for judges going forward. I imagine this may require some standardisation of report layout to work.

Picodes commented 1 year ago

I 100% agree with the need for standardization so crews have something to optimize for. However, I'd like to point out that, quoting the C4 official doc: "The Lookout assigned to an audit will be responsible for reviewing and grading all Bot Race submissions within 1 hour of the Bot Race (i.e. within 2 hours after the audit launches). "

So the lookout is asked:

to review ~20 reports with 1k+ instances per report in a few hours
for a codebase he doesn't know anything about as the code was just disclosed
knowing that crews all copy one another so it's more and more difficult to differentiate as time goes

So wouldn't it be better to:

reduce the number of bot crews
make all bot reports public and out of scope: currently, findings that are not in the winning report are in scope, so they are submitted again during GAS and QA, which is really weird as the sponsor already "paid" for them.
give more time for grading so we can also hear the sponsor's opinion and we can do it when Lookouts know the codebase

Finally, do we really need to rank bots at every contest? By doing this we're highly incentivizing crews to run their bot and do some manual filtering to remove some false positives and improve the quality of their bots. But the idea of bot races should be to reach a point where proprietary bots could be run by the team and the whole thing could be automated. In addition to this, it's a bit unfair for crews whose timezone is hardly compatible with contest launches. We could grade them once in a while, during "qualifiers", and then every bot participating in a race would be paid according to its qualifying grade.

IllIllI000 commented 1 year ago

If only ranking sometimes, then you won't see any new rules from anyone until the next ranking race. Bots will also be more luck based (which judge and which contest happens to match that bot's rules best that one time). If you end up dropping rewards due to the rule withholding behavior/interval between ranking, then people will stop improving their bots and stop racing at a certain level of complexity, since the time investment won't match the rewards. The goal is preventing lots of wardens from submitting the same common or easy to locate findings, which itself will lower judging work and increase payouts for the remaining gas, qa, and hm. Adding more judges or increasing judging pay does not achieve that goal, but having a large pool of continuously improving bots, I believe, will

DadeKuma commented 1 year ago

I 100% agree with the need for standardization so crews have something to optimize for. However, I'd like to point out that, quoting the C4 official doc: "The Lookout assigned to an audit will be responsible for reviewing and grading all Bot Race submissions within 1 hour of the Bot Race (i.e. within 2 hours after the audit launches). "

So the lookout is asked:

to review ~20 reports with 1k+ instances per report in a few hours

for a codebase he doesn't know anything about as the was code just disclosed

knowing that crews all copy one another so it's more and more difficult to differentiate as time goes

So wouldn't it be better to:

reduce the number of bot crews

make all bot reports public and out of scope: currently, findings that are not in the winning report are in scope, so they are submitted again during GAS and QA, which is really weird as the sponsor already "paid" for them.

give more time for grading so we can also hear the sponsor's opinion and we can do it when Lookouts know the codebase

Finally, do we really need to rank bots at every contest? By doing this we're highly incentivizing crews to run their bot and do some manual filtering to remove some false positives and improve the quality of their bots. But the idea of bot races should be to reach a point where proprietary bots could be run by the team and the whole thing could be automated. In addition to this, it's a bit unfair for crews whose timezone is hardly compatible with contest launches. We could grade them once in a while, during "qualifiers", and then every bot participating in a race would be paid according to its qualifying grade.

I like this idea, it would cut the review time significantly. But as @IllIllI000 said, as a downside we would have less powerful bots as there is less incentive to improve them

In addition to what @Picodes proposed, I would add the following rules for the qualifier:

A large enough codebase so that luck is less impactful
An extensive review that analyses each issue to be sure that it's definitely a good bot (this doesn't apply if the judge is sure that the bot can't win)
The judging phase can take more than 24h

I think it might work very well. The only problem is that bot crews will optimize and manually filter their report for these qualifiers, so maybe we should enforce an API if the final goal is automation

sockdrawermoney commented 1 year ago

Maybe we should start by moving toward a standardized way of submitting automated findings individually to an api endpoint alongside the single report, THEN see what we can do with that in terms of improving the overall process and judging consistency.

Attributes could be:

title
severity
type
filename.ext#linenumber
functions cited
description
validity

What else?

Some additional thoughts:

Bots who identify and remove false positives are providing a service in helping give faster clarity to judges—essentially submitting a false positive with an invalid flag is a 'vote against', so these bots could then still submit those issues but flag them as invalid as it would help judging overall. Providing a list of automated false positives to wardens is also beneficial from the same angle in terms of reducing spam.
To that end, I could see a 'confidence' score as a nice-to-have.
imo "type" should default to being an agreed upon spec, probably with the ability for a bot to add a new proposed type if deemed necessary. In that direction, we are essentially integrating tomo's categorization standard into the regular audit submission form already, though I think our implementation there may have been a bit hasty and needs a little more thought. Using bots to standardize here might help us there, too.
The ideal end state is not just to reduce spam and to eliminate overwhelmingly duplicated issues, but also for bots to be able to provide wardens with valuable intel during their audit. We should eventually build something like a vscode extension that allows for the ability for all competing auditors to see bot findings and false positives while they are reviewing code.

Tagging @nlf here for visibility on the api side

IllIllI000 commented 1 year ago

Submitting all false positives feels too close to giving away what our rules are. Not sure how to solve that other than letting each bot do the first pass of judging, which necessarily would mean they see all findings. A big downside is that writing code to do this would take time away from writing rules, so it might not yet be worth the cost.

IllIllI000 commented 1 year ago

A separate note on judging: Capping the points earned by the top bot will lead to bots withholding new rules if they think they'll win by enough of a margin, which will lead to slower bot progress and not filter as many findings. Also, unlike the other bots, the top bot doesn't have any non-known findings that they can separately submit as QA/Gas reports

sockdrawermoney commented 1 year ago

letting each bot do the first pass of judging, which necessarily would mean they see all findings

say more on this — maybe it's actually a desirable approach

IllIllI000 commented 1 year ago

Assuming people's formats eventually settle, it should be possible to write custom parsers of the markdown for each person's report. Using that, a bot can write variants of each rule that will flag the ones it doesn't have in its own list of true positives, and report those. That way as we see things in other reports we don't like, we can write rules to flag those for the judge. Like I said though, a lot of work

sockdrawermoney commented 1 year ago

I think having a botrace judging botraces is actually the ideal. Would really like to see what we can do to get there sooner rather than later.

That said, there's still so much value for everyone in surfacing the false positives in the general audit. I definitely understand the secret-sauce angle, though.

What if we took a different approach and had a separate endpoint for submitting false positives and only publicized the common false positives (eg those covered by ~3+ bots)?

GalloDaSballo commented 1 year ago

For the sake of getting closer to consensus, it would be ideal to have Bot Makers judge Bot Races (even mock) as a way to make sure they understand the process and the complexities of it (grind)

For the sake of defining clear goals, it would be best to have Staff confirm if the goal of Bots is to get rid of "spammy" reports

This can be agreed upon by exploring this hypothetical: In the hypothetical scenario of:

A bot with a ton of Low impact findings
Another Bot with a few unique Highs and meds (and not GQs)

Which bot should win?

IllIllI000 commented 1 year ago

That said, there's still so much value for everyone in surfacing the false positives in the general audit.

What sort of false positives are you talking about? I was referring to mistakes in rules of other bots, but it sounds like you mean something else. If you mean to filter out just plain wrong findings that appear over and over (e.g. delete vs assigning zero, which is the same gas either way), we can add a section to the bot report listing applicable invalid findings. I'm not sure how to reward those though, since eventually they'll stop appearing as people learn

sockdrawermoney commented 1 year ago

What sort of false positives are you talking about?

Consider that new wardens dive in and try to submit as many issues as they can to see what sticks.

My assumption is that many of those people are submitting things based on automated tools and chatgpt.

Having a list of automated issues is good because it includes things that are invalid but likely identified by automated tools.

Bot racers being incentivized to remove false positives is good for bot competition (and exactly what we should be doing) but giving wardens a list of issues they should explicitly NOT submit whether those are valid or invalid is valuable for the purposes of the bot race, which leads us to @GalloDaSballo's question:

For the sake of defining clear goals

<3 <3 <3

it would be best to have Staff confirm if the goal of Bots is to get rid of "spammy" reports

This can be agreed upon by exploring this hypothetical: > In the hypothetical scenario of:

A bot with a ton of Low impact findings

Another Bot with a few unique Highs and meds (and not GQs)

Which bot should win?

Outstanding question. I see it like this in terms of priorities:

High volume of low-effort submissions removed from being presumptively submissions.
Advancing automation to the point of identifying certain classes and patterns of vulnerabilities.

1 is the most important and most valuable goal (particularly near-term), but 2 is also a goal and an important one.

Therefore I see it like this:

doing a competitive job of 1 (eg you're roughly ~80% of the bot with the most low quality submissions) constitutes the threshold for being a good bot.
hitting 2 is what makes the great bots great.

JeffCX commented 1 year ago

I think this proposal needs to be finalized,

I judged one bot race,

here is my experience and question

https://docs.google.com/spreadsheets/d/1DTL5bjKPA58Y7ulETYE8tGD1WOkwotJlZlS8ar6Td_4/edit#gid=1809999941

What is the cutoff from grading A and grading B, I used 80% and 70%,

Some bot race using 80% and 60%

In Some bot race use 90% for grading A

also,

should we limit to number of grade A and grade B for each bot racing?

I purpose, we use sort the best score of bot submission and pick the best score as the winner,

then the following three bots get grade A,

the following three bot get grade B

we can don't use 80% and 60% cutoff at all

should there be post QA period for bot racing?

because judging bot race is subject to time constraints

the exact formula of grading.

SorryNotSorry suggest this one, which make sense:

mediums were 10 points, nice lows were 5, and nc's were 1 points. For gas optimization, any optimization greater than 100 gas savings were 10 pts, 100 to 50 were 3 and others were 1 points.

also we are seeing the bot can find high finding, I think a high finding can worth 3.3 medium

I also want to attach the private conservation I have with IIIII

https://gist.github.com/IllIllI000/e5d324783a00b2d4d7586dd8ef98a9a1

Instead of bias and favoritism, I would love to call is subjectivity

but I want to make this post to help standardize the process and minimize the subjectivity

picking up the best bot race is one thing, fairly reward everyone without mistake is another :)

JeffCX commented 1 year ago

Also, one point to add, the submission format of the bot can be standardarized as well

In fact, I suggest we merge the Low and non-critical / informational finding

the boundary between low and non-critical / information finding sometimes is very subjective and blur

the workload would not be scalable if we try to upgrading every NC that is low to low and downgrading every low that is NC to NC

so the submission format can be:

high severity finding,

medium severity finidng

and low severity finding

and gas saving finding

and we can enforce he bot submission to show how many gas saved by making certain changes as well to make judging the gas saving easier

IllIllI000 commented 1 year ago

As stated above capping points has negative side-effects. I'd rather go all out and find new rules, than have to massage my output so that I don't waste rules unnecessarily. I don't think top three should automatically be ranked A. Curious to hear specifically what the other racers think here.

I agree that the cutoffs should be standardized. I'm fine with 80/60/40 that DadeKuma suggested, but what is done for normal QA reports? It would probably make sense to match that.

If QA reports have Low/NC, I don't think bot races should be any different

DadeKuma commented 1 year ago

As stated above capping points has negative side-effects.

I agree, the max cap should be avoided.

I don't think top three should automatically be ranked A

Agreed, sometimes the cutoff is large and there is only an A grade plus the winner, so giving A to other bots would be unfair if their score is not good enough.

Other times, there is only a 1-point difference, and it would be too harsh to punish other bots for that if they score close to the winner.

If QA reports have Low/NC, I don't think bot races should be any different

Agreed, plus it's easy to pump out a lot of NC issues that don't give that much value compared to lows.

JeffCX commented 1 year ago

Suggest points and cuttoff for Grade A / B / C

if risk H: score = 33 if risk M: score = 10 if risk L: score = 2 if risk NC: score = 1

For gas optimization, any optimization greater than 1000 gas savings were 10 pts, 500 to 100 were 3 and others were 1 points

more than 80% of the best bot, A, more than 60% of the best bot B, otherwise C

JeffCX commented 1 year ago

What is the goal of bot racing? I think the score and points give bot crew different incentive,

do we want to incentive the bot to submit more impactful H / M or higher volume of GAS / NC / L? Then we can determine the scoring based on that

ChaseTheLight01 commented 1 year ago

For GAS I'd say use the same gas saving numbers but have the pts as 1, 2, 5 . instead of 1, 3, 10. I personally think 10 is too high to give to a GAS finding :)

IllIllI000 commented 1 year ago

With the previous scoring, it didn't seem like there was an incentive to improve gas past the basics, and to instead to pump out NCs. Having a higher score would change that. I'm not wed to the latest numbers, but do think it would improve gas report completeness to increase things from what it originally was. Maybe match point values of NC,L,M whatever they may end up being?

IllIllI000 commented 1 year ago

What is the goal of bot racing? I think the score and points give bot crew different incentive,

do we want to incentive the bot to submit more impactful H / M or higher volume of GAS / NC / L? Then we can determine the scoring based on that

https://github.com/code-423n4/org/issues/103#issuecomment-1582661499 catching 80% of low-effort/spammy submissions, but then reward extra that has been done on top of this. One option is we could count H/M a as lows for dividing the reports into the top 20% and bottom 80%, then for all the top 20%, add back much higher point values for H/M. It's a little more complex, but doable with =IF()

CloudEllie commented 1 year ago

Sharing some thoughts from the staff perspective:

First, I see staff's role as focusing on how to achieve the overall best outcome for the ecosystem.

In order to get clarity on that, it's important that we approach changes to scoring and awarding thoughtfully and deliberately. Anytime we've tried to rush a change to scoring or awarding mechanisms, we've run into unexpected challenges, so we've learned to take care and consider various scenarios.

There are a few different things being discussed here and I'd like to make sure I understand all of the open questions and how they interact. I'd appreciate hearing from bot crews and judges whether this list effectively captures the core concerns:

Scoring standardization: I'm hearing that bot crews would like to see a more predictable and consistent scoring model for how Highs, Mediums, etc. are weighted to determine bot score and therefore, ranking.
Re-judging, i.e. do we allow or encourage judges going to sponsors for validation, and adjusting score accordingly? And/or take this even farther and institute post-judging QA, for maximum thoroughness? (My personal pros/cons list -- Pros: would give bot crews more perfect feedback and ensure bot judging is maximally fair. Cons: massively increases effort required to judge; extends audit, judging, and awarding timelines.)
Should we share all bot reports as known issues? (FWIW, on this point my gut instinct is that while this seems logical at face value, in practical terms it would make the "known issues" list overwhelmingly long and full of duplicates, and therefore unusable to wardens competing in the audit.)
Should judges/lookouts use random sampling to judge bot reports? (Seems this has not really taken hold as a norm, as it's in tension with bot crews' interest in more thorough and rigorous judging and scoring.)

There's a deeper set of questions beneath this, which @JeffCX summarized nicely:

What is the goal of bot racing? I think the score and points give bot crew different incentive,

do we want to incentive the bot to submit more impactful H / M or higher volume of GAS / NC / L? Then we can determine the scoring based on that

As @sockdrawermoney wrote here, I think we actually want to incentivize both -- which means that we do need to iterate on the awarding mechanism for bot races (and possibly scoring too).

Before we push to iterate on the current set of defaults -- which is actually still pretty new and emergent -- I'd like to a) ensure that we're fully understanding everyone's needs here, and have had an opportunity to reflect on how best to meet those needs alongside creating the best possible outcomes for the audits.

GalloDaSballo commented 1 year ago

In relation to my proposal I have shared a Tool I have used to judge Bot Races:

Demo

https://twitter.com/GalloDaSballo/status/1676522421812178945

Walkthorough (with videos):

https://docs.google.com/document/d/13jJ-hLRqrC9fLYsA_b6pJ927mQY_uBHObEI7mSWPFOk/edit?usp=sharing

App (scrape MDs and Score them)

https://pharaoh-omega.vercel.app/

Scripts

Score, compare and auto-judge reports https://github.com/GalloDaSballo/string-regex

evokid commented 1 year ago

With the previous scoring, it didn't seem like there was an incentive to improve gas past the basics, and to instead to pump out NCs. Having a higher score would change that. I'm not wed to the latest numbers, but do think it would improve gas report completeness to increase things from what it originally was. Maybe match point values of NC,L,M whatever they may end up being?

if Higher scores only relays on H/M then we could have a gap to increase the efficiency of Gas issues specifically, from few reports I can see the focus on L/NC issues. maybe I am wrong but I would go with @IllIllI000 on this (not sure if I got his point correctly).

IllIllI000 commented 1 year ago

In cases where a rule has only (or maybe even a majority of?) false positives, that rule should get points taken off by the judge that sees it. If points aren't taken off, the bot writer has no incentive to fix the rule, and may just leave it as-is in their report generation. In a later race, a less thorough judge may miss the bug (e.g. due to time constraints), and the bot would get more points than it deserves. The other bots have no way to dispute such cases after the fact because the outputs of the non-winning bots aren't shared, so only the judge can prevent this scenario. Invalid rules provide negative value to the sponsor.

Getting rid of these false-positives will also make future judging easier

evokid commented 1 year ago

"after the fact because the outputs of the non-winning bots aren't shared, so only the judge can prevent this scenario." to be honest all reports must be shared otherwise a progress for learning bots would be really limited since I can't check most of the reports and spot specific invalid or valid cases that I need or could have a discussion. Ty for the explanation.

0xA5DF commented 1 year ago

Hey I’ve read through this thread, and would like to suggest a way to streamline the process, shorten the time it takes to publish the report, and also make the judging more accurate.

Given that the bots are just finding instances of common issues - we can simply create a list of all of those common issues, and have the bots report in a data-format (e.g. json/yaml) all the instances of those issues. If a bot comes up with a new type of issue that’s not on the list - they can ask to add it before the contest begins.

With that, instead of publishing a report from a single bot - we can simply merge all of the reports into one (we can set a minimum threshold to avoid false positives - e.g. at least 2-3 bots reported the same instance), convert it to a human-readable format and publish it. Both the merging and conversion can be done by a script, shortening the time from the race end to publishing.

Judging the race can be done independently of publishing without any pressure. Any finding that isn’t unanimous would be judged for its validity. The final score of each bot would be composed of both positive scores for valid findings and negative score for false positives.

This is similar to the API that Sock suggested, but it’s a more ‘lightweight’ solution as it doesn’t require too much infrastructure (i.e. building an API point)

DadeKuma commented 1 year ago

@0xA5DF

Given that the bots are just finding instances of common issues - we can simply create a list of all of those common issues, and have the bots report in a data-format (e.g. json/yaml) all the instances of those issues. If a bot comes up with a new type of issue that’s not on the list - they can ask to add it before the contest begins.

This is a great idea and I've built a tool that facilitates it: https://www.botracer.xyz/issues

We need the collaboration of all bot racers/judges to complete this list. As for now, I've added most of the public duplicate findings, but there is a lot missing, as only the winner report is published.

This is the JSON list: https://github.com/DadeKuma/bot-racer/blob/main/public/data/findings.json

IllIllI000 commented 1 year ago

I don't like any solution that consists of centralizing rules for anyone to copy, including the findings.json above. One way to not do this, is to have bots vote for valid/invalid based on partial hashes https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/ . Each bot would create a blob for each finding in a standardized format, hash it with a per-contest salt, and then submit the first n characters of the hash of each finding. That way, a bot could submit a list of hashes of their valid findings (and label the finding with that hash for easy comparison), and a list of any they think are invalid (without any title etc), and the scoring could be automated without knowing the details of the findings. The hash intersections could be published for each bot so that if we know an invalid finding, we can be confident that it was counted as invalid

edit: I've implemented this solution here. A few judges have used it and have found it helpful

0xA5DF commented 1 year ago

I don't like any solution that consists of centralizing rules for anyone to copy

Can you elaborate more on this? Given that each contest there's one report that becomes public the list of issues isn't much of a secret I guess

We can also award bots for adding more issues to the list - e.g. each time an instance of this issue is found the 'creator' of the issue gets part of the reward

IllIllI000 commented 1 year ago

I can't help the one report being made public, but I don't want to contribute to a public analyzer

DadeKuma commented 1 year ago

But it's not a public analyzer. It's just a collection of issue titles to avoid cherry-picking duplicates manually. There isn't a description, lines of code, or anything that could be used to track how the ruling for that finding works.

There are just the titles that are available on the table of contents in each report. It's just an aggregation of data that is already publicly available.

IllIllI000 commented 1 year ago

I understand, but it's still on the path towards something I don't want

0xA5DF commented 1 year ago

We can keep the list private and share it only with relevant members of C4 (bot owners, lookouts etc.) if that helps

ChaseTheLight01 commented 1 year ago

I propose raising the barrier for Grade A from 80% to 90%. I feel grade A should be considered #1 contenders so 90% seems more fitting to me. This further incentives bot racers to keep maintaining their bot as A) there is greater financial incentive as achieving a grade A would result in greater rewards B) maintaining a grade A bot would be more challenging hence encouraging development. I am very curious to what other Bot Racers think about this, please share your thoughts!

IllIllI000 commented 1 year ago

Given the discrepancies seen in the latest races' results, wrt differences in severity and bot specificity, I don't think changing the score cutoffs would be fair without having post-judging QA

IllIllI000 commented 1 year ago

Some bots are submitting lots of false positives, as well as rules that are just plain wrong, or that have been flagged as wrong multiple times (e.g. flagging fee-on-transfer when there aren't any ERC20s in the project). The consensus from the bot owners seems to be that penalties for false positives should be large, and judges shouldn't give bots a pass on these, so that bots are incentivized to fix their rules immediately. It is very easy to fix a rule to exclude conditions, so not doing so is just sloppy and should not be rewarded. If there are penalties, the bot owners cannot claim that they didn't know there was a problem for them to fix. Does anyone have a different opinion on this, or have a different suggested approach?

DadeKuma commented 1 year ago

Agreed, but false negatives should be zero (i.e. invalidating a valid finding). If in doubt, the judge should not invalidate the finding, unless they are 100% sure it's invalid.

code-423n4 / org

Bot race judging standardization proposal #103

TL;DR Update - 30/Oct/2023

Final Rank

Score

Penalties

Disputed Issues

Post-Judging

Original Thread

1. Risk alignment

2. Issue sampling

3. Report structure analysis

Structure

Content Quality

Specialization

4. Scoring

Pre-Preface Why are we here?

Preface - Keep It Simple

Criteria for Judging

Discretionary Bonus

Rating by Severities

Lack of points for Meds and Highs

Step 1 - Pre-filtering

Pre-filtering algorithm

Step 2- Grinding it out

Step 3 - Tie breakers

In Conclusion

The Future

Can we agree on the Process?

If yes, how do we make it happen?

On Judge Pay / Workload

On Gut Instinct

On The Long Term

Demo

Walkthorough (with videos):

App (scrape MDs and Score them)

Scripts