TL;DR Update - 30/Oct/2023

Given the length of this thread, and how the process evolved in the meantime since the original thread was started, these are the current expectations from the judges:

Final Rank

>= 80% Rank A
>= 60% Rank B
< 60% Rank C

Score

For each valid instance:

H +20
M +10
L +5
NC +2
GAS-L +5
GAS +1

Penalties

Subjective issues that are considered invalid should not be penalized, but they can be ignored. If not sure, an issue should be ignored, and not penalized.

For each invalid instance:

H -20
M -10
L -5
NC -2
GAS-L -5
GAS -1

Disputed Issues

This section should not be awarded or penalized, but it can be used while judging. It is recommended to double-check a disputed finding before penalizing the referred one, as the former may be invalid.

Post-Judging

It's expected that the judge will share a judging sheet with detailed information about the winner's score (e.g. this report) The judge should also send each bot crew their respective detailed judging sheet when asked on DMs.

Original Thread

It is really important to have a coherent standard between bot races, and this proposal is an attempt to define some guidelines for the judges.

Reports are starting to be very long, and so it is starting to be very hard to judge if the main repository contains lots of lines: sometimes it's simply not possible to review every issue for every bot, as it would result in millions (!) of lines to be read by the judge in less than 24h.

I propose the following phases that should be handled by the judge in every bot race.

Phases:

Risk alignment (first pass)
Issue sampling (second pass)
Report structure analysis
Scoring

1. Risk alignment

The main goal of this phase is grouping/clustering the risk of each issue by looking at its title: this is to ensure that the next phase is fair.

If a bot crew submits an issue with a wrong risk rating, it should be upgraded/downgraded accordingly. If this happens multiple times on the same issue, the judge MAY issue a warning. If this happens again after the warning, the issue is treated as invalid.
If two bot crews submit the same issue with a different risk rating, the judge should apply the same risk rating to every duplicated issue.

GAS issues should be marked as [GAS-L, GAS-R, GAS-NC] by the bot crews with the following criteria, based on the optimization:

GAS-L: storage or >1k gas
GAS-R: calldata/memory or 100-1k gas
GAS-NC: everything else

2. Issue sampling

We should decide on a specific % of issues that should be sampled in each category, with a minimum of m and a maximum of M issues to be reviewed.

These numbers (m and M) should be constant for every race, and they should be chosen so that it is possible to judge any contest in less than 24h.

The judge will review a max of M * categories and a minimum of min(largest_report_issues, m * categories) issues, where categories = [H, M, L, R, NC] = 5 and largest_report_issues = total number of issues of the largest report, per report.

For each sampled issue, the judge should leave a comment choosing from:

A (valid)
B (valid, but with false positives with range (1 - 20)%)
C (invalid, plain wrong, or false positives > 20%)

The judge should pay attention to the number of instances if this issue has duplicates. If this number is less than 50% of the number of the best duplicate, it should be penalized by reducing the rank, once (e.g. A -> B, or B -> C).

This is to avoid bots that submit just a single instance of an issue to avoid false positive penalties.

3. Report structure analysis

This step ensures that the report is well-written and readable. A report with bad formatting is not very usable, so reports should also focus on structure quality.

This is a subjective judgment, but it is split into categories to ensure fairness.

The judge should review the following, giving a score between A and C:

Structure

Good examples: follows the table of content best practices, the report flow is easy to follow

Bad examples: issues are put in random risk order, lots of links/references, no formatting

Content Quality

Good examples: Good grammar, good descriptions, conciseness

Bad examples: Wall of text, fails to correctly explain the issue's impact, zero references for non-obvious issues

Specialization

Good examples: able to understand the repository context, able to describe an impact for the specific protocol

Bad examples: generic descriptions of issues, generic issues that are technically valid but are impossible to occur due to how the project work

4. Scoring

The main goal for the bot race is to generate a report for out-scoping common issues: quantity and quality should be the main criteria, while risk rating should be secondary.

Main reason is to avoid having a report winner that finds multiple high issues but almost zero low/NC.

For these reasons, risk rating should be flat instead of percentage based.

Main scoring formula:

    score = 0
    for each risk_category:
        score += issue_rating_avg * risk_category_qty * risk_category_multiplier
    score += report_structure_score

issue_rating_avg formula:

    score = 0
    for each sampled_issue:
        if judging A:
            local_score = 1
        if judging B:
            local_score = 0.5
        if judging C:
            local_score = 0
        score += local_score
    score /= sampled_issue_qty

risk_category_multiplier formula:

    if risk H:
        score = 9
    if risk M:
        score = 7
    if risk L:
        score = 5
    if risk R:
        score = 3
    if risk NC:
        score = 1

report_structure_score formula:

    score = 0
    for each report_criteria:
        if judging A:
            local_score = 7
        if judging B:
            local_score = 4
        if judging C:
            local_score = 0
        score += local_score

The bot score is then applied to a curve, to give the final score that will be used to calculate the rewards.

Winner: highest score A: at least 80% of the winner's score B: at least 60% of the winner's score C: at least 40% of the winner's score D: less than 40% of the winner's score

Rate C and D get zero rewards.

A bot is disqualified (can't compete in races until it is qualified again) if any of these conditions occur:

Single D grade
Does not participate in 2 races in a row

Some bots are submitting lots of false positives, as well as rules that are just plain wrong, or that have been flagged as wrong multiple times (e.g. flagging fee-on-transfer when there aren't any ERC20s in the project). The consensus from the bot owners seems to be that penalties for false positives should be large, and judges shouldn't give bots a pass on these, so that bots are incentivized to fix their rules immediately. It is very easy to fix a rule to exclude conditions, so not doing so is just sloppy and should not be rewarded. If there are penalties, the bot owners cannot claim that they didn't know there was a problem for them to fix. Does anyone have a different opinion on this, or have a different suggested approach?

I agree with this. Recent winning bot reports have been showing lots of invalid findings but are still granted points in the final score, as currently each judge have their own method of reviewing bot reports. I believe that until there will be some sort of post-judging qa for the selected winning bot race, we will continue to see reports containing findings that are just wrong. At the current state, I don't think that increasing the penalty of invalid findings will help much if current winning bot reports contain invalid findings that are clearly wrong but yet are still given points as valid findings with no option for a post-judging qa to invalidate them.

Recently, we had a lengthy discussion on Discord because many of us felt that the new judging rules were unfair, especially since they were changed without notice

As such, it is highly recommended that Judges provide bot crews with an appropriate time frame for notice if they intend to introduce new judging rules to the standard

Problems

We are currently facing two problems:

Bot races are turning into a farming game that provides little value to sponsors. They don't want to read the 100th variant of "missing natspec" in their documentation. Many bot crews are splitting their existing rules into subrules that offer no value, yet they still earn many points for them.
Bot crews are dissatisfied with the current penalties. When penalties are too low, it results in spammy reports. On the other hand, when penalties are too high, people shy away from innovating with new rules due to the fear of being punished. This is especially true for H/M rules as a single one of them that gets invalidated will cost you the bot race.

Solutions

Problem 1

I believe NC issues should be worth much less than they currently are. Sometimes, they are worth 1 point, other times 2 points: this is excessively high and encourages bots to submit many of them.

If you look at QAs submitted during contests (by wardens), you'll notice that almost no one submits NC issues since the introduction of bot races: so this problem is already solved. Now, the bot report should also be valuable to the sponsor.

I don't think this is the case with NCs, especially if they are rules that have been split and provide little to no value.

Consider reducing NC points to 0.5 or 0.25 points each to address this behavior, so that more impactful findings (L+) are the main focus.

Problem 2

Penalties tied to risk ratings need to be removed. In this way, we incentivize new rules (especially M/H) as bot crews won't fear being penalized, making it safer to try to innovate.

Bot crews are also "downgrading" their own risk rating to minimize the possibility of being penalized: if they are wrong, they lose points, but if they are right, their finding will get upgraded without any risks. This is not good.

Quality is also crucial, and a report filled with errors is unacceptable. The quantity of false positives is key here:

Consider penalizing based on the number of invalid issues on an exponential curve.

Formula:

$I^{penaltyCoeff}$

Where $I$ represents the number of total invalid issues, no matter the risk rating. A good starting $penaltyCoeff$ could be a number between 1.3 and 2.0

Examples:

Hound has the following invalids: 2M + 3L + 4NC -> 9 invalids The penalty with $penaltyCoeff = 1.6$ should be $9^{1.6} = 33$

This penalizes a large number of false positives, examples:

$(15^{1.6} = 76)$ $(20^{1.6} = 120)$ $(25^{1.6} = 172)$

Not opposed to this change, which would help to prevent farmers from winning on that alone, but I don't know that it solves the sponsor quality issue, since there are still points given
Interesting proposal. If it were applicable to only NC/GNC, I'd agree, but I think an additional facet to the problem of false positives in M/L is that they're highly visible to sponsors, and thus false positives there should be penalized harshly. I am skeptical that a possible penalty on a new rule is reason enough not to write one. If it's a new rule, it should be thoroughly tested before being applied, and you should probably have a category in your own bot backend for 'probationary' rules, where you're forced to manually verify it for every race, until is no longer probationary. I have a 'confidence' field for every rule, and I don't even see the ones that have 100%

Problem 1: splitting rules. I think the best solution is we standardize the titles of issues. easily solves this problem(also solves the severity problems one L and another M or NC. All of them should point as same). your solution is reducing points. personally, I disagree. splitting rule happened when the judge pointed it this kind of issues and this situation forced almost every bot racer to split their rules.

Problem 2: Penalties penalty_key: an exponential or everything like higher penalties I think penalties per severity are wrong and I agree with you. and the number of invalid issues is important. But the way you suggest to penalized I disagree. (the penalty_key) for example, Bot A has 10 false positives and Bot B has 5 false positives. Bot A is 10x and bot B 5x (assume x = -3). here bot A penalized more than Bot B and that is enough. But the penalty_key solution just tried to remove bot A not penalizing it.

How can be innovative when just one or 2 people win the all races? with this kind of solution just removing new bots and new efforts and new rules.

negative thinking: This just crossed my mind the older racers just have the right to win. When they just try to change the rule based on their bot strength. p.s. I participated on 28/38. I may be considered an old racer.

Listing out all of the rules does not solve the problem of deciding when to split them. We all regularly see and part take in the farming to keep up everyone else. It's a matter of agreeing on a rule for when it's appropriate to split a finding, which is completely orthogonal to standardizing titles.

I am very glad that the henry bot is participating and winning, because it shows that if you put in the work, you can get first place many times. Just because I have a vision for what's good for races long term, and tailor my approach to it, that doesn't mean my suggestions are solely for my benefit. It's not like I'm sitting there resting on my laurels - I put in a ton of work every contest to make sure I look at prior findings and incorporate them, including the effort of fixing rules that are sometimes broken, which not everyone does.

My hope is that everyone eventually fixes all of their rules, and we can compete on the basis of who has the newest valid finding, rather leaving it up to whether the judge was too overworked to identify all of the false positives.

code-423n4 / org

Bot race judging standardization proposal #103