code-423n4 / org

Code4rena Governance and Discussion

Similar exploits under a single issue #8

Closed IllIllI000 closed 9 months ago

IllIllI000 commented 2 years ago

There are cases where there are multiple instances of an exploit within a single contest, and it's not clear whether they should be filed as separate submissions or as one submission with multiple sub-issues. Disregarding whether the findings are valid or not, these are four examples where a single category of exploit was submitted by the warden multiple times and each submission was awarded separately. Recently in the Code4rena Discord chat, wardens were recommended to file such instances as a single submission with multiple sub-issues, but based on the timeswap findings this was called into question. It's not clear whether doing so will be awarded the same way as if a warden files separate issues, nor how such combined submissions will be identified in the final report, given that currently each separate submission is given an issue number.

I am proposing that wardens SHOULD combine similar issues into one submission with multiple sub-issues, and that judges MUST award each instance separately, and that the final report MUST indicate separate issue numbers for each instance (e.g. [H-01,02,03]). Each separate instance MUST be clearly identified by the warden in the submission; the judge must not have to interpret the warden's intent.

sockdrawermoney commented 2 years ago

Looking forward to judge input on this one.

If I recall correctly, the answer to this has changed over time in terms of judge preferences—largely based on the fact that we had a stretch where we experienced so many issues. It may change yet again in light of the new QA report which has seemed to significantly reduce the total number of issues.

The specific problem in this issue (how to handle multiple submissions) and the general problem (C4 coordinating a consistent answer in how nuances are handled while still prioritizing judge discretion) are the kinds of things this rulebook repo is aimed at solving, and getting folks to successfully move in that direction will help us get where we want to be.

I want to be clear that I'm neither endorsing @IllIllI000's proposed solution nor opposed to it, but I am interested to see the tradeoffs surfaced from judges' perspectives and for us to reach a point of clarity here.

GalloDaSballo commented 2 years ago

Consideration: Whether we present 1 or many findings has no relevance to the sponsors. As long as the sponsor sees all the instances of the vulnerability, presenting it as 1 or many makes no difference.

Meta: If one warden (WardenA) submits the same type of findings in multiple locations, that's ok and they can be judged accordingly, especially if the findings have POC.

If another warden (WardenB) submits all the findings under one, then there's a very high likelihood that their submission will become the primary issue.

At that point WardenA's submission will become a duplicate of WardenB's, which means we can give them points for one and then invalidate all the others.

So to me this is a self-resolving issue in the style of a prisoner's dilemma. If all wardens submit all findings separately, they will dilute the pool more but also get all findings marked valid. However, if one warden bulks the findings, the now-duplicate submissions are invalid, hence the pool is less diluted.
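To make the dilution trade-off concrete, here is a small sketch. The pool size and the equal-split-per-finding rule are simplifying assumptions for illustration, not C4's actual award formula:

```python
# Toy model: a pool split equally per distinct credited finding, with each
# finding's slice split equally among the wardens credited with it.
# (Assumed rules for illustration only.)
POOL = 100.0

def payouts(credited):
    """credited: {warden: set of distinct finding ids they are credited with}."""
    findings = set().union(*credited.values())
    slice_per_finding = POOL / len(findings)
    return {
        warden: sum(
            slice_per_finding / sum(f in other for other in credited.values())
            for f in ids
        )
        for warden, ids in credited.items()
    }

# Nobody bulks: three instances count as three findings, diluting warden C's
# unrelated finding "x".
split = payouts({"A": {"i1", "i2", "i3"}, "B": {"i1", "i2", "i3"}, "C": {"x"}})

# B bulks, so A keeps one duplicate and the rest are invalidated: the three
# instances collapse into one credited finding, and C's finding gains value.
bulked = payouts({"A": {"bulk"}, "B": {"bulk"}, "C": {"x"}})
```

Under these assumptions `split` gives A and B 37.5 each with C at 25, while `bulked` gives A and B 25 each with C at 50: bulking shrinks the dilution, at A's and B's expense.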

To me this is the appropriate way of addressing this as this is not really relevant to sponsor but is purely for points

GalloDaSballo commented 2 years ago

TL;DR: If one warden bulks, all are bulked. If nobody bulks, judges should not bulk on behalf of wardens

alcueca commented 2 years ago

I've also applied the same rule as @GalloDaSballo in my judging, as I find it fair and self-enforcing. Over time, wardens should learn to bulk, particularly if a point bonus is given to the warden whose submission is taken as the main one.

IllIllI000 commented 2 years ago

@GalloDaSballo and @alcueca how would you handle the case where WardenA's and WardenB's bulks have some issues in common and some not, e.g. WardenA:{1,2,3,4} and WardenB:{3,4,5,6}, where the full set of issues is {1,2,3,4,5,6}?

GalloDaSballo commented 2 years ago

I have yet to see such a black-and-white scenario.

Typically you'd have one more generic submission that encapsulates all 6 (or most; perhaps a few are missed), then the rest get duped out and each warden gets 1 submission counted as valid.

If you have a specific example I'm happy to talk it over, but I don't think I've ever seen your example actually happen in a contest.

Most of the time some wardens submit one issue per instance and some submit the bulk issue, and they all get bulked. I have yet to see a situation where the broader bulked issue doesn't cover some aspects and this kind of set-of-findings reasoning is needed.

IllIllI000 commented 2 years ago

I have a feeling we'll see more examples now that this has been documented for wardens to see. I guess the relevant judges can post updates here if/when it happens - hopefully before awarding and reporting are done. My main worry was a bulk being counted only as the first issue in its submission text, with the other issues listed further down being ignored, but it sounds like the judges (at least you two) are aware of this and have handled it in a reasonable way in the past.

IllIllI000 commented 2 years ago

@GalloDaSballo actually, isn't it still worth it for each warden to split? If there are six issues and the bulk has five of them, the warden that finds the sixth and submits it separately (with or without submitting the others) will get half of the pool rather than one sixth of the pool, will they not?
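Rough numbers for this scenario (again assuming, purely for illustration, an equal split of this exploit class's pool per judged finding):

```python
# Hypothetical arithmetic for the scenario above: a pool of 100 for this
# class of exploit, split equally per judged finding (an illustrative
# rule, not the real award calculation).
POOL = 100.0

# The five bulked instances judged as one finding, plus the separately
# submitted sixth instance as another finding: two findings total.
bulk_share = POOL / 2       # the bulking warden's finding
sixth_share = POOL / 2      # the warden who split out the sixth instance

# Versus all six instances judged as six separate findings:
per_instance = POOL / 6
```

So the warden holding the sixth instance gets 50 rather than roughly 16.7, which is the incentive to split that is being pointed out here.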

sockdrawermoney commented 2 years ago

For the sake of collecting thoughts on this topic in one place, I'm going to add the notes I have from the first conversation here on bundling issues, which took place in May 2021:

Rajeev:

Question: Should a finding that is present in multiple contracts or different lines of the same contract be reported as different issues or in one issue where all lines/contracts are listed?

Rajeev:

IMO: They should be reported as different issues (for C4 contests specifically) because of three reasons:

1) Someone could have reported that finding for contract-1 but missed it for contract-2 (same for different parts of the same contract). Reporting unique issues (i.e. the same type of finding for different locations, separately) makes judging easier so rewards can be calculated accordingly.

2) The same finding could be a real/bigger problem only for one contract and not for another, as determined by the sponsor/judge. So the risk rating could differ between locations although the finding type is the same. Calling out all locations in one issue could mix up risk ratings, get the entire submission dismissed, or make it harder for judges to evaluate.

3) Separate issues will perhaps have a lower likelihood of the project missing out on addressing the findings in all locations than if they were clubbed together.

onewayfunction:

I would default to submitting multiple issues. For example, reporting all instances of potential overflows as a single issue would hide the fact that some instances may be critical (resulting in loss of user funds) while others could be benign. Also, having wardens lump them all together encourages laziness from wardens. When they find one instance of a particular class of vulnerability, they could effectively claim to have found all of them (unless we specifically require that they list all instances when submitting the issue).

Rajeev:

The only reason I see for lumping is that it may help projects look at the issue and fix all related instances without having it go over any more issues.

Rajeev:

I think there are other aspects that are more debatable, subjective and perhaps contentious:

1) Sponsor/project readiness: Some projects might come to C4 post-audit(s) and close to deployment, where code readiness is higher. This impacts wardens' time/effort because low-hanging findings are already reported/fixed (effectively bug-bounty like). Other projects might come to C4 pre-audit, where readiness is lower and sometimes functionality is even missing. This again impacts wardens' time/effort because they have to make assumptions about what may be implemented/deployed to report findings, and might get stuck on low-hanging findings without getting to the more critical ones. Both are reasonable C4 targets, but there might be different expectations and efforts involved.

2) Risk rating: There is no perfect way here. Yes, OWASP is helpful, but it's ultimately a subjective call. This matters in regular audits when projects want to publish their reports, because many high/meds may be perceived poorly (despite being fixed). This is trickier in C4 because I'd assume most wardens are doing this in their "spare" time and don't necessarily have the time/information/context to justify their ratings with detailed exploit paths. They have an incentive to mark it higher because they're security purists (I have seen enough downplaying, and seemingly code-quality issues becoming critical on deployment) and/or seek potentially higher rewards. This is where the burden falls on judges to evaluate with the sponsor's help and downgrade the rating if necessary. We need to balance the burden and expectations on quality/quantity/ratings to make it a win-win-win for sponsors-wardens-judges.

jvaqa:

If we make it a norm to submit multiple issues rather than lumping similar issues, that would potentially take up a lot of the judges' time. Even if they could fairly quickly categorize them, I still want to be respectful of the judges' time and not burn them out.

sock:

Heard from Nick Johnson a bit ago. His take as judge re: Rajeev’s question regarding multiple issues is that they should indeed be split but then deduped prior to the report.

JMukesh:

I agree with Rajeev, since the same bug present at different lines can affect the whole contract differently. For example, let's take a simple one: zero address validation.

  1. Lack of zero address validation in the transfer() function of an ERC20 token can result in loss of funds.
  2. Lack of zero address validation in an IsMemberOrNot(address) function means it will always return false if the input is address(0).

In these two scenarios the vulnerability is the same, but the impact is different.

Thunder:

tbh I prefer grouping the issues, but that's because I am usually lazy and short on time and don't want to write huge paragraphs of text about every single instance. However, after reading Rajeev's arguments I see a lot of good reasons why separate issues would be better. I guess we can't fully eliminate the human factor here; it comes down to the judges and project team to review and assign scores to the issues. But we should definitely think about how to improve it so it won't be possible to gain an advantage over other participants just because of a lack of clarity.

cmichel:

Same, I like the idea of grouping them by (type-of-issue, severity), so you can still list the occurrences you believe to be high-severity (and that should earn more points) in a different issue. It would also be less to judge (as it shifts the grouping of issues from judges/sponsors to wardens to some degree), and I'm all for processes that help the judges; it seems like judging doesn't scale well at the moment because we get more and more wardens & submitted issues.

Lucas Manuel:

I think one possible solution would be to group non-critical findings (comments, NatSpec, documentation) in submissions and leave the low/medium/high severity bugs as separate issues. When it comes to code changes I am more of a fan of discrete references to bugs in separate issues. There is the potential for a lot of issues of the same type to show up in this case, which could be a burden on the judge. I think it's best to have the burden of de-duping/grouping issues go to the sponsors. They can do a first pass where they acknowledge/dismiss/confirm issues, mark duplicates, and group issues of similar types that are not exact duplicates. Then, in the final report, similar issues under the same group can be consolidated so that the final report is neater.

I think the sponsor creating a new label with a short description of the "type" of issue would make sense

strictly-scarce:

If the finding (and corrective action) is the same across multiple instances, i.e. doing an !address(0) check for an address passed as a parameter, which is clearly a pattern across multiple contracts, then there is no reason to list them out separately. There should be a single filed finding, with a list of affected functions/contracts.

With 320 filed issues on a repo with 10 contracts, having 7-10 filed issues per essentially-identical finding creates noise, and the noise impacts our ability to parse them all.

0xsomeone:

My 2 cents on the "one-per-issue":

I have extensive experience with smart contract auditing companies and it really depends on the situation.

A simple (lack of) zero address check for the transfer of ownership for example is not severe and would be grouped with a zero address check for an ERC20 transfer.

A lack of a zero address check for something more elaborate, such as the ecrecover function, would need to be detailed in its own finding, as it can have severe consequences, i.e. impersonating the zero address.

In general, findings should be grouped by similarity as long as their severity is equivalent.

strictly-scarce:

findings should be grouped by similarity as long as their severity is equivalent.

Agreed; this is the essence of our objection: so many duplicate findings are equivalent in form and function.

sockdrawermoney commented 2 years ago

Judges discussed this in early February as well and it seems there is actually pretty solid consensus around the approach @GalloDaSballo is describing.

CloudEllie:

JtP and I were just discussing a judging question that I'd love your thoughts on.

If a warden submits one finding in several different issues (trying different angles on the same issue), do you typically invalidate the duplicates, or allow multiple submissions as valid within the same reportID grouping?

In particular, if other wardens submitted the same issue, then leaving valid dupes from the same warden would mean that Warden A would get a bigger share of the award. But if only Warden A submitted it, then I guess dividing the award among multiple submissions actually penalizes them a bit?

I think invalidating the dupes is the better approach, but would appreciate hearing how you've approached this in the past. I think it would be helpful to build consensus around this as well.

Alex The Entreprenerd:

I tend to leave the findings as valid until another warden submits a bulk submission. If there's a bulk submission, that becomes the primary. I will then pick one of the "single findings" and mark it as a duplicate of the bulk submission, and, since a duplicate of the same issue from the same warden doesn't make sense, will then invalidate the other findings.

Alberto:

I do exactly the same

0xean:

yup, this is also how I try and handle it as well

0xleastwood:

+1 to this. However, if it is clear that the warden is gaming submissions I'll only allow for one of their submissions.

Thunder:

I always group similar submissions and invalidate duplicates from the same warden :goose:

JtP

I believe the judge should leave the dups as valid, just duplicated to the main issue, because:

  1. they are valid findings after all;
  2. invalidating based on whether an issue from the same warden has been included in the group adds op cost for the judge.

In the meantime, awardcalc should be updated to dedup findings from the same warden under the same reportID and not let the warden get a bigger share. https://github.com/code-423n4/awardcalc/blob/main/award.ts#L22-L26
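A minimal sketch of the behavior JtP proposes (the field names `handle` and `reportId` are illustrative, and this is not the actual awardcalc code):

```python
# Collapse multiple submissions by the same warden under the same reportId
# into a single credited slot before award shares are computed.
def dedup_by_warden(findings):
    """findings: list of {'handle': warden, 'reportId': duplicate-group id}."""
    seen = set()
    deduped = []
    for finding in findings:
        key = (finding["handle"], finding["reportId"])
        if key not in seen:
            seen.add(key)
            deduped.append(finding)
    return deduped
```

With this in place, a warden filing the same finding three times under one reportID would receive the same share as a warden who filed it once.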

sock:

JtP: This approach with regard to deduping chained warden submissions tied to the same reportID makes sense to me. Could you file an issue on the awardcalc repo describing the change you're proposing?

JtP:

https://github.com/code-423n4/awardcalc/issues/9 Amazing! An issue for that is already there.

CloudEllie:

except I think what you're describing is the inverse of how it's described in this issue, right? Instead of this:

the desired behavior is to have the award calculated the same as if the duplicates came from different wardens.

I would rewrite as follows:

the desired behavior is that each warden within a set of duplicates receive only one share of the award.

Does that sound right?

JtP:

Oh! that's right!

sock:

so lol maybe that old issue was actually implemented and we've now moved consensus in a different direction? would be good to get a sense from a few judges on that specific proposal

CloudEllie:

Yeah, I suspect that's the case. And agree would be good to get a little more input -- although I'll admit I struggle to think of a case where we would want to award multiple "slices" of the award to the same warden

0xean commented 2 years ago

Thanks for aggregating all of this. If we think about the customer perspective of a sponsor, it is way, way easier to wade through a single issue that highlights all instances of the same vulnerability. It also makes it much cleaner from a traceability perspective for future reviewers who may consider using the smart contract.

IllIllI000 commented 2 years ago

another potentially confounding case: what happens if WardenA finds and submits only one issue out of a bulk of n issues that WardenB has submitted? If it's marked as a duplicate then does that mean WardenA undeservedly gets credit for WardenB's other issues in the bulk, or is it handled some other way?

IllIllI000 commented 2 years ago

revisiting the game theory explanation, if wardens WardenA and WardenB know that WardenC always bulks and usually finds most of the issues, then it's in WardenA's and WardenB's best interest to each submit issues separately rather than in bulk. If WardenC finds everything, everyone gets the same reward due to the de-duplication algorithm @GalloDaSballo outlined. If WardenC misses one, but either WardenA or WardenB find the missing one, they get more reward than if they had submitted all issues as a bulk, since the sybil protection part of the reward calculation penalizes duplicates and rewards distinct issues:

sock:

The maximum money paid out for a single bug is if one and only one person finds it. If two people find it, everyone makes less from that same bug, because the slice for that specific bug decreases and is now split as well.

https://discord.com/channels/810916927919620096/810916927919620099/870723658312732753

GalloDaSballo commented 2 years ago

WardenA -> Bulks: 1, 2, 3, 4 (misses one)
WardenB -> 1, 2, 3, 4, 5
WardenC -> 1, 2, 3
WardenD -> 3, 4, 5

Dedups:

WardenA -> Main issue
WardenB -> Dup of A, invalid, invalid, invalid, invalid
WardenC -> Dup of A, invalid, invalid
WardenD -> Dup of A, invalid, invalid

Because one of them bulked, each only gets one.

In practice:

1) "ERC20 doesn't check for return value and is untrusted input that could block the function" (generic; links to spots 1, 2, and 3)
2) "ERC20 doesn't check in xyz.sol" (dup of 1)
3) "ERC20 doesn't check in abc.sol" (invalid, as a dup of 1 from the same warden)
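The rule in the dedup example above can be sketched mechanically (warden names and labels mirror the example; this is an illustration, not actual judging tooling):

```python
# "Because one of them bulked, each only gets one": the bulk is primary,
# every other warden keeps exactly one duplicate of it, and their remaining
# submissions of the same issue are invalidated.
def label_submissions(primary_warden, submission_counts):
    """submission_counts: {warden: number of submissions of this issue}."""
    labels = {}
    for warden, count in submission_counts.items():
        first = "primary" if warden == primary_warden else "dup of primary"
        labels[warden] = [first] + ["invalid"] * (count - 1)
    return labels

# The example above: A bulked spots 1-4 in one submission (missing 5),
# B filed five separate issues, C three, D three.
labels = label_submissions("A", {"A": 1, "B": 5, "C": 3, "D": 3})
```

Note that D's submission of instance 5, which A's bulk missed, is still invalidated under this rule, which is exactly the point of contention later in the thread.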

I think the "algorithm" is friendly to wardens who are not as accurate, as they get bulked into the bigger issue, and punishes the "spammers", as the copy-paste "didn't check ERC20 return eheh XD" finding can no longer be spammed for extra shares.

From my experience, once you find one of these findings, the sponsor appreciates a bulked report so they get links to fix, but ultimately they are made aware of the issue and will then fix it in all spots naturally.

So in terms of Sponsor experience they don't care. (Talking as sponsor - BadgerDAO) As for wardens, they ultimately all get points. The top warden may be annoyed that we help the least accurate wardens.

However, I feel all of the wardens' time would be better invested in finding a more difficult finding. From my experience it's the unique findings that help, and a finding that is repeated X times in the code has in practice not been as rare a finding (ERC20 return checks, isContract used improperly, DOS opportunities with loops, etc.).

GalloDaSballo commented 2 years ago

Happy to argue over specific examples, and the day multiple critical findings that took wardens hours of work, with fully detailed PoCs including charts, code, and demos, get bulked out and lose the prize they deserve is the day we can truly call this approach into question.

However, I feel this approach applies to low-effort, low-hanging fruit (even certain med-severity findings are; see lack of SafeERC20 as the classic example), so in practice I don't think it makes as big of a difference.

IllIllI000 commented 2 years ago

Marking 5 as invalid for B and D is not following the awarding rules or the guidance of only closing identical issues as duplicates. I think this is a big problem. You're giving different payout amounts based on the submission strategy of a different warden rather than on the merits of the individual warden's submission, which is against the spirit of the contests.

IllIllI000 commented 2 years ago

@GalloDaSballo this is not just hypothetical - I've found an example where high-severity issues were involved and rewards were miscalculated based on your approach to bulks. In this issue, @MrToph identifies five different instances of the issue, with valid attack vectors for each one. @WatchPug finds two of them in this issue and one in this issue. WatchPug's issue 24 was marked as invalid, and WatchPug's issue 23 was marked as a duplicate of cmichel's. cmichel did more work and identified five high-severity issues but was essentially only credited with one. WatchPug found only three issues, but was upgraded to cmichel's reward, when cmichel should have gotten a much larger share. If cmichel had submitted each of the five separately, or my proposal for approaching bulks had been followed, he would have gotten the correct amount.

GalloDaSballo commented 2 years ago

@IllIllI000 Great that you found an actual instance that we can argue over.

At this point let's hear what @MrToph and @WatchPug have to say. Should CMichel have gotten 5 times the points and WatchPug 2 times the points?

jack-the-pug commented 2 years ago

Overall, I'm in favor of merging instead of submitting separate issues, unless the issues stem from different root causes and therefore cannot be fixed using the same method.

Other instances of the issue (caused by the same reason) can usually be identified quite easily, once one of the instances is confirmed.

This applies not only to gas and low issues, but also med and high issues.

That's a bit like when the same issue is reported by many different wardens, one of whom comes with a better write-up and the others not so much. But as long as they are all clear enough, we merge them and, according to current rules, give them equal rewards.

I'm also going to provide an instance here: in this issue, our report was merged into this report.

My gut feeling says that it's not quite right.

Merging an issue with more instances into an issue with fewer instances is no more unfair than merging an issue with a better write-up into worse ones.

As long as the description is clear enough: if they are all essentially the same issue, then they should be merged.

With that being said, I am in favor of rewarding better write-ups (more comprehensive, listing more instances, clear PoC, even a test script). That requires us to change the rules, though.

I propose to give the main issue 15% more rewards than the dups.

MrToph commented 2 years ago

We need to talk about what issues can really be considered "the same issue" first. Then I'm in favor of submitting a single report for all occurrences of the same issue. I would submit one report per (core issue, severity) grouping. Some examples:

Having said that, it's unfortunately often up to interpretation what the underlying core issue is, and it depends on your level of granularity. If you squint hard enough, the single core issue of everything is that the code is incorrect. Attack impact (an issue leads to broken withdrawals in one contract, wrong borrow rate computation, wrong swap amount computation, receiving more rewards than intended, etc.) and attack classes (re-entrancy, replayable signatures, etc.) are probably the right level of detail to use.

Should CMichel have gotten 5 times the points and WatchPug 2 times the points?

I agree with how it was judged, as for me it's the same underlying issue (not including the identity in the hash). It's also fine that WatchPug got the same points even without mentioning the other code occurrences, because imo they correctly identified the core issue.

Because one of them bulked, each only get one.

This only works if we can be 100% sure that the judges always correctly identify the underlying core issue and make sure that all issues that are seen as duplicates / grouped together are indeed "the same". Otherwise it can be very unfair, and I can see many misjudgings coming from such a rule if we are not extremely careful. I am in support of this rule once we have implemented the "second judge having a final look" idea / the inconsistency bounties for wardens idea from the RFC.

My gut feeling says that it's not quite right.

It's not right, because the other issue should have been invalid given that it has no attack description / exploit imo. If they had given a correct attack description, marking it as a duplicate would be fine.

IllIllI000 commented 2 years ago

Thanks @MrToph, @jack-the-pug, and @GalloDaSballo, this has been very helpful. I understand that I was too granular in my identification of issues. I agree with WatchPug that it would be good to reward the main issue, so people have an incentive to find all instances. If all instances aren't found, then wardens can just wait until the next contest and submit the ones they withheld.

kartoonjoy commented 9 months ago

Per the Autumn 2023 C4 Supreme Court verdicts, the Supreme Court's verdict on this issue is:

The findings are duplicates if they share the same root cause. More specifically, if fixing the Root Cause (in a reasonable manner) would cause the finding to no longer be exploitable, then the findings are duplicates.

Given the above, when similar exploits would demonstrate different impacts, the highest, most irreversible impact would be the one used for scoring the finding. Duplicates of the finding will be graded based on the achieved impact relative to the Submission Chosen for Report.

The above applies to duplicates from Bot Races, for which a rational fix has to be assumed.

Link to verdict: https://docs.google.com/document/d/1Y2wJVt0d2URv8Pptmo7JqNd0DuPk_qF9EPJAj3iSQiE/edit#heading=h.xsn8a35g2z5f