code-423n4 / org

Code4rena Governance and Discussion

Standardize acceptance of reports based on automated findings #81

Open neumoxx opened 1 year ago

neumoxx commented 1 year ago

It is (still) unclear to most wardens and some judges which automatic findings can be reported in contests. Quoting this discussion:

Wardens may choose to use c4udit and other automated tools as a first pass, and are welcome to build on these findings by identifying high and medium severity issues. However, submissions based on these will have a higher burden of proof for demonstrating to sponsors a relevant hm exploit path in order to be considered satisfactory.

My understanding is that you cannot report these automatic findings without reasoning about how they are (in your opinion) a MED or a HIGH, and proving it. Reporting "There is no check that the treasury address is not zero" is not the same as reasoning about how this could lead, for instance, to some NFTs being stuck in the contract.
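
To illustrate the gap, here is a minimal, hypothetical sketch (the `FeeVault` contract and its `sweepNFT` function are invented for this example, not taken from any contest): the automated tool only notes the missing zero-address check, while the escalated report explains why that can strand NFTs, for example because standard ERC721 implementations revert on transfers to address(0) and there is no way to update the treasury.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Minimal ERC721 interface, only what the example needs.
interface IERC721 {
    function transferFrom(address from, address to, uint256 tokenId) external;
}

contract FeeVault {
    address public treasury;

    // Automated finding (Low): "treasury is not checked against address(0)".
    constructor(address _treasury) {
        treasury = _treasury; // no zero-address check, and no setter elsewhere
    }

    // Escalated impact (the reasoning the warden has to supply): if the vault
    // was deployed with treasury == address(0), standard ERC721 tokens revert
    // on transfers to the zero address, so this call always fails and any NFT
    // held by the vault is stuck forever.
    function sweepNFT(IERC721 nft, uint256 tokenId) external {
        nft.transferFrom(address(this), treasury, tokenId);
    }
}
```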

This report of mine was recently marked as unsatisfactory, and I think that if this one is unsatisfactory, then all findings built on automated findings should be rejected, because ultimately they are mitigated the same way (i.e. requiring an address to be different from zero, using safeTransfer, etc.) regardless of how elaborate the attack vector is.

But the ultimate goal here is to provide value to the sponsor. If automated findings are all rejected, the sponsor may not mitigate them because they are not aware of the impact those findings could have, and this is bad for both the sponsor and code4rena.

I'd like to see this point standardized, so it is clear to all parties (wardens, judges and sponsors) when a report based on an automatic finding is valid and when it is unsatisfactory.

IllIllI000 commented 1 year ago

I believe the original purpose was to cut down on low-effort submissions, so I would think that as long as the correct instances were identified and explained completely, it would be rewarded

kirk-baird commented 1 year ago

I think that if a submission is an exact copy of an automated tool's output then it should be deemed Unsatisfactory. The reason for this is that a high number of automated tool outputs are submitted to each contest, most of which are invalid. We want to reduce spam output for both the judge and the sponsor.

Automated tools should be used by wardens to help discover where bugs exist. If a tool discovers a bug, then it is on the warden to describe it in more detail and analyse its impact with respect to the protocol in question when writing a submission.

Picodes commented 1 year ago

In my opinion (and this explains why I rejected your submission @neumoxx), the only way for a finding disclosed publicly in the automated report to be accepted is if the submission clearly demonstrates that there is more to it than the automatic finding itself.

For instance, in your case the automatic finding is "Unsafe ERC20 operation", which means there is a risk that the operation does not succeed when performed with unconventional ERC20 tokens, and the finding you submitted explains this.

It does so far better than the automatic finding, as it places the issue in the context of the contest, that is true, but it adds nothing to what the tool found, so I don't think it should be accepted.
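
For readers less familiar with this finding class, here is a minimal, hypothetical sketch of the pattern such tools flag (the `PayoutExample` contract is invented, not the code from the contest): the boolean returned by `transfer()` is ignored, so a token that returns false instead of reverting is treated as if it had paid out; wrapping the call with OpenZeppelin's SafeERC20 is the usual mitigation.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import {IERC20} from "@openzeppelin/contracts/token/ERC20/IERC20.sol";
import {SafeERC20} from "@openzeppelin/contracts/token/ERC20/utils/SafeERC20.sol";

contract PayoutExample {
    using SafeERC20 for IERC20;

    // What the tool flags as "Unsafe ERC20 operation": the bool returned by
    // transfer() is ignored, so a token that signals failure by returning
    // false (instead of reverting) is treated as if it had paid out.
    function unsafePayout(IERC20 token, address to, uint256 amount) external {
        token.transfer(to, amount);
    }

    // The usual mitigation: SafeERC20 reverts when the token returns false,
    // and also copes with tokens that return no data at all (USDT-style).
    function safePayout(IERC20 token, address to, uint256 amount) external {
        token.safeTransfer(to, amount);
    }
}
```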

Picodes commented 1 year ago

Also, we're discussing different things here:

IllIllI000 commented 1 year ago

The tool flagged them as low-severity and also did not explain what specifically was unsafe and under what conditions. As is described in the original discussion post, the warden built on it to show a Medium rather than a Low, so they followed the directions. Perhaps @sockdrawermoney could weigh in on the original intention and the rules around this? There are other cases where the tool has false positives, so it can't be relied on to identify every issue, and wardens have to put in some work to filter things

neumoxx commented 1 year ago

@Picodes I know your reasons for rejecting my submission, and I'm OK with it as long as the rules are clear to everybody. As you stated, the README says that any submission referencing the automatic findings is not eligible, but at the same time the org discussion says what I quoted in my first message. As @IllIllI000 pointed out, the issue is flagged as Low in the automatic findings. This is why I asked in Discord in the first place: it was not clear to me whether I should report this, because if the sponsor thinks it's a Low, maybe they do nothing about it and end up missing a more serious impact.

In the end I think we need a more uniform opinion on this matter, so that wardens and judges don't waste time, and so that sponsors are warned about the further impact automatic findings can have (in case they are all ineligible for rewards).

Picodes commented 1 year ago

@neumoxx yes I was just repeating my reasons for the public debate, I am totally open to changing my decision if you guys feel it's the thing to do :)

@IllIllI000 that's more an issue with the tool. To me it's clear that "Unsafe ERC20 operation" means there is a possibility the operation is not successful, but I do agree that we could add an explanation in the tool, and that the issue could be labeled as High directly within the tool.

Then, to answer your comments @IllIllI000 ("I believe the original purpose was to cut down on low-effort submissions, so I would think that as long as the correct instances were identified and explained completely, it would be rewarded" + "There are other cases where the tool has false positives, so it can't be relied on to identify every issue, and wardens have to put in some work to filter things"): to me you are referring to the global policy around using tools. But regarding the public report pinned by C4 in the readme, my understanding is that its purpose was to definitively ban these findings from the contest, to remove the burden of "easily catchable low-hanging fruits". So of course there are false positives in the report, but filtering out these false positives isn't eligible for rewards.

neumoxx commented 11 months ago

Has this been standardized by C4 since I posted this in January? Or does it still depend on who judges a contest? I'm asking because I have a quite impactful High finding in a contest, which appears in the automated findings. But there it's only labelled as a Low, because the automated findings only raise it as a recommendation. I can write a report detailing a path that could imply loss of funds for users, but I don't know if it's worth it. In the end, what concerns me is that, since it's labelled as Low, maybe the sponsor does not fix it, and this could mean big trouble for them in the future.

neumoxx commented 11 months ago

Anybody? It would be great to have someone from C4 or some judge answer this. Thx. Btw, the issue is not that impactful, not a High; it's a Medium and I'll probably report it anyway, but it would be great to read your opinions.

MiloTruck commented 10 months ago

@neumoxx there's now a section in the docs about this issue:

Wardens may use automated tools as a first pass, and build on these findings to identify High and Medium severity issues ("HM issues"). However, submissions based on automated tools will have a higher burden of proof for demonstrating to sponsors a relevant HM exploit path in order to be considered satisfactory.

An example of an issue based on the automated findings being accepted recently:

I think the general consensus is that if you build upon the automated finding to demonstrate a more severe impact, it is considered in-scope and will be judged like any other finding.

neumoxx commented 10 months ago

Hi @MiloTruck, and thanks for responding. Your quote from the docs is almost the same as the one I pasted in my original comment in this issue; it has not changed much since January. In this thread there were some judges in favour of accepting these issues and some against, which is why I asked whether opinion (from the judges' perspective) is more uniform now.

GalloDaSballo commented 10 months ago

The discussion is very delicate and I look forward to discussing this deeply

If you have examples of edge cases, please link them here.

My current "guiding principle" is the idea that multiple attacks can be chained, chaining attacks creates a new vulnerability, so the question is "how much" does the known finding cause the issue to be possible or not

The discussion is nuanced, as we have to specifically decide whether a known finding should be considered fixed or not, and whether the "common fix" would also patch the new issue that chains off it.

neumoxx commented 10 months ago

Not really an edge case, but the issue that made me start this thread, https://github.com/code-423n4/2022-12-prepo-findings/issues/329, exemplifies my concern. Unsafe ERC20 operations are usually marked as Low in the automatic/bot findings. I think that if wardens find a situation/attack path that leads to a more severe impact, it should be accepted. Otherwise, imagine nobody reported the issue above and the sponsor decided not to mitigate it because it's a low-impact issue; this could lead to real trouble for the project.

I think we should not mix submissions such as "transferFrom should not be used because the return value of the call is not checked" with submissions such as "Using transferFrom can lead to the total draining of a contract's collateral"; they're not the same.
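
As a hedged, hypothetical illustration of that second framing (a simplified `CollateralVault`, not the actual prePO code): against a token that returns false on failure instead of reverting, `deposit()` credits the caller without pulling any tokens in, and that phantom balance can later be withdrawn against collateral that honest users really deposited.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import {IERC20} from "@openzeppelin/contracts/token/ERC20/IERC20.sol";

// Hypothetical collateral vault, for illustration only.
contract CollateralVault {
    IERC20 public immutable collateral;
    mapping(address => uint256) public balances;

    constructor(IERC20 _collateral) {
        collateral = _collateral;
    }

    // Low framing: "the return value of transferFrom is not checked".
    // Escalated framing: with a collateral token that returns false on
    // failure instead of reverting, this call "succeeds" without moving any
    // tokens, yet the caller is still credited on the next line.
    function deposit(uint256 amount) external {
        collateral.transferFrom(msg.sender, address(this), amount);
        balances[msg.sender] += amount;
    }

    // The phantom balance is then redeemable against tokens that honest
    // depositors really transferred in, draining the vault.
    function withdraw(uint256 amount) external {
        balances[msg.sender] -= amount;
        collateral.transfer(msg.sender, amount);
    }
}
```

Whether spelling out that escalation adds enough beyond the bot's "use safeTransferFrom" note is exactly the judgment call being debated in this thread.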

In the end I think C4's judges (and the community in general) should have a uniform view on the validity of this kind of submission.

MiloTruck commented 10 months ago

I think there are two schools of thought for this.

1. If a warden reports an automated finding but escalates it to a higher severity, it should be valid.

Some examples of this so far would be:

The main argument for this is that if a finding is flagged as a Low, the sponsor might choose not to fix it, since they aren't made aware of the actual impact behind the bug.

Since bots can't "contextualize" their findings, it is nearly impossible for them to accurately identify the severity of the bugs they report, which results in many potential Highs/Mediums being reported as Lows.

Additionally, with how bot races are currently judged, automated reports just seem to keep growing in their number of findings, with a lot of them being either low-signal (not very impactful) or false positives. This creates a genuine concern that the sponsor might simply miss a potentially serious bug buried under all the low-signal ones.

My thoughts

The problem with allowing this is that it defeats the purpose of having automated findings in the first place, since wardens will just end up looking through bot reports and re-reporting what has already been found.

For example, unsafe ERC20-related findings are usually reported as low, but most of the time if you think hard enough there will probably be a medium impact somewhere. I really doubt we want to encourage wardens to be re-raising these findings in the actual contest itself.

2. If a warden "chains" an automated finding with another bug, it should be valid.

What separates this from 1 is essentially the root cause: the warden has to "combine" an automated finding with another bug to create a greater impact.

My thoughts

I actually haven't seen this occur in a contest; it would be great if someone has an actual example...

Also, my opinion is that this should be allowed, since the warden is providing new "value" to the sponsor by identifying a new bug, even if it only has impact when chained with an automated finding.

I also don't think the automated finding should be considered fixed. Wardens find bugs and think of attack vectors based on the current version of the codebase; I don't see why judging should be any different. If the automated finding is considered fixed, you would essentially be judging findings based on a future version of the codebase.

neumoxx commented 10 months ago

In my opinion, both 1 and 2 should be accepted if reported.

The problem with allowing this is that it defeats the purpose of having automated findings in the first place, since wardens will just end up looking through bot reports and re-reporting what has already been found.

I have to disagree here. I participated in contests where, for the codebase in scope and its constraints (e.g. only certain ERC20 collateral accepted), not using safeTransfer did not pose a threat to the protocol. In those cases, the bot report is fine. It's a Low: you should use safeTransfer as a good practice, but nothing happens if you don't. But in other cases, if wardens find ways in which not using safeTransfer raises the impact to MED/HIGH, I think it is good for the protocol to have them reported, even if it adds a lot of submissions.

CloudEllie commented 9 months ago

During the Autumn 2023 C4 Supreme Court session, the Supreme Court issued a set of advisory recommendations, which can be viewed here: https://docs.google.com/document/d/1Y2wJVt0d2URv8Pptmo7JqNd0DuPk_qF9EPJAj3iSQiE/edit#heading=h.vljs3xbfn8cj

IllIllI000 commented 9 months ago

However, the side effect of Bot Races is that the Sponsor has no agency in determining fixes for those findings, making it so that for some duplicates, raising severity would be useful, while other issues would never surface.

I'm having a hard time interpreting the full meaning of this new sentence. How does the sponsor have no agency in determining fixes for bot races? What does it mean that issues would never surface? Is this a reference to the raft scenario in the court pjqa discussion, or something else? The org issue is closed, but it's not clear what the final outcome is, or what the ultimate position is for 'consultative' decisions. Is the court essentially saying that it's up to each individual contest judge to decide, in cases where the severity differs from the bot race severity?

CloudEllie commented 9 months ago

Re-opening this per input from @GalloDaSballo and other judges