crytic / echidna

Ethereum smart contract fuzzer
https://secure-contracts.com/program-analysis/echidna/index.html
GNU Affero General Public License v3.0

Improve echidna heuristics documentation #980

Open aviggiano opened 1 year ago

aviggiano commented 1 year ago

I am developing fuzzy.fyi, a project that helps execute long echidna runs on AWS, and I am testing it on an ERC4626 vault from Pods. The idea is to test smart contracts in the cloud without compromising the developer's workflow with resource-intensive fuzzing campaigns.

While working on this tool, we stumbled upon some parameter choices that influence the fuzzer's performance but don't seem to be well documented. It seems that some of these parameters are "rules of thumb"/"heuristics", so I would like to ask about them here:

  1. What are generally "good" testLimit and seqLen values? What values does Trail of Bits usually use during its audits? How long should a "good" run last (hours, days, weeks)?
  2. Assuming Trail of Bits performs long runs on the cloud, what is generally the best AWS instance for echidna? Another way to ask: is echidna constrained by CPU or RAM? Should I go after a CPU-optimized instance (such as c5) or a memory-optimized instance (such as r5)?
  3. Are there any benchmarks of echidna's performance vs. hardware specifications? For example, if we pick a 2xlarge instance, should we expect a campaign to take half the time it would on an xlarge instance?
  4. What does the choice of testLimit and seqLen depend on? Meaning: when should you increase one or the other? How can we calculate the fuzzer's "performance" (meaning, its probability of finding bugs), assuming the choice of these variables has an impact on it?
  5. When should we reuse the corpus? Does it make sense to reuse the corpus if the contract interface changes? Does it make sense to reuse the corpus across different pull requests of the same project? Does it make sense to reuse the corpus from a different project?
  6. Is it better to test 10 runs with testLimit 100k and corpus enabled or test 1 run with testLimit 1M?
  7. Sometimes, long runs (testLimit 1M) are terminated by the OOM killer after many hours. What is the recommendation when that happens? I think getting a bigger instance would just hide the problem.
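For reference, the parameters discussed above live in echidna's YAML config file. A minimal sketch, with illustrative values only (the contract path and values are not recommendations):

```yaml
# echidna.yaml -- illustrative values, not recommendations
testLimit: 100000   # number of transactions to attempt before stopping
seqLen: 100         # max transactions per call sequence
corpusDir: corpus   # directory where the coverage corpus is saved/reused
```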
ggrieco-tob commented 1 year ago

Hi @aviggiano, thanks a lot for creating a new issue.

Sorry for the delay in getting you some answers; let's take a look at your list of questions:

  1. Usually we run our fuzzing campaigns for days or weeks, but depending on the complexity of your code, a few hours could be enough to get decent results. We generally use the default seqLen or increase it up to 300 (but expect to see an increase in memory usage!).

  2. Echidna used to be constrained by memory on large contracts, but we performed some optimizations, so it should now consume a lot less memory (the changes are still unreleased). In that sense, echidna is limited by the CPU, so you can run different instances in parallel: check echidna-parade, or wait until we merge multicore support.

  3. We don't have anything like that, I'm afraid.

  4. testLimit only limits the number of transactions the fuzzer will try, while seqLen should have a low impact on the probability of finding issues, except in extreme cases (e.g., the value is very small, or there is specific code that penalizes running certain transactions). Apart from that, what you ask regarding "fuzzer performance" as "probability to find bugs" is still an open research question, so it is not easy to answer :sweat_smile:. The most effective approach is to shuffle parameters, as echidna-parade does using swarm testing.

  5. A. Does it make sense to reuse the corpus if the contract interface changes? If you are adding new functions, you can reuse it; if you remove or modify a good number of them, you should start with a new corpus. B. Does it make sense to reuse the corpus across different pull requests of the same project? It depends on the PR; most likely yes, unless there are some radical changes. C. Does it make sense to reuse the corpus from a different project? Most likely no, unless you are testing a specific set of properties (e.g., ERC20). Check crytic-properties for that.

  6. It should be mostly the same, unless you need incremental results.

  7. Please test the latest echidna version from master (or wait for the upcoming release) since we fixed a lot of memory leaks.
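The echidna-parade approach mentioned in point 4 might look something like the sketch below. The flag names are from echidna-parade's documentation as I recall them, and the contract path and name are placeholders; verify against `echidna-parade --help` for your version:

```shell
# Run a swarm of echidna instances with randomly varied configs
# (swarm testing). All names below are illustrative.
pip install echidna-parade
echidna-parade contracts/Vault.sol \
  --contract VaultTest \
  --config echidna.yaml \
  --ncores 4 \
  --timeout 3600
```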

aviggiano commented 1 year ago

@ggrieco-tob Thanks for the thorough response.

I can help out with number 3, if you think it's a good idea.

Since I have already developed Terraform templates that create instances of different sizes and measure CPU, memory, and elapsed time, it would be fairly easy to spawn a bunch of machines and record the results. The only problem is that I don't know which projects/configs to use. I could take some well-known DeFi projects that already use echidna, such as Compound and Uniswap, but if you have other recommendations, that would be great.
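For the elapsed-time and memory measurements, something along these lines could work on a Linux instance (the contract path and config file name are placeholders; GNU time, not the shell builtin, reports peak memory with -v):

```shell
# Measure wall-clock time and peak RSS of a single echidna campaign.
# -o writes GNU time's report to a file instead of stderr.
/usr/bin/time -v -o metrics.txt \
  echidna contracts/Vault.sol --contract VaultTest --config echidna.yaml
grep -E "Elapsed|Maximum resident" metrics.txt
```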

aviggiano commented 1 year ago

Hi

Regarding the performance benchmark, I ran some tests with Uniswap's V3 core contracts (testLimit equal to 100000) and was able to reach some interesting conclusions.

From what I was able to find out, until #963 gets merged, choosing a bigger instance with more cores does not pay off: it doubles the cost but improves the test speed only slightly (see c5.large vs. c5.xlarge).

[Screenshot, 2023-04-17: benchmark results across instance types]

In fact, for this test, it seems like the cheaper the instance, the better. A t3.micro instance is only slightly slower than a c5.large, but much more cost-effective.

I will re-run these tests once multicore is available, and expand the benchmark to other projects in order to get a more comprehensive dataset than the current one.

ggrieco-tob commented 1 year ago

@aviggiano can you re-run these experiments using #963? We want to merge it soon, and we want to make sure it is solid.

aviggiano commented 1 year ago

@ggrieco-tob great news!

It seems like #963 really provides a significant boost in performance and cost-effectiveness:

[Chart: performance comparison with #963]

For the sake of simplicity, I ran a single test (the longest one, TickBitmapEchidnaTest) with a single test configuration (testLimit equal to 100k). The dataset is here.
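For anyone reproducing this once #963 lands: my understanding is that the multicore PR adds a workers option, so the run would look roughly like the sketch below (the flag name and the contract path inside Uniswap's v3-core are my assumptions; check the release notes for the final interface):

```shell
# Run echidna with multiple fuzzing workers sharing one corpus.
# --workers is the option I understand #963 introduces; verify
# with `echidna --help` on a build that includes the PR.
echidna contracts/test/TickBitmapEchidnaTest.sol \
  --contract TickBitmapEchidnaTest \
  --test-limit 100000 \
  --workers 4 \
  --corpus-dir corpus
```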

I will try to do the same experiment in the future with other codebases and other configurations.