feat(`testing`): add mutation testing support

onbjerg commented 2 years ago

Mutation testing is used to design new tests and evaluate the quality of an existing test suite. In other words, it can tell you about the quality of your tests, where coverage only tells you about how much code you have not yet written a test for.

It works by changing the behavior of a program by changing parts of the code ("mutating it"), running the test suite with this new mutation and checking if the test suite passed. If the test suite passed, the mutation is considered to have survived - if not, it is considered to have died.

If dead / total is below some threshold, then there is room for improvement in the test suite. In other words, the more dead mutations you have, the better.

I think this would be a pretty cool tool to have in Foundry - and I think it is somewhat trivial to implement as well (simplified strategy):

Have a list of mutations we can do
Parse the source contracts
Insert mutation somewhere
Run tests
Repeat until we're satisfied

Here's a paper on mutation testing for Solidity that I found interesting: https://arxiv.org/pdf/2105.03626.pdf

onbjerg commented 2 years ago

More notes:

You can identify weakly tested pieces of code: mutations are not killed for these snippets
You can identify weak tests: they never kill mutants

Paper has a handy list of mutations we could implement

gakonst commented 2 years ago

Related DappTools PRs/Issues:

brockelmore commented 2 years ago

potential inspiration: https://github.com/JoranHonig/vertigo the paper: https://sci-hub.ru/downloads/2019-09-21/f6/10.1007@978-3-030-31500-9.pdf#chapter.19

has some good tips in it like knowing what test hits a particular mutation

we could do this by:

running all tests, build up a mapping of op -> Vec
perform mutation
run all tests for a mutated op

Vertigo works on solidity, but we may want to do part solidity, part raw opcodes so we can be faster and not have to recompile for all mutation types

onbjerg commented 2 years ago

alternatively we could sacrifice some disk space by copying the source of the project into cache/tmp/mutation-{1,2,3,n}, performing the mutation and in parallel spawn forge test --src ./cache/tmp/mutation-{1,2,3,n} (you get the idea).

if the command exits with status 0 the tests passed, otherwise they failed. afterwards we clean up by deleting those directories (or keep them around for diffing). if our cache works on hashes we theoretically(?) would only recompile the mutated contract anyway

JoranHonig commented 2 years ago

Yoyoyo, just seeing this thread now.

Not sure if this is possible already but if foundry could include the solc json ast in the compilation artifacts then it'll be relatively straightforward to create a Vertigo adapter for foundry.

alternatively we could sacrifice some disk space by copying the source of the project into cache/tmp/mutation-{1,2,3,n}, performing the mutation and in parallel spawn forge test --src ./cache/tmp/mutation-{1,2,3,n} (you get the idea).

if the command exits with status 0 the tests passed, otherwise they failed. afterwards we clean up by deleting those directories (or keep them around for diffing). if our cache works on hashes we theoretically(?) would only recompile the mutated contract anyway

That is close how Vertigo works right now (except that it doesn't create all of the mutations up front), and how the integration would work 👍

Vertigo works on solidity, but we may want to do part solidity, part raw opcodes so we can be faster and not have to recompile for all mutation types

There are some papers that evaluate this approach. From the top of my head there are two benefits to doing source code mutation:

Trivial Compiler Equivalence (basically free equivalent mutant detection)
Easier manual inspection of mutants (specifically equivalent mutations)

chandrakananandi commented 1 year ago

Hey all,

We have been working on a mutation generator called Gambit. It is implemented in Rust so it might be possible to integrate it with Forge easily.

Gambit is a mutation generator only and it works by traversing the Solidity AST to identify where to perform the mutations, then performs the mutations directly at the source level, and dumps them in a designated directory.

We are actively working on adding more functionality and better mutation generation when possible so let us know if you have any suggestions. PRs are also welcome :)

gakonst commented 1 year ago

This is great @chandrakananandi thank you for sharing!

The AST modification logic makes sense - cc @mattsse we could add that a mutate fn to the Project struct in ethers solc? Which would traverse all the tests and perform a random mutation? And we would run the the tests against the mutation inside of forge mutate?

sambacha commented 1 year ago

There is also Universal Simulator, one of the engineers from Trail of Bits maintains it -

here is some usage: https://github.com/sambacha/universalmutator/tree/new-solidity-rules

chandrakananandi commented 1 year ago

There is also Universal Simulator, one of the engineers from Trail of Bits maintains it -

here is some usage: https://github.com/sambacha/universalmutator/tree/new-solidity-rules

Indeed, it's very cool! The difference between Gambit and the universalmutator is that they are regex based whereas Gambit is guided by the AST and also runs the compiler to filter out any invalid mutants. On the other hand, universalmutator supports many more languages since their technqiue relies on regexes and they don't need any additional AST information! There is a cool paper on it too.

gakonst commented 1 year ago

Would be supportive of a draft PR or some kind of RFC proposing exactly how we'd integrate the two together, if you'd want to give it a shot @chandrakananandi!

mds1 commented 1 year ago

Hey @chandrakananandi, congrats on the announcement today!

Figured this might be a good time to revisit what the UX for what an MVP forge integration might look like. Previously we came up something like this:

forge test --mutate first runs all regular tests and makes sure they pass
- forge test --mutate skip-regular could skip regular tests for quicker iteration during development
If tests all pass, forge passes the remappings / config info to gambit, runs gambit against all src/ contracts, and saves mutation to a temp dir
For each mutation file in the tmp dir:
- Apply the change then run all tests
  - Maybe run tests with a hardcoded small amount of fuzz runs for now to prevent runtime from blowing up, and skip invariant tests? In general mutating solidity and having to recompile feels like it might be slow anyway, perhaps better suited as a CI-only type of integration
- We expect at least one test to fail for each applied mutation, so if a test failure happens the run is considered successful, if all tests still pass it's marked failed
  - Question here: What's the best way to handle benign mutations that are equivalent to the original?
For each failed run, print the mutation to the console, then delete the tmp dir
Like usual, exit with an error if tests fail (i.e. mutations were not detected by test suite)
None of gambit's config options are exposed to user, for simplicity

Follow up versions could:

Respect the [fuzz] seed like invariant tests do
Generate/link to the certora UI for visualizing results
Integrate better failure persistence (https://github.com/foundry-rs/foundry/issues/2551 and https://github.com/foundry-rs/foundry/issues/2552)
Expose gambit config options to user, perhaps via a [mutation] section the foundry.toml file so you can have different profile configs, like you can with [fuzz] and [invariant] sections

chandrakananandi commented 1 year ago

Hey @chandrakananandi, congrats on the announcement today!

Figured this might be a good time to revisit what the UX for what an MVP forge integration might look like. Previously we came up something like this:

* `forge test --mutate` first runs all regular tests and makes sure they pass

  * `forge test --mutate skip-regular` could skip regular tests for quicker iteration during development

* If tests all pass, forge passes the remappings / config info to gambit, runs gambit against all `src/` contracts, and saves mutation to a temp dir

* For each mutation file in the tmp dir:

  * Apply the change then run all tests

    * Maybe run tests with a hardcoded small amount of fuzz runs for now to prevent runtime from blowing up, and skip invariant tests? In general mutating solidity and having to recompile feels like it might be slow anyway, perhaps better suited as a CI-only type of integration
  * We expect at least one test to fail for each applied mutation, so if a test failure happens the run is considered successful, if all tests still pass it's marked failed

    * Question here: What's the best way to handle benign mutations that are equivalent to the original?

* For each failed run, print the mutation to the console, then delete the tmp dir

* Like usual, exit with an error if tests fail (i.e. mutations were not detected by test suite)

* None of gambit's config options are exposed to user, for simplicity

Follow up versions could:

* Respect the `[fuzz]` seed like invariant tests do

* Generate/link to the certora UI for visualizing results

* Integrate better failure persistence ([feat: fuzz failure persistence #2551](https://github.com/foundry-rs/foundry/issues/2551) and [feat: fuzz corpus saving and replay #2552](https://github.com/foundry-rs/foundry/issues/2552))

* Expose gambit config options to user, perhaps via a `[mutation]` section the foundry.toml file so you can have different profile configs, like you can with `[fuzz]` and `[invariant]` sections

Hi Matt @mds1!

Thanks for writing this out. The flow makes sense! One quick note: at the moment, Gambit actually already compiles the mutants to ensure that they are all valid. So the tmp directory you mentioned will only have mutants that compile. It is definitely slower but since we have been using the mutants in the context of verification so far, it's not a bottleneck yet. But in any case, we do have ongoing work right now as part of which are trying to use type information from the AST to generate mutants that are more likely valid.

As for the benign / equivalent mutant question: this is a very interesting problem and full on equivalence checking for mutants is yet another undecidable verification task. So our current thinking is that we should avoid generating benign mutants in the first place. For now, Gambit allows the user to specify what functions and contracts to mutate which is an easy way to avoid some useless mutants because hopefully the user has knowledge about the codebase we can leverage. We are actively working on other light weight strategies as well. There is also TCE (Trivial Compiler Equivalence) which people have used for this. The Certora Prover itself can be used here too but like I said, that's yet another verification task on its own.

For now, I am most interested in gathering data (lots of mutants, the result of running tests on them, some potentially manual analysis of mutants to understand how often we get equivalent mutants) and guiding our strategy for equivalence detection based on that. For the scale of data we will need for this, I think integrating with a testing framework like foundry would be really great!

Is there any dataset of solidity contracts together with foundry tests that you could point me to?

mds1 commented 1 year ago

at the moment, Gambit actually already compiles the mutants to ensure that they are all valid. So the tmp directory you mentioned will only have mutants that compile.

Ah interesting, wonder if there should be an option to disable that, since otherwise we'll be compiling twice (once for gambit to verify, then the forge compilation)? Then instead forge would need to discard a mutation that doesn't compile

Is there any dataset of solidity contracts together with foundry tests that you could point me to?

Are you just looking for repos that use foundry for building/testing? If so, the forge integration tests run against some external repos which could be a good starting point (some of those repos use dapptools, which foundry should be backwards compatible with). Then some other repos that use foundry include seaport, maple, alchemix, uniswap universal router, optimism's bedrock contracts, and a bunch of my own repos including multicall, solidity-generators, solidity-trigonometry, umbra. There's also a bunch in the awesome-foundry repo. But let me know if I misunderstood! 😅

chandrakananandi commented 1 year ago

Ah interesting, wonder if there should be an option to disable that, since otherwise we'll be compiling twice (once for gambit to verify, then the forge compilation)? Then instead forge would need to discard a mutation that doesn't compile

Yes this is definitely possible and in fact something we might do in an upcoming PR.

Are you just looking for repos that use foundry for building/testing? If so, the forge integration tests run against some external repos which could be a good starting point (some of those repos use dapptools, which foundry should be backwards compatible with). Then some other repos that use foundry include seaport, maple, alchemix, uniswap universal router, optimism's bedrock contracts, and a bunch of my own repos including multicall, solidity-generators, solidity-trigonometry, umbra. There's also a bunch in the awesome-foundry repo. But let me know if I misunderstood! 😅

This is wonderful, and exactly what I was looking for. I wanted to gather some benchmarks so that we can see how Gambit helps them. Thanks very much!

BenTheKush commented 1 year ago

Hey @mds1, thought I'd jump in! I'm currently working on Gambit PR that I'm hoping to land some time next week/the following Monday. After that I can look at integrating w/ Foundry.

WRT equivalent mutants: these are just hard! There are a few ways to handle them:

don't generate equivalent mutants begin with! This can be done a few ways:
- mutation operator selection: For instance, don't mutate a to a++ because if a is going out of scope the infected state cause by the post increment will never propagate to a test-visible location
- suppression rules: there are common patterns of equivalent mutants (e.g., int i = 0; if (cond) { i = 1; } else {i = 2;} where mutating the initialization value 0 will result in an unreadable write). These patterns depend on the language, coding styles within the language, and a number of other factors. I'm pretty new to solidity (I've mainly been researching in Java land), so I don't know what the best suppression rules are just yet!
detect equivalent mutants with an optimizing compiler: trivial compiler equivalence is a technique that just runs optimizing compilers on all the mutants and checks them for equality with the original program. Optimizing compilers tend to normalize or canonicalize programs, so that syntactically distinct programs with the same semantics often end up being written as the same binary. Depending on workload/if you are already compiling this can be feasible, but it blows up at scale.

There is also the problem of mutant redundancy. For instance, consider the boolean condition c and the following three mutants:

`c`	`!c`	`true`	`false`
`true`	`false` (X)	`true` (X)	`false`
`false`	`true` (X)	`true` (X)	`false`

I've marked all cells where the mutant differs from the original value with an (X). Note that !c is always different from c: this means that any test that can kill the 2nd mutant c -> true will also kill the first mutant c -> !c; however, the reverse is not true. This makes the second mutant strictly harder to kill than the first mutant, we say that true dominates !c, or that !c is redundant to the original program c.

Redundancy is bad: the more redundant the mutant is (that is, the easier it is to kill), the less utility it has for things like mutation score (killed/total) or as a testing goal for a developer.

As a trivial example, if a mutant always causes a crash it is very easy to detect: all tests that execute this mutant kill it! This is called a trivial mutant.

So avoiding equivalent mutants is really important, but trying to minimize redundancies is also critical.

All this to say, I'm hoping to improve Gambit's mutation routine to avoid many common patterns of equivalent mutants and to try to start reasoning about common mutant redundancies.

This brings me to Foundry! If we are able to integrate Gambit with Foundry we can use test data to compute the kill matrix (pretty badass, no?). This is just a binary matrix of "does test T kill mutant M?". We can use this data to approximate redundancy information and help to detect equivalent mutants. This is of course very expensive to compute, so I'd be using this to inspect manually to improve Gambit for future uses.

Anyway, that's my brain dump...excited to chat more once I land this PR!

BenTheKush commented 1 year ago

Hey all, poking around a bit in forge test related code to see how easy it would be to add a --mutate arg. Here is a quick overview of my understanding of how this should work, as well as a few questions that maybe I can get some guidance on.

Run Tests On Project: Ensure we have a passing test suite before mutation; abort otherwise.
Mutate Code: Apply mutation operators, resulting a mutated file set.
- Question: In my opinion mutation should be configurable (e.g., which files/contracts/functions to mutate, which mutation operators to apply). Using forge test --mutate makes this a little weird because the test subcommand now needs conf arguments that only apply in the presence of the --mutate option (e.g., --mutation-operators ...). Is it better to create a new forge mutation-test subcommand to encapsulate this functionality?
Run Tests On Mutants: The mutants are AST mutants, and are written to disk as source files. We will need to swap the mutated source in for the project source for each mutant, run testing infra, and then restore the original file for the next mutation run. This means we probably just want to copy the entire project structure to a temp directory and clean it up afterwards (this seems like the approach folks are suggesting above). And as is suggested, we could even make a few copies of the project to run in parallel:
- Question: Is it sufficient to recursively copy the directory tree rooted at project.paths.root to copy a project and point a new Project struct at it?
Report Mutation Run: There are various statistics we can report. The basic one is mutation score (detected mutants / generated mutants). The detection matrix is also useful for research purposes, and is just a T x M boolean matrix representation of the predicate test t detected mutant m

Test Coverage Data

Running tests is expensive, so avoiding test runs that don't ever touch mutated code is desirable. To do this I'd want coverage information.

Is test coverage data collected during execute_tests()?
If not, how hard would this be to gather?
Can we get per test coverage information?

I'm not sure if it's worth using test coverage in the initial PR, maybe I'll just get a proof of concept going. But I think incorporating coverage will be worth it in the medium to long term.

mds1 commented 1 year ago

Hey @BenTheKush! Here's some answers to your questions:

2: I think we might already have some conditional args like that and I believe clap supports that, so it should be ok. Deferring to @Evalir @mattsse to confirm. Note that we'll also need to support those options via the CLI and via the config file, I don't think they currently sync automatically (i.e. options need to be added in two places)

3: One downside here is that copying the whole directory might get costly (i.e. slow) because some deps can be quite large with a lot of unrelated files (e.g. when installing chainlink as a dep for their interfaces, you get tons of non-solidity code from the rest of the system). Since I don't think we need to generate mutants for deps, we might want to try only copying the src dir (i.e. set of files that may be mutated) and symlink to the rest. Not sure offhand if this will cause issues with remappings though, another @mattsse or @Evalir question. But otherwise yea I think that should be sufficient

Test coverage: Coverage data is only generated when running forge coverage, not when running forge test. Note that it also has some limitations right now (notably, requires no optimizer and no via-ir) and some bugs (can search the repo issues to learn more here), so ideally for now we'd leave mutant improvements due to coverage as an optional feature

BenTheKush commented 1 year ago

Thanks @mds1, yeah this all makes sense. Having conditional args would be perfect. And I agree, copying an entire directory over for mutation is not ideal. I might start there for a proof of concept/MVP and then see if there is something clever we can do to not have to copy everything. There's probably some clever manipulation of the Project struct I can do to get away with some stuff?

I'm not super familiar with the Solidity ecosystem so not sure what the stumbling blocks are...but I guess this is a great way to learn :)

WRT Coverage: happy to make it conditional! And again, I probably won't be implementing any optimizations in the first pass.

I'm going to do some requirements engineering this week and try to come up with a design. I think a lot of the changes will be happening on the Gambit side anyway, and I'll set up an API that forge test --mutate can call into.

MerlinEgalite commented 11 months ago

Hi there! I think mutation testing would be very useful to enhance testing suites. Did someone make progress on this topic since May?

chandrakananandi commented 11 months ago

@MerlinEgalite we had started investigating gambit + foundry but had other things come up. We have plans to resume this in early 2024. Will keep you posted.