onbjerg opened this issue 2 years ago
More notes:

- Potential inspiration: https://github.com/JoranHonig/vertigo; the paper: https://sci-hub.ru/downloads/2019-09-21/f6/10.1007@978-3-030-31500-9.pdf#chapter.19
  - The paper has a handy list of mutations we could implement
  - It also has some good tips in it, like knowing what test hits a particular mutation; we could do this by:
- Related DappTools PRs/Issues:
Vertigo works on Solidity, but we may want to do part Solidity, part raw opcodes so we can be faster and not have to recompile for all mutation types.

Alternatively, we could sacrifice some disk space by copying the source of the project into `cache/tmp/mutation-{1,2,3,n}`, performing the mutation, and in parallel spawning `forge test --src ./cache/tmp/mutation-{1,2,3,n}` (you get the idea). If the command exits with status 0 the tests passed, otherwise they failed. Afterwards we clean up by deleting those directories (or keep them around for diffing). If our cache works on hashes, we theoretically(?) would only recompile the mutated contract anyway.
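For illustration, here is a minimal Rust sketch of that loop, assuming mutated copies of the project already exist under `cache/tmp/mutation-{n}`; running `forge test` from inside each copy stands in for the hypothetical `--src` flag, and the killed/survived bookkeeping is just for the example:

```rust
use std::fs;
use std::process::Command;
use std::thread;

/// Run `forge test` against each mutated project copy in parallel and treat a
/// non-zero exit status as "mutant killed" (i.e. the suite caught the change).
fn run_mutants(mutant_dirs: Vec<String>) -> Vec<(String, bool)> {
    let handles: Vec<_> = mutant_dirs
        .into_iter()
        .map(|dir| {
            thread::spawn(move || {
                let status = Command::new("forge")
                    .arg("test")
                    .current_dir(&dir) // run the suite inside the mutated copy
                    .status()
                    .expect("failed to spawn forge");
                // Exit status 0 means every test passed, so the mutant survived.
                let killed = !status.success();
                (dir, killed)
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let dirs: Vec<String> = (1..=3)
        .map(|n| format!("cache/tmp/mutation-{n}"))
        .collect();
    for (dir, killed) in run_mutants(dirs) {
        println!("{dir}: {}", if killed { "killed" } else { "survived" });
        // Clean up the copy afterwards (or keep it around for diffing).
        let _ = fs::remove_dir_all(&dir);
    }
}
```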
Yoyoyo, just seeing this thread now.
Not sure if this is possible already but if foundry could include the solc json ast in the compilation artifacts then it'll be relatively straightforward to create a Vertigo adapter for foundry.
> Alternatively we could sacrifice some disk space by copying the source of the project into `cache/tmp/mutation-{1,2,3,n}`, performing the mutation and in parallel spawn `forge test --src ./cache/tmp/mutation-{1,2,3,n}` (you get the idea).
>
> If the command exits with status 0 the tests passed, otherwise they failed. Afterwards we clean up by deleting those directories (or keep them around for diffing). If our cache works on hashes we theoretically(?) would only recompile the mutated contract anyway.
That is close to how Vertigo works right now (except that it doesn't create all of the mutations up front), and how the integration would work 👍
> Vertigo works on solidity, but we may want to do part solidity, part raw opcodes so we can be faster and not have to recompile for all mutation types
There are some papers that evaluate this approach. Off the top of my head, there are two benefits to doing source code mutation:
Hey all,
We have been working on a mutation generator called Gambit. It is implemented in Rust so it might be possible to integrate it with Forge easily.
Gambit is a mutation generator only: it works by traversing the Solidity AST to identify where to perform mutations, performs them directly at the source level, and dumps the mutants in a designated directory.
We are actively working on adding more functionality and better mutation generation when possible so let us know if you have any suggestions. PRs are also welcome :)
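To make the AST-guided idea concrete, here is a rough Rust sketch (not Gambit's actual code) that walks a solc compact JSON AST and records candidate sites for a single `+` -> `-` operator mutation. It assumes the `serde_json` crate and an AST dumped with something like `solc --ast-compact-json`; a real tool would locate the operator inside the recorded span before rewriting:

```rust
use serde_json::Value;

/// Recursively walk the AST and collect the source spans of `+` binary
/// operations as (start, length) byte offsets into the original file.
fn collect_mutations(node: &Value, out: &mut Vec<(usize, usize)>) {
    if node.get("nodeType").and_then(Value::as_str) == Some("BinaryOperation")
        && node.get("operator").and_then(Value::as_str) == Some("+")
    {
        // `src` is "start:length:fileIndex" into the original source text.
        if let Some(src) = node.get("src").and_then(Value::as_str) {
            let mut parts = src.split(':');
            if let (Some(start), Some(len)) = (parts.next(), parts.next()) {
                if let (Ok(start), Ok(len)) = (start.parse(), len.parse()) {
                    out.push((start, len));
                }
            }
        }
    }
    // Recurse into both object fields and arrays so nested nodes are visited.
    match node {
        Value::Object(map) => {
            for child in map.values() {
                collect_mutations(child, out);
            }
        }
        Value::Array(items) => {
            for child in items {
                collect_mutations(child, out);
            }
        }
        _ => {}
    }
}

fn main() {
    // `foo.ast.json` is a placeholder for an AST produced by solc.
    let ast: Value = serde_json::from_str(
        &std::fs::read_to_string("foo.ast.json").expect("read AST"),
    )
    .expect("parse AST");
    let mut sites = Vec::new();
    collect_mutations(&ast, &mut sites);
    for (start, len) in sites {
        println!("candidate `+` -> `-` mutation in bytes {start}..{}", start + len);
    }
}
```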
This is great @chandrakananandi, thank you for sharing!
The AST modification logic makes sense - cc @mattsse, could we add a `mutate` fn to the `Project` struct in ethers-solc, which would traverse all the tests and perform a random mutation? And we would run the tests against the mutation inside of `forge mutate`?
There is also the universal mutator, which one of the engineers from Trail of Bits maintains - here is some usage: https://github.com/sambacha/universalmutator/tree/new-solidity-rules
> There is also the universal mutator, which one of the engineers from Trail of Bits maintains - here is some usage: https://github.com/sambacha/universalmutator/tree/new-solidity-rules
Indeed, it's very cool! The difference is that the universal mutator is regex-based, whereas Gambit is guided by the AST and also runs the compiler to filter out any invalid mutants. On the other hand, the universal mutator supports many more languages, since its technique relies on regexes and doesn't need any additional AST information! There is a cool paper on it too.
Would be supportive of a draft PR or some kind of RFC proposing exactly how we'd integrate the two together, if you'd want to give it a shot @chandrakananandi!
Hey @chandrakananandi, congrats on the announcement today!
Figured this might be a good time to revisit what the UX for an MVP forge integration might look like. Previously we came up with something like this:
* `forge test --mutate` first runs all regular tests and makes sure they pass
* `forge test --mutate skip-regular` could skip regular tests for quicker iteration during development
* If tests all pass, forge passes the remappings / config info to gambit, runs gambit against all `src/` contracts, and saves mutations to a temp dir
* For each mutation file in the tmp dir:
  * Apply the change then run all tests
    * Maybe run tests with a hardcoded small amount of fuzz runs for now to prevent runtime from blowing up, and skip invariant tests? In general mutating solidity and having to recompile feels like it might be slow anyway, perhaps better suited as a CI-only type of integration
  * We expect at least one test to fail for each applied mutation, so if a test failure happens the run is considered successful; if all tests still pass it's marked failed (see the sketch after this list)
    * Question here: What's the best way to handle benign mutations that are equivalent to the original?
* For each failed run, print the mutation to the console, then delete the tmp dir
* Like usual, exit with an error if tests fail (i.e. mutations were not detected by test suite)
* None of gambit's config options are exposed to the user, for simplicity
Follow up versions could:
* Respect the `[fuzz]` seed like invariant tests do
* Generate/link to the certora UI for visualizing results
* Integrate better failure persistence ([feat: fuzz failure persistence #2551](https://github.com/foundry-rs/foundry/issues/2551) and [feat: fuzz corpus saving and replay #2552](https://github.com/foundry-rs/foundry/issues/2552))
* Expose gambit config options to the user, perhaps via a `[mutation]` section in the foundry.toml file so you can have different profile configs, like you can with `[fuzz]` and `[invariant]` sections
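As a rough illustration of the per-mutant run and exit-code behaviour proposed above (the paths, the fuzz-run cap of 16, and treating any non-zero `forge test` exit as "killed" are assumptions for the sketch, not settled behaviour):

```rust
use std::process::{exit, Command};

fn main() {
    // Scratch copies of the project, one per mutant (placeholder paths).
    let mutants = ["/tmp/mutants/1", "/tmp/mutants/2", "/tmp/mutants/3"];
    let mut survived = Vec::new();

    for dir in mutants {
        // Cap fuzz runs to keep the overall runtime in check, per the
        // "hardcoded small amount of fuzz runs" idea above.
        let status = Command::new("forge")
            .args(["test", "--fuzz-runs", "16"])
            .current_dir(dir)
            .status()
            .expect("failed to run forge test");
        // A mutant is killed when at least one test fails (non-zero exit).
        if status.success() {
            survived.push(dir);
        }
    }

    for dir in &survived {
        println!("mutation not detected by the test suite: {dir}");
    }
    // Like a normal test run, exit non-zero when mutations went undetected.
    if !survived.is_empty() {
        exit(1);
    }
}
```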
Hi Matt @mds1!
Thanks for writing this out. The flow makes sense! One quick note: at the moment, Gambit actually already compiles the mutants to ensure that they are all valid, so the tmp directory you mentioned will only have mutants that compile. It is definitely slower, but since we have been using the mutants in the context of verification so far, it's not a bottleneck yet. In any case, we have ongoing work right now as part of which we are trying to use type information from the AST to generate mutants that are more likely to be valid.
As for the benign / equivalent mutant question: this is a very interesting problem, and full-on equivalence checking for mutants is yet another undecidable verification task. So our current thinking is that we should avoid generating benign mutants in the first place. For now, Gambit allows the user to specify which functions and contracts to mutate, which is an easy way to avoid some useless mutants, because hopefully the user has knowledge about the codebase that we can leverage. We are actively working on other lightweight strategies as well. There is also TCE (Trivial Compiler Equivalence), which people have used for this. The Certora Prover itself could be used here too, but like I said, that's yet another verification task on its own.
For now, I am most interested in gathering data (lots of mutants, the result of running tests on them, some potentially manual analysis of mutants to understand how often we get equivalent mutants) and guiding our strategy for equivalence detection based on that. For the scale of data we will need for this, I think integrating with a testing framework like foundry would be really great!
Is there any dataset of solidity contracts together with foundry tests that you could point me to?
> At the moment, Gambit actually already compiles the mutants to ensure that they are all valid. So the tmp directory you mentioned will only have mutants that compile.
Ah interesting, wonder if there should be an option to disable that, since otherwise we'll be compiling twice (once for gambit to verify, then the forge compilation)? Then instead forge would need to discard a mutation that doesn't compile
> Is there any dataset of solidity contracts together with foundry tests that you could point me to?
Are you just looking for repos that use foundry for building/testing? If so, the forge integration tests run against some external repos which could be a good starting point (some of those repos use dapptools, which foundry should be backwards compatible with). Then some other repos that use foundry include seaport, maple, alchemix, uniswap universal router, optimism's bedrock contracts, and a bunch of my own repos including multicall, solidity-generators, solidity-trigonometry, umbra. There's also a bunch in the awesome-foundry repo. But let me know if I misunderstood! 😅
> Ah interesting, wonder if there should be an option to disable that, since otherwise we'll be compiling twice (once for gambit to verify, then the forge compilation)? Then instead forge would need to discard a mutation that doesn't compile
Yes this is definitely possible and in fact something we might do in an upcoming PR.
> Are you just looking for repos that use foundry for building/testing? If so, the forge integration tests run against some external repos which could be a good starting point (some of those repos use dapptools, which foundry should be backwards compatible with). Then some other repos that use foundry include seaport, maple, alchemix, uniswap universal router, optimism's bedrock contracts, and a bunch of my own repos including multicall, solidity-generators, solidity-trigonometry, umbra. There's also a bunch in the awesome-foundry repo. But let me know if I misunderstood! 😅
This is wonderful, and exactly what I was looking for. I wanted to gather some benchmarks so that we can see how Gambit helps them. Thanks very much!
Hey @mds1, thought I'd jump in! I'm currently working on a Gambit PR that I'm hoping to land some time next week or the following Monday. After that I can look at integrating with Foundry.
WRT equivalent mutants: these are just hard! There are a few ways to handle them:
1. Don't generate equivalent mutants to begin with! This can be done a few ways, e.g. by suppressing source patterns that are known to produce equivalent mutants: mutating `a` to `a++` when `a` is going out of scope, because the infected state caused by the post-increment will never propagate to a test-visible location; or `int i = 0; if (cond) { i = 1; } else { i = 2; }`, where mutating the initialization value `0` will only result in a write that is never read. These patterns depend on the language, coding styles within the language, and a number of other factors. I'm pretty new to solidity (I've mainly been researching in Java land), so I don't know what the best suppression rules are just yet!
2. Detect equivalent mutants with an optimizing compiler: trivial compiler equivalence (TCE) is a technique that just runs optimizing compilers on all the mutants and checks them for equality with the original program. Optimizing compilers tend to normalize or canonicalize programs, so syntactically distinct programs with the same semantics often end up being compiled to the same binary (see the sketch below). Depending on workload, or if you are already compiling, this can be feasible, but it blows up at scale.
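A minimal sketch of what that TCE check could look like for Solidity, assuming `solc` is on the path and the file names are placeholders. Comparing `--bin` output line-by-line is a simplification: a real check would also strip the metadata hash (e.g. by compiling with `--metadata-hash none`) before comparing.

```rust
use std::process::Command;

/// Compile a file with the optimizer on and return just the emitted bytecode
/// (the hex lines of `solc --bin` output), ignoring the per-file headers.
fn optimized_bytecode(path: &str) -> Vec<String> {
    let out = Command::new("solc")
        .args(["--bin", "--optimize", path])
        .output()
        .expect("failed to run solc");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter(|l| !l.is_empty() && l.chars().all(|c| c.is_ascii_hexdigit()))
        .map(str::to_owned)
        .collect()
}

fn main() {
    // Trivial Compiler Equivalence: if the optimized build of the mutant is
    // identical to that of the original, the mutant is (trivially) equivalent
    // and can be discarded without running any tests.
    let original = optimized_bytecode("src/Original.sol");
    let mutant = optimized_bytecode("mutants/1/Original.sol");
    if original == mutant {
        println!("mutant is trivially equivalent -- discard it");
    } else {
        println!("mutant differs after optimization -- keep it");
    }
}
```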
There is also the problem of mutant redundancy. For instance, consider the boolean condition `c` and the following three mutants:
| `c` | `!c` | `true` | `false` |
|---|---|---|---|
| true | false (X) | true | false (X) |
| false | true (X) | true (X) | false |
I've marked all cells where the mutant differs from the original value with an (X). Note that `!c` is always different from `c`: this means that any test that can kill the 2nd mutant `c -> true` will also kill the first mutant `c -> !c`; however, the reverse is not true. This makes the second mutant strictly harder to kill than the first mutant; we say that `true` dominates `!c`, or that `!c` is redundant to the original program `c`.
Redundancy is bad: the more redundant the mutant is (that is, the easier it is to kill), the less utility it has for things like mutation score (killed/total) or as a testing goal for a developer.
As a trivial example, if a mutant always causes a crash it is very easy to detect: all tests that execute this mutant kill it! This is called a trivial mutant.
So avoiding equivalent mutants is really important, but trying to minimize redundancies is also critical.
All this to say, I'm hoping to improve Gambit's mutation routine to avoid many common patterns of equivalent mutants and to try to start reasoning about common mutant redundancies.
This brings me to Foundry! If we are able to integrate Gambit with Foundry we can use test data to compute the kill matrix (pretty badass, no?). This is just a binary matrix of "does test T kill mutant M?". We can use this data to approximate redundancy information and help detect equivalent mutants. It is of course very expensive to compute, so I'd be inspecting it manually to improve Gambit for future uses.
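For concreteness, here is a small Rust sketch of the kill matrix idea and how dominance could be read off it; the matrix shape (tests as rows, mutants as columns) and the toy data are just assumptions for the example:

```rust
/// `matrix[t][m]` is true when test `t` kills mutant `m`.
fn mutation_score(matrix: &[Vec<bool>]) -> f64 {
    let mutants = matrix.first().map_or(0, |row| row.len());
    if mutants == 0 {
        return 0.0;
    }
    // A mutant counts as killed if any test kills it.
    let killed = (0..mutants)
        .filter(|&m| matrix.iter().any(|row| row[m]))
        .count();
    killed as f64 / mutants as f64
}

/// Mutant `a` dominates mutant `b` when every test that kills `a` also kills
/// `b`, and some test kills `b` without killing `a`: `a` is strictly harder
/// to kill, so `b` is redundant.
fn dominates(matrix: &[Vec<bool>], a: usize, b: usize) -> bool {
    let subset = matrix.iter().all(|row| !row[a] || row[b]);
    let strict = matrix.iter().any(|row| row[b] && !row[a]);
    subset && strict
}

fn main() {
    // Toy 2-test x 3-mutant matrix (made-up data, tests as rows).
    let matrix = vec![
        vec![true, false, true],  // test 0 kills mutants 0 and 2
        vec![true, false, false], // test 1 kills mutant 0 only
    ];
    println!("mutation score: {:.2}", mutation_score(&matrix)); // 0.67
    println!("mutant 2 dominates mutant 0: {}", dominates(&matrix, 2, 0)); // true
}
```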
Anyway, that's my brain dump...excited to chat more once I land this PR!
Hey all, poking around a bit in `forge test` related code to see how easy it would be to add a `--mutate` arg. Here is a quick overview of my understanding of how this should work, as well as a few questions that maybe I can get some guidance on.

1. **Run Tests On Project**: Ensure we have a passing test suite before mutation; abort otherwise.
2. **Mutate Code**: Apply mutation operators, resulting in a mutated file set.
   - `forge test --mutate` makes this a little weird because the `test` subcommand now needs config arguments that only apply in the presence of the `--mutate` option (e.g., `--mutation-operators ...`). Is it better to create a new `forge mutation-test` subcommand to encapsulate this functionality?
3. **Run Tests On Mutants**: The mutants are AST mutants, and are written to disk as source files. We will need to swap the mutated source in for the project source for each mutant, run the testing infra, and then restore the original file for the next mutation run. This means we probably just want to copy the entire project structure to a temp directory and clean it up afterwards (this seems like the approach folks are suggesting above). And as is suggested, we could even make a few copies of the project to run in parallel:
   - Is it enough to manipulate `project.paths.root` to copy a project and point a new `Project` struct at it?
4. **Report Mutation Run**: There are various statistics we can report. The basic one is mutation score (detected mutants / generated mutants). The detection matrix is also useful for research purposes, and is just a T x M boolean matrix representation of the predicate "test `t` detected mutant `m`".

**Test Coverage Data**

Running tests is expensive, so avoiding test runs that don't ever touch mutated code is desirable. To do this I'd want coverage information (a rough sketch of the filtering idea follows this comment). Is coverage information available from `execute_tests()`? I'm not sure if it's worth using test coverage in the initial PR, maybe I'll just get a proof of concept going. But I think incorporating coverage will be worth it in the medium to long term.
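Here is a rough sketch of that coverage-based filtering, under the assumption that some coverage source exists mapping each test to the byte ranges it executes in the mutated file (forge does not expose this from `forge test` today, as noted in the reply below):

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Only tests whose covered byte ranges overlap the mutated span need to run.
/// The coverage data shape here is hypothetical.
fn tests_worth_running<'a>(
    coverage: &'a HashMap<String, Vec<Range<usize>>>, // test name -> covered byte ranges
    mutant_span: Range<usize>,
) -> Vec<&'a str> {
    coverage
        .iter()
        .filter(|(_, ranges)| {
            ranges
                .iter()
                .any(|r| r.start < mutant_span.end && mutant_span.start < r.end)
        })
        .map(|(name, _)| name.as_str())
        .collect()
}

fn main() {
    let mut coverage: HashMap<String, Vec<Range<usize>>> = HashMap::new();
    coverage.insert("test_transfer".to_string(), vec![0..120, 300..420]);
    coverage.insert("test_mint".to_string(), vec![500..640]);
    // A mutant touching bytes 310..315 only needs `test_transfer` to be re-run.
    println!("{:?}", tests_worth_running(&coverage, 310..315));
}
```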
Hey @BenTheKush! Here are some answers to your questions:
2: I think we might already have some conditional args like that and I believe clap supports that, so it should be ok. Deferring to @Evalir @mattsse to confirm. Note that we'll also need to support those options via the CLI and via the config file; I don't think they currently sync automatically (i.e. options need to be added in two places)
3: One downside here is that copying the whole directory might get costly (i.e. slow) because some deps can be quite large with a lot of unrelated files (e.g. when installing chainlink as a dep for their interfaces, you get tons of non-solidity code from the rest of the system). Since I don't think we need to generate mutants for deps, we might want to try only copying the src dir (i.e. the set of files that may be mutated) and symlink to the rest (see the sketch after this comment). Not sure offhand if this will cause issues with remappings though, another @mattsse or @Evalir question. But otherwise yea I think that should be sufficient
Test coverage: Coverage data is only generated when running `forge coverage`, not when running `forge test`. Note that it also has some limitations right now (notably, it requires no optimizer and no via-ir) and some bugs (you can search the repo issues to learn more), so ideally for now we'd leave mutant improvements due to coverage as an optional feature
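A minimal sketch of that copy-and-symlink layout (Unix-only via `std::os::unix`, with the default `src/`, `lib/`, and `foundry.toml` names assumed; whether remappings tolerate the symlinks is the open question above):

```rust
use std::fs;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Build a scratch project: copy `src/` (the only tree we mutate) and symlink
/// everything else back to the original checkout.
fn prepare_scratch(project_root: &Path, scratch: &Path) -> std::io::Result<()> {
    // Use an absolute path so the symlinks resolve from inside the scratch dir.
    let root = fs::canonicalize(project_root)?;
    fs::create_dir_all(scratch)?;
    copy_dir(&root.join("src"), &scratch.join("src"))?;
    symlink(root.join("lib"), scratch.join("lib"))?;
    symlink(root.join("foundry.toml"), scratch.join("foundry.toml"))?;
    Ok(())
}

/// Recursively copy a directory (std has no built-in helper for this).
fn copy_dir(from: &Path, to: &Path) -> std::io::Result<()> {
    fs::create_dir_all(to)?;
    for entry in fs::read_dir(from)? {
        let entry = entry?;
        let target = to.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), target)?;
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    prepare_scratch(Path::new("."), Path::new("/tmp/mutation-scratch"))
}
```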
Thanks @mds1, yeah this all makes sense. Having conditional args would be perfect. And I agree, copying an entire directory over for mutation is not ideal. I might start there for a proof of concept/MVP and then see if there is something clever we can do to avoid copying everything. There's probably some manipulation of the `Project` struct I can do to get away with some stuff?
I'm not super familiar with the Solidity ecosystem so not sure what the stumbling blocks are...but I guess this is a great way to learn :)
WRT Coverage: happy to make it conditional! And again, I probably won't be implementing any optimizations in the first pass.
I'm going to do some requirements engineering this week and try to come up with a design. I think a lot of the changes will be happening on the Gambit side anyway, and I'll set up an API that `forge test --mutate` can call into.
Hi there! I think mutation testing would be very useful to enhance testing suites. Did someone make progress on this topic since May?
@MerlinEgalite we had started investigating gambit + foundry but had other things come up. We have plans to resume this in early 2024. Will keep you posted.
Hi @chandrakananandi, I'd like to work on integrating Gambit and Foundry
This is the doc about integrating Foundry and Gambit. Please feel free to drop your comments
https://hackmd.io/@tICezjHsSaiehIn9jbcUAA/rJXZb_JS6
@chandrakananandi Does Certora have any plans on open-sourcing the UI you demoed at your talk at SoliditySummit?
The UI source is actually open-source: https://github.com/Certora/gambit-report-viewer but note that it was not implemented by Certora and isn't really regularly maintained at the moment. We have plenty of ideas for how to improve it but we don't have resources to invest in it right now.
@chandrakananandi I just realized you dropped some comments on the design doc. I have responded.
Are you ok with joining a Telegram group if I create one to sync on this?
If anyone would like to join a Telegram group to find out about progress on this or give feedback on the design doc here is a link https://t.me/+a_SIusNyUMkzNDI8
The group will be deleted once this is accomplished
Right now I'm at the stage of testing the generated mutants and generating a mutation test report.
Progress updates on this task can be found here https://hackmd.io/@tICezjHsSaiehIn9jbcUAA/SkTEyvuHa
We now have an alpha working implementation of integrating Gambit with Foundry. If you have any questions please feel free to join the TG group posted above and ask.
@chandrakananandi If you could join the Telegram group posted above that would be awesome.
Also, how open is the Certora team to allowing changes to Gambit?
@samparsky Thanks so much for the updates! I responded to all the points in the doc: https://hackmd.io/@tICezjHsSaiehIn9jbcUAA/SkTEyvuHa
I am not the best at keeping up with telegram and other apps but I will join it!
I am certainly open to accepting PRs for gambit for improving the tool!
Mutation testing is used to design new tests and evaluate the quality of an existing test suite. In other words, it can tell you about the quality of your tests, where coverage only tells you about how much code you have not yet written a test for.
It works by changing the behavior of a program by mutating parts of the code, running the test suite against this new mutation, and checking whether the test suite still passes. If the test suite passed, the mutation is considered to have survived; if not, it is considered to have died.
If `dead / total` is below some threshold, then there is room for improvement in the test suite. In other words, the more dead mutations you have, the better.

I think this would be a pretty cool tool to have in Foundry - and I think it is somewhat trivial to implement as well (simplified strategy):
Here's a paper on mutation testing for Solidity that I found interesting: https://arxiv.org/pdf/2105.03626.pdf