Adding syscall semantics fuzzing -- beyond thread interleavings

rrnewton commented 1 year ago

In its initial release, hermit run --chaos is focused on exploring different thread interleavings, and of course it also provides control over RNG. But thread interleavings & RNG are not the only sources of nondeterminism in Linux.

This issue: Exercising other syscalls' nondeterminism

There are many places where the Linux syscall semantics expose nondeterministic outcomes. Each of these is a candidate for fuzzing user space (i.e. acting as a "Fuzzy Linux" that deliberately misbehaves within what the semantics allow, exercising those outcomes). This is a task to add fuzzing of these system calls as well, for a more complete and aggressive --chaos mode.

Here is a checklist of the syscalls we plan to make fuzzy.

[ ] read/write: how many bytes of IO are performed
[x] futex: which threads to wake on futex_wake (--fuzz-futexes)
[ ] mmap: address space returned (e.g. ASLR)
[ ] all syscalls: returning extra EINTRs or other error conditions
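
To make the read/write and EINTR items concrete, here is a minimal in-process sketch in Python of what fuzzed read semantics look like from user space. It is not how Hermit implements anything: `FuzzyReader` and `read_exact` are hypothetical names, and the 25% fault rate is an arbitrary assumption. The wrapper returns fewer bytes than requested and sometimes raises `InterruptedError` (Python's counterpart of EINTR); the caller loop shows the handling such fuzzing is meant to exercise.

```python
import io
import random


class FuzzyReader:
    """Hypothetical wrapper that mimics fuzzed read() semantics in-process."""

    def __init__(self, raw, seed, eintr_rate=0.25):
        self._raw = raw
        self._rng = random.Random(seed)  # seeded, so every run is reproducible
        self._eintr_rate = eintr_rate    # assumed rate, chosen arbitrarily

    def read(self, n):
        # Occasionally fail before transferring any data, as EINTR would.
        if self._rng.random() < self._eintr_rate:
            raise InterruptedError("injected EINTR")
        # Otherwise perform a short read of between 1 and n bytes.
        return self._raw.read(self._rng.randint(1, n))


def read_exact(reader, n):
    """Read n bytes (or up to EOF), tolerating short reads and EINTR."""
    chunks = []
    remaining = n
    while remaining > 0:
        try:
            chunk = reader.read(remaining)
        except InterruptedError:
            continue  # no data was consumed, so retrying is safe
        if not chunk:
            break  # EOF
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    payload = b"hello, fuzzy linux"
    reader = FuzzyReader(io.BytesIO(payload), seed=7)
    print(read_exact(reader, len(payload)))
```

Code that issues a single `reader.read(n)` and assumes it received all `n` bytes would pass against a plain file and fail against this wrapper, which is the class of bug the checklist targets.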

N.B. All of them will be controlled by the same source of randomness (--fuzz-seed), which is separate from --sched-seed and --rng-seed, allowing these dimensions to be controlled individually. We could go further and separate seeds for each of the above if we liked.
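
The value of keeping the seeds separate can be sketched as follows. This is an illustrative Python model only, not Hermit code: `ChaosSeeds`, `trace`, and `substream` are hypothetical names, and the hash-based derivation is just one assumed way to split a single fuzz seed into per-syscall streams.

```python
import hashlib
import random


class ChaosSeeds:
    """Hypothetical holder with one independent RNG stream per dimension."""

    def __init__(self, sched_seed, rng_seed, fuzz_seed):
        self.sched = random.Random(sched_seed)  # thread interleavings
        self.rng = random.Random(rng_seed)      # randomness given to the guest
        self.fuzz = random.Random(fuzz_seed)    # syscall semantics fuzzing


def trace(seeds, steps=5):
    """Draw a few decisions from each stream to show they do not interact."""
    return {
        "sched": [seeds.sched.randrange(4) for _ in range(steps)],
        "rng": [seeds.rng.randrange(256) for _ in range(steps)],
        "fuzz": [seeds.fuzz.randrange(1 << 32) for _ in range(steps)],
    }


def substream(seed, label):
    """Assumed scheme: derive a per-syscall stream from one seed and a label."""
    digest = hashlib.sha256(f"{seed}:{label}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


if __name__ == "__main__":
    base = trace(ChaosSeeds(1, 2, 3))
    varied = trace(ChaosSeeds(1, 2, 99))  # only the fuzz seed changes
    print(base["sched"] == varied["sched"], base["rng"] == varied["rng"])
```

Because each dimension draws from its own generator, changing the fuzz seed leaves the scheduling and RNG decisions untouched, so a failure can be attributed to one dimension at a time. Deriving labeled substreams would extend the same isolation to individual syscall families.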

Out of scope

Also, there are related topics, additional dimensions worth fuzzing in their own right for correctness stress testing, that are beyond the scope of this issue:

adding network delay
dropping network connections

cameronelliott commented 6 months ago

Whoa! Cool to find this issue!

I just discovered both Hermit & Reverie, and I must say "very cool stuff!"

I was thinking about how to use both as an alternative to in-codebase deterministic simulation with fault-injecting testing. The possibility of using Hermit & Reverie with a lot less effort spent on building an in-codebase simulator is quite exciting!

My main interest is around I/O fault-injection: network & disk, beyond the scheduler support in 'chaos'.

I was thinking about how to go about it. Clearly Hermit could be extended to do it. I was also wondering if Hermit + Reverie-Chaos could be used together in order to avoid touching Hermit (despite perf concerns). So, before I found this issue, I tried it.

But that's not going to work:

c@intel12400 ~/reverie (main)>
~/hermit/target/debug/hermit run ~/reverie/target/debug/chaos cat LICENSE
WARNING: --preemption-timout requires hardware perf counters which is not supported on this host, resetting preemption-timeout to 0
thread 'main' panicked at reverie-examples/chaos.rs:186:25:
Failed building the Runtime: Os { code: 9, kind: Uncategorized, message: "Bad file descriptor" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

But it was a worthwhile experiment, right? :smile:

So, given this GitHub issue expressing interest in expanding the chaos fuzzing, and the evidence that Reverie can't be layered upon Reverie, it seems like Hermit is the best place to put this type of nondeterminism or fault injection?

Some of the events I am interested in, and may explore:

TCP stuff like: dropped connections, HostUnreachable-err, ConnectionRefused-err, etc
Disk I/O: read/write failures etc.

That's it, I'm just exploring so far, but I thought maybe it made sense to say hello. Thanks

VladimirMakaev commented 6 months ago

Hi @cameronelliott

Just to let you know, we're not actively working on Hermit in the team, but we should be able to merge contributions if you choose to send some. However, expect very limited guidance on our end, since it's purely on a voluntary basis.

I was also wondering if Hermit + Reverie-Chaos could be used together in order to avoid touching Hermit (despite perf concerns). So, before I found this issue, I tried it

I don't think this is possible: Reverie is based on ptrace, so you can't layer the two. There is reverie-sabre, which is based on user-mode interception, but it is not feature complete as far as I know. You can explore that too if you're up for it.

My main interest is around I/O fault-injection: network & disk

This is probably just missing bits that need to be implemented in Hermit itself. But just to be clear, there are missing bits in various parts of Hermit, e.g. not all syscalls are handled deterministically, or replayed deterministically, but there is a good number of programs that work correctly. You can get an idea of what is working by looking at the tests.

WARNING: --preemption-timout requires hardware perf counters which is not supported on this host, resetting preemption-timeout to 0

I noticed that you got this warning. Hermit is very sensitive to hardware features, so if you're running it in a VM, the VM needs to support hardware perf counters; otherwise it won't work properly. I recommend working on Linux installed on bare metal. Things like Docker and WSL won't work, and you might have a hard time figuring out what's going on.

Hope this helps you to get started

cameronelliott commented 6 months ago

Hey @VladimirMakaev, thanks for the update on the status of Hermit. Thank you also for the pointer to the tests, that is helpful to know.

In spite of the 'sleep' status of Hermit I might still explore using it as a tool for deterministic simulation plus fault injection. At least I know the risks now. 🙃

It really seems like a one-of-a-kind tool with great potential for simulation and fuzzing.

I came across Hermit due to the announcement by Antithesis.com, which is a proprietary tool. (But kudos to them for making it work as a business! That's good news for software in general and deterministic sim testing)

@VladimirMakaev Let me just ask anyway: do you know of any open source projects that are more active than Hermit, comparable to it, and could be the basis of a tool for deterministic simulation and fuzzing/fault injection?

facebookexperimental / hermit

Adding syscall semantics fuzzing -- beyond thread interleavings #34
