Re-implement libcannoli as a TCG plugin

stsquad commented 2 years ago

Since the only two things you want to hook into are instruction execution and memory accesses, you could use the existing TCG plugin infrastructure for your backend and remove the dependency on patching QEMU itself. See https://qemu.readthedocs.io/en/latest/devel/tcg-plugins.html for some examples and the API.

novafacing commented 2 years ago

Have to +1 here; I'm fixing yet another problem with the patch right now :( That said, I suspect a TCG plugin is going to be a lot slower than the current implementation.

gamozolabs commented 2 years ago

I'm familiar with the TCG plugin APIs. Unfortunately, the goal of this project is quite different from that of TCG hooks: it's focused entirely on real-time tracing and logging of a process with minimal overhead. This project is not meant to replace standard hooking/plugin/callback models, which offer a significantly simpler development environment but significantly less performance.

Further, some of the hooks that I need (e.g., mmap() and munmap()) would still require patches, or some more sophisticated hooks (like actually inferring the syscalls based on execution flows, which would ruin performance).

For those not concerned with performance, either writing traditional QEMU plugins or using something like Unicorn makes way more sense. This is designed for situations where those are simply too slow.

That is, unless there's a way for me to inject raw x86 instructions into the JIT stream and I'm missing it. The original implementation actually used TCG ops in the stream, but the performance there was still not sufficient for the use cases here. TCG is very memory heavy and I was unable to get it to emit efficient JIT code, even with direct patch-based hooks.

novafacing commented 2 years ago

@gamozolabs that makes sense, I've also experimented with TCG plugins in the past and had very little success getting anything approaching "fast enough" (not to mention, there is a lot of stuff intentionally not exposed^1)

gamozolabs commented 2 years ago

I'm always open to new ideas for how I hook into QEMU though. And there are definitely rough edges inside Cannoli (I should use an assembler rather than doing the weird templatey assembly blocks that I do). I haven't found any better way to handle the patch, unfortunately. It seems it just needs pretty frequent updates... which honestly I didn't anticipate being this unstable when I originally designed it. It's a bit of a pain, but honestly it's maybe a 1-5 minute hiccup for me every month or so.

novafacing commented 2 years ago

For me the main headache with the patch is just using git am instead of patch -p1; maybe it would make sense to just suggest using patch -p1 in the README instead? It's really easy to fix the patch when mainline QEMU breaks it, but git keeps being weird about SHA-1 hashes and such when creating the fake commits. Just an idea :)

stsquad commented 2 years ago

> @gamozolabs that makes sense, I've also experimented with TCG plugins in the past and had very little success getting anything approaching "fast enough" (not to mention, there is a lot of stuff intentionally not exposed^1)

I'm confused about the "not fast enough" part. I had a brief look over the patch and I couldn't see how it achieves the speedup. Plugin callbacks are pretty much as fast as they can be, inline with the generated JIT code. How does your patch differ? We could certainly consider faster inline approaches: we already have PLUGIN_CB_INLINE and use it for simple counter increments. We will never expose TCG ops directly to the plugin though.
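
To make the callback-versus-inline distinction concrete, here is a toy model in Rust. It is not QEMU's plugin API: `Op`, `Stats`, `on_insn`, `translate`, and `run` are hypothetical names invented for illustration. It only shows the shape of the trade-off, assuming an inline op can do nothing more than add a constant to a counter, while a callback receives per-instruction data such as the address.

```rust
/// Toy model only: these types are hypothetical and are NOT QEMU's API.
/// An `Op` is the instrumentation attached to one guest instruction.
enum Op {
    /// Call out to plugin code, passing the instruction's address.
    Callback(fn(&mut Stats, u64)),
    /// Add a constant to a counter without calling into plugin code.
    InlineAdd(u64),
}

#[derive(Default, Debug, PartialEq)]
pub struct Stats {
    pub insn_count: u64,
    pub last_pc: u64,
}

/// Hypothetical plugin callback: counts the instruction and records its address.
fn on_insn(stats: &mut Stats, pc: u64) {
    stats.insn_count += 1;
    stats.last_pc = pc;
}

/// "Translate" a block once, choosing one instrumentation op per instruction.
fn translate(pcs: &[u64], inline: bool) -> Vec<(u64, Op)> {
    pcs.iter()
        .map(|&pc| {
            let op = if inline {
                Op::InlineAdd(1)
            } else {
                Op::Callback(on_insn)
            };
            (pc, op)
        })
        .collect()
}

/// "Execute" the translated block `times` times and return what was observed.
pub fn run(pcs: &[u64], inline: bool, times: usize) -> Stats {
    let block = translate(pcs, inline);
    let mut stats = Stats::default();
    for _ in 0..times {
        for (pc, op) in &block {
            match op {
                Op::Callback(cb) => cb(&mut stats, *pc),
                // The inline op never sees `pc`; it can only bump the counter.
                Op::InlineAdd(n) => stats.insn_count += *n,
            }
        }
    }
    stats
}
```

In this model both variants count the same number of instructions, but only the callback variant fills in `last_pc`. That mirrors the limitation being discussed: inline operations are cheap but cannot express the kind of per-event logging a tracer needs.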

novafacing commented 2 years ago

As a follow-up to this, I've gone ahead and gone a little bit past proof-of-concept in developing https://github.com/novafacing/cannonball. Basically, it's what I would make if Cannoli were a QEMU plugin, and it is indeed much slower, by a couple orders of magnitude (there are some optimizations I could do, but they won't make a 100x difference, maybe 10x). That said, I think it's a little easier to work with, so as usual it's not a case of better or worse, just a case of tradeoffs.

I think we can call this issue solved!

stsquad commented 2 years ago

Can I see the PoC to get an idea of what you are trying to do so I can inform future development directions?

novafacing commented 2 years ago

@stsquad yep! The code is in the link above. Theoretically the build instructions with meson will work, but I admittedly haven't tested much. The QEMU C plugin portion is in https://github.com/novafacing/cannonball/tree/main/src, and I've tried to replicate Cannoli as closely as possible (down to having the plugin pass things to Rust code via FFI, and then letting the Rust code pass things over, in this case, a Unix socket to a consumer).
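
For readers who want a picture of the Rust-to-consumer half of that pipeline, here is a minimal sketch. It is not cannonball's actual code or wire format: the `Event` type, the 17-byte record layout, and the function names are all assumptions made for illustration, and the C-to-Rust FFI boundary is left out entirely. It shows fixed-size event records written to one end of a Unix socket pair and decoded on the other.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;

/// Hypothetical trace event; the real project's event set may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Exec { pc: u64 },
    Mem { pc: u64, addr: u64, is_write: bool },
}

/// Assumed record layout: 1 tag byte + two little-endian u64 fields.
const RECORD_LEN: usize = 17;

fn read_u64(bytes: &[u8]) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(bytes);
    u64::from_le_bytes(raw)
}

impl Event {
    fn encode(&self) -> [u8; RECORD_LEN] {
        let (tag, pc, addr) = match *self {
            Event::Exec { pc } => (0u8, pc, 0u64),
            Event::Mem { pc, addr, is_write } => (if is_write { 2 } else { 1 }, pc, addr),
        };
        let mut buf = [0u8; RECORD_LEN];
        buf[0] = tag;
        buf[1..9].copy_from_slice(&pc.to_le_bytes());
        buf[9..17].copy_from_slice(&addr.to_le_bytes());
        buf
    }

    fn decode(buf: &[u8; RECORD_LEN]) -> io::Result<Event> {
        let pc = read_u64(&buf[1..9]);
        let addr = read_u64(&buf[9..17]);
        match buf[0] {
            0 => Ok(Event::Exec { pc }),
            1 => Ok(Event::Mem { pc, addr, is_write: false }),
            2 => Ok(Event::Mem { pc, addr, is_write: true }),
            tag => Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("unknown event tag {}", tag),
            )),
        }
    }
}

/// Producer side: what the plugin-facing Rust code would do for each event.
pub fn send_events(stream: &mut UnixStream, events: &[Event]) -> io::Result<()> {
    for ev in events {
        stream.write_all(&ev.encode())?;
    }
    Ok(())
}

/// Consumer side: read whole records until the producer closes the socket.
/// A trailing partial record is treated as end of stream and discarded.
pub fn recv_events(stream: &mut UnixStream) -> io::Result<Vec<Event>> {
    let mut out = Vec::new();
    let mut buf = [0u8; RECORD_LEN];
    loop {
        match stream.read_exact(&mut buf) {
            Ok(()) => out.push(Event::decode(&buf)?),
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(out),
            Err(e) => return Err(e),
        }
    }
}

/// Wire both sides together over a socket pair, producer on its own thread.
pub fn roundtrip(events: &[Event]) -> io::Result<Vec<Event>> {
    let (mut tx, mut rx) = UnixStream::pair()?;
    let to_send = events.to_vec();
    // `tx` is dropped when the thread finishes, which signals EOF to the reader.
    let producer = thread::spawn(move || send_events(&mut tx, &to_send));
    let received = recv_events(&mut rx)?;
    producer.join().expect("producer thread panicked")?;
    Ok(received)
}
```

Even in this simplified form, every event costs an encode and a socket write. That per-event work is the kind of overhead the discussion above is concerned with, and a real implementation would likely batch records before writing.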

As far as informing development direction for QEMU plugins, what I'd really like is to just have access to the CPU pointer without having to patch QEMU, but I know that's a pretty contentious request. Failing that, functions to access guest registers or memory would be really nice :) The inability to read guest state from a plugin is pretty much the main reason every binary analysis project has to maintain a fork of QEMU (see panda, shellphish-qemu, cannoli via patch, qemuafl, etc.), so it would be really awesome to come to a solution there.

gamozolabs commented 1 year ago

Closing as this is not feasible with the performance and control needed.

MarginResearch/cannoli, issue #7