falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0
222 stars 161 forks source link

[DISCUSSION] [PLUGIN API] Notes from the Rust world #1672

Open gnosek opened 7 months ago

gnosek commented 7 months ago

Here are some notes I made while working on the Falco Plugin Rust SDK. Please note that they're not blockers for the Rust SDK itself (everything here has a workaround, better or worse). I hope that they can drive future API evolution and make it safer and easier to use. Please consider all suggestions with an implied "if it doesn't make life for other SDKs or the plugin framework too hard".

Strings without an explicit owner

There are a few strings that I found that do not have a specific struct as their owner, preventing usage of instance fields for the underlying storage. Basically, every API method that returns a pointer but does not take a pointer as an argument is potentially problematic.

The following lists the uglier cases I've found so far.

plugin_get_fields and plugin_get_init_schema

The issue here is that it's tempting to generate the resulting string automatically (writing JSON by hand is not exactly my definition of fun) and we do need to store the result somewhere.

It may be possible to generate these at compile time but it would require procedural macros which are fairly magical and opaque. My goal is to avoid proc macros so that the whole API can be discovered and autocompleted by the IDE and the compiler.

Proposed solution

Split allocation and initialization of the plugin into two phases and introduce a ss_plugin_t* parameter to all methods that do not take one now (it's less problematic for things like plugin_get_name since it's easier to hardcode their value at compile time, but we should do it for consistency anyway). So instead of:

the workflow would be:

plugin_get_last_error

The situation is somewhat different here as we do have an instance pointer, but it can be potentially accessed from multiple threads (given there are API methods that both take a ss_plugin_t* and allow multithreaded access).

Assuming that plugin_get_last_error is used for all error reporting, this implies that multiple threads can share the same ss_plugin_t* and its error message buffer, which requires a Mutex to prevent multiple threads from writing to the buffer concurrently.

More importantly, since we cannot return a reference with a lifetime attached, we cannot unlock the mutex whenever the caller is done with the error string. This means that the returned pointer must remain valid even after the mutex is unlocked. As a consequence, we may never reallocate the buffer, so we're forced to use a fixed-length array.

Now, to make sure a catastrophe won't happen with multiple threads reading and writing the error buffer, we need to zero-fill it before every write, so that it's always NUL-terminated, even while we're writing to it (note that non-Rust code in a different thread can easily read the error buffer via a pointer obtained whenever, ignoring the mutex completely).

Proposed solution

I feel it would be best to drop the whole concept of a plugin instance as a separate entity. If you want multiple plugin instances, just allocate multiple ss_plugin_t* instances. Then it's clear that every thread has its own error message buffer and can use it freely without locking.

ss_plugin_t* not safely accessible from ss_instance_t* methods

This again ties into thread safety and the Rust model of memory safety. A non-const pointer naturally maps to a &mut reference, but that is instant UB in multi-threaded Rust: multiple mutable references cannot even exist (unused) without causing UB and actually accessing an object through multiple mutable references from different threads is a recipe for disaster (including the worst case where everything works fine--for now).

Passing shared references to the plugin to instance methods doesn't help all that much: all other methods would need to take shared references as well, with all the overhead that it brings (every mutable field hidden behind a mutex, resulting in abysmal developer experience and performance penalty). So in the Rust SDK, I chose not to expose the ss_plugin_t* parameter to ss_instance_t* methods--the instance can pull in all the data in wants from the main plugin e.g. by Arc<T> (for read-only data) or Arc<Mutex<T>> (for mutable data).

Proposed solution

Again, kill the ss_instance_t* concept and just use multiple ss_plugin_t* instances.

Unclear thread-related lifetime rules

Is there a guarantee that whichever thread allocates a plugin or instance is responsible for freeing it later, or can allocations and frees happen in different threads (for the same object)?

Can a particular object be accessed from multiple threads? Sequentially? Concurrently?

This has consequences for the trait bounds for the plugin implementations that would e.g. disallow thread-local storage or other non-thread-safe constructs.

Proposed solution

Clarify the rules. It feels okay to guarantee that a particular object will only ever be accessed by the thread that created it, except for the unfortunate "instance accessing plugin" case (which the Rust SDK disallows).

If we dropped ss_instance_t* and accepted this rule, thread safety would become much simpler IMHO in the API: it would become entirely single-threaded for any particular object and any threading issues become internal to the plugin itself.

Unclear lifetime rules for strings (pointers) passed to plugins

For any pointer passed from the plugin API to the plugin, it would be nice to know how long it remains valid; basically: can I store the pointer or do I need to copy the pointee. The two obvious answers (different ones for different pointers) would be:

Proposed solution

Document the lifetime rules for all pointers.

Inconsistent support for data types

There are a few different uses for the various PT_*-described types:

The documentation suggests that table keys and values support different sets of types (involving the ss_plugin_field_t enum), but the method signatures themselves say that keys and values can use the same data types (basically: whatever is convertible to ss_plugin_state_data).

Values for extracted fields are set in ss_plugin_extract_field.res, which is a union nearly identical to ss_plugin_state_data*.

I realized that to support a data type we need two sets of conversion functions:

but the second conversion is ever so slightly different between table access and field extraction.

Proposed solution

Make ss_plugin_extract_field.res a pointer to ss_plugin_state_data.

Maybe we also want to make ss_plugin_state_data a tagged union (i.e. a struct that contains an enum describing which union variant is live and the union itself).

Note that we need something tricky for the v4/v6 combined fields as currently there's no way to extract a list of addresses due to their variable length. We can give up and use a pointer to a separately allocated struct but that introduces unnecessary allocations. Alternatively, we can use a fixed-size struct that would fit both address types, like:

struct v4_or_v6_addr
{
    uint32_t family; // AF_INET or AF_INET6
    union
    {
        uint32_t ipv4;
        uint8_t ipv6[16];
    };
}

Unclear endianness rules

I realize this is not an issue with the plugin API per se, but the endianness of values is not specified anywhere. x86-64 and aarch64 are both little-endian AFAIK but not all architectures are.

This creates its own version of hell when trying to deal with cross-endianness captures, so we might at least try to avoid issues with plugins and specify e.g. that all numeric fields in events are little-endian.

Note: I honestly have no idea how IP addresses are stored in events: native byte order? network byte order?

Proposed solution

Sit down and think very hard about endianness across the whole stack.

Alternatives

Leave the API as is.

gnosek commented 7 months ago

cc @jasondellaluce @Andreagit97

poiana commented 4 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Andreagit97 commented 4 months ago

uhh, I missed that sorry! These are all valid points, i faced some of them while trying to implement a simple Rust SDK, not sure when/how we want to tackle them, but for sure this is something we need to keep in consideration in future plugin API reworks

Andreagit97 commented 4 months ago

/remove-lifecycle stale

Andreagit97 commented 4 months ago

I would also reconsider the multithread approach built by default inside the plugin API considering pros and cons:

gnosek commented 4 months ago

@Andreagit97 I was confused about the source plugin <-> instance relationship (it's always 1:1 it seems), so at a glance it feels like a threading rule being tl;dr "every object is always accessed from the thread that created it" would pretty much describe the current state.

Vaguely related, we could use some clarification around tables:

I have Ideas :tm: about extending the table support with a few methods that would let us factor just about anything inside sinsp into plugins but that's way out of scope here ;)

poiana commented 1 month ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 2 days ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten