falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0

eBPF vs. KMod Performance #267

Open Stringy opened 2 years ago

Stringy commented 2 years ago

As requested on the community call, this issue is for discussion around the relative performance of the eBPF and kernel module drivers. The attached document details some of my initial findings and the approach we're looking to take at RedHat to improve performance.

Falco eBPF & Kernel Module Performance.pdf

leogr commented 2 years ago

/kind documentation

@Stringy thank you :pray:

Andreagit97 commented 2 years ago

Hi @Stringy, first of all, thank you for this amazing work! Really impressive analysis! I just have a few questions about the optimizations you proposed in the document.

  1. As far as I know, the new tracing technologies (tp_btf) do not provide hooks for individual syscalls, only for the usual sys_enter/sys_exit. These new tp_btf programs are the evolution of the raw_tp programs we use today, which already don't allow instrumenting individual syscalls. I completely agree that instrumenting only the specific syscalls brings a good benefit, since the kernel does less work, but unfortunately this approach doesn't seem compatible with the new technologies. Obviously this is just my opinion, and it may not be the direction the bpf world is going in. I don't know whether you have already considered this aspect or addressed it in some way.

  2. This second point is strictly related to the first one. In my opinion, the huge optimization you obtained, for example, with the write syscall, comes from two main factors:

    1. The substantial decrease in the number of syscalls analyzed and, therefore, in the number of events generated.
    2. The use of per-syscall tracepoints to catch only the specific syscalls, instead of the sys_enter/sys_exit hooks. With this approach, the kernel doesn't wake up at all for uninteresting syscalls.

      It would be interesting to understand which of the two factors provides the greater benefit. I think it mainly comes from the significant reduction in the number of traced syscalls, but I would like to understand how costly it is to call a useless bpf program for every syscall we don't care about.

  3. The kernel simple consumer approach tries to significantly reduce the number of evaluated syscalls. The idea is similar to the one you proposed; the only differences are (a sketch of this run-time filtering style follows right after this list):

    1. The simple consumer mode captures all the syscalls that are useful for tracking the state of the system, not only the 17 syscalls in your case.
    2. We use raw_tp programs on the generic sys_enter/sys_exit hooks instead of specific ones like tracepoint/syscalls/sys_enter_write.
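
To make the comparison concrete, here is a minimal sketch of that run-time filtering style (not the actual simple consumer implementation; the program and map names are made up): a generic raw_tp hook on sys_enter that looks up the syscall id in a map of interesting syscalls and bails out early for everything else.

#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 512);
    __type(key, __u32);
    __type(value, __u8);   /* non-zero means "trace this syscall" */
} interesting_syscalls SEC(".maps");

SEC("raw_tp/sys_enter")
int BPF_PROG(simple_consumer_enter, struct pt_regs *regs, long id)
{
    __u32 key = id;
    __u8 *enabled = bpf_map_lookup_elem(&interesting_syscalls, &key);

    /* Every syscall still wakes this program up; uninteresting ones just
     * return immediately after the lookup. */
    if (!enabled || !*enabled)
        return 0;

    /* ... build and send the event ... */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";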

An interesting reflection arises from these 3 points... If we push the simple consumer to the limit, filtering out all syscalls except the 17 you selected while keeping the current instrumentation on sys_enter and sys_exit, how does the performance behave? This kind of test could answer the question in point 2, and it could be a turning point! If the numbers are comparable, we could refine the simple consumer logic to instrument only the requested syscalls without changing our code too much. Have you already tried the simple consumer approach in a similar scenario?

Not to mention that the simple consumer filtering logic (currently implemented with flags checked at run-time) could easily be moved to load-time by directly modifying the tail table! To be clearer, we could obtain something like this in our bpf programs:

SEC("raw_tp/sys_enter")
int catch_syscall_enter_event(struct pt_regs *regs, long id)
{
    /* Get the syscall id from the context. */
    uint32_t syscall_id = id;

    /* Check if we are on 32 bit architecture. */
    if(check_ia32()){ return 0; }

    /* Call the event-specific program in tail call (IF PRESENT, OTHERWISE TERMINATE HERE). */
    bpf_tail_call(ctx, &syscall_enter_tail_table, syscall_id);
}

The filtering job is done by userspace at load time, by changing the tail table contents! A similar approach should considerably decrease the overhead of hooking sys_enter and sys_exit, assuming that overhead is not already huge. If it is, we should instead consider a solution with specific tracepoints, as you suggested.
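
For completeness, the userspace side of this load-time filtering could look something like the sketch below (assuming libbpf and the hypothetical syscall_enter_tail_table and per-syscall program names from the snippet above): only the syscalls we want get an entry in the tail table, so every other syscall falls through the bpf_tail_call and terminates immediately.

#include <linux/types.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Register the program `prog_name` as the handler for `syscall_id`.
 * Syscalls without an entry simply miss the tail call in the kernel. */
static int enable_syscall(struct bpf_object *obj, __u32 syscall_id, const char *prog_name)
{
    struct bpf_map *tail_table = bpf_object__find_map_by_name(obj, "syscall_enter_tail_table");
    struct bpf_program *prog = bpf_object__find_program_by_name(obj, prog_name);
    int prog_fd;

    if (!tail_table || !prog)
        return -1;

    prog_fd = bpf_program__fd(prog);
    return bpf_map_update_elem(bpf_map__fd(tail_table), &syscall_id, &prog_fd, BPF_ANY);
}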

These are just some thoughts; if anyone else has any ideas on this topic, I would be very happy to hear them! Anyway, thank you for your time @Stringy, and again, amazing work!

Stringy commented 2 years ago

Hey @Andreagit97 thanks for the questions/comments! I've been doing some additional testing today to try to answer some of them.

  1. I must admit I'm not as familiar with the eBPF ecosystem as I'd like to be, and I certainly didn't realise that these kinds of specific syscall tracepoints may not be supported going forward. Do you have any resources about this? I'm super interested to learn more.

  2. This is what I've been measuring today to try to answer this question. Initial measurements have shown that a useless program (one that just returns) has a very small (negligible) performance impact. This was somewhat surprising to me, since I had believed that the execution of the BPF program itself caused most of the overhead, which is why I went down the route of ensuring the kernel didn't execute one on every syscall.

However, I've also been measuring with a full sys_enter/sys_exit driver that filters out unwanted syscalls as early as possible (as in your simple consumer example, it filters just after the ia32 check). So far this has shown a very similar overhead to my previous measurements in the document above. I'm gathering more data around this to make sure I have a decent picture of what's going on.

  3. I've not tried the simple consumer approach yet, but I will definitely do so. It'll be really interesting to see how the performance compares.

What drew me to the optimizations outlined in the doc is avoiding as much unneeded execution as possible; with the sys_enter/sys_exit approach, we always have to run something to determine whether to continue processing a syscall. Having said that, if the approach is untenable given changes in the BPF ecosystem and we can achieve similar performance gains without needing specific tracepoints, then I'm absolutely all ears.

As I said, I've been measuring a lot today, so I'll share some of the data/graphs here once I've got a good picture :)

Andreagit97 commented 2 years ago

Hi @Stringy :hand: Thank you for your fast answer!

Unfortunately, there isn't much documentation on the tracepoint topic... I will try to summarize some concepts here:

Tracing programs.

As far as I know, there are three main kinds of bpf tracing programs today: tp (BPF_PROG_TYPE_TRACEPOINT), raw_tp (BPF_PROG_TYPE_RAW_TRACEPOINT), and tp_btf (BPF_PROG_TYPE_TRACING).

From now on, I will use the section name instead of the full program type name to be less verbose and hopefully clearer. Before kernel version 4.17, we used tp programs in our probe, attaching them directly to the sys_enter and sys_exit hooks. We didn't use specific hooks like sys_enter_write, even though it was possible. The annoying thing is that, as you can see from the code, we have to save the syscall arguments in a map in the sys_enter phase and then retrieve them in the sys_exit phase, because tp programs cannot get the syscall parameters in the exit event, only in the enter one (a minimal sketch of this pattern follows below). Anyway, I suppose this is the kind of bpf program you used in your test, am I wrong?
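
Just to make the pattern explicit, here is a minimal sketch (not the actual probe code; map size and names are made up) of the save-at-enter / retrieve-at-exit trick for classic tp programs. The context struct layouts follow the raw_syscalls sys_enter/sys_exit trace event formats.

#include <linux/types.h>
#include <bpf/bpf_helpers.h>

/* raw_syscalls/sys_enter and sys_exit trace event layouts:
 * an 8-byte common header (not readable from tp programs), then the fields. */
struct sys_enter_ctx {
    __u64 pad;
    __s64 id;
    __u64 args[6];
};

struct sys_exit_ctx {
    __u64 pad;
    __s64 id;
    __s64 ret;
};

struct enter_state {
    __s64 id;
    __u64 args[6];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 32768);
    __type(key, __u64);
    __type(value, struct enter_state);
} saved_args SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int tp_sys_enter(struct sys_enter_ctx *ctx)
{
    __u64 tid = bpf_get_current_pid_tgid();
    struct enter_state state = {};

    /* Save the syscall id and arguments so the exit hook can use them. */
    state.id = ctx->id;
#pragma unroll
    for (int i = 0; i < 6; i++)
        state.args[i] = ctx->args[i];

    bpf_map_update_elem(&saved_args, &tid, &state, BPF_ANY);
    return 0;
}

SEC("tracepoint/raw_syscalls/sys_exit")
int tp_sys_exit(struct sys_exit_ctx *ctx)
{
    __u64 tid = bpf_get_current_pid_tgid();
    struct enter_state *state = bpf_map_lookup_elem(&saved_args, &tid);

    if (state) {
        /* Both the enter-time arguments (state->args) and the return
         * value (ctx->ret) are available here to build the event. */
        bpf_map_delete_elem(&saved_args, &tid);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";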

We overcame this limitation with raw_tp starting from kernel 4.17. We still hook sys_enter and sys_exit, but here there is a difference: we can no longer hook specific syscall tracepoints like sys_enter_write.

With the birth of new concepts like BTF, the BPF ecosystem provides a new program type: BPF_PROG_TYPE_TRACING. These are a sort of BTF-enabled evolution of raw_tp, as you can read in this patch. This kind of program offers two main advantages:

  1. Some helpers introduced in the most recent kernel versions can be used only by these programs and not by the old BPF_PROG_TYPE_RAW_TRACEPOINT ones.

  2. The second and more interesting advantage is the ability to read kernel memory directly, without helpers like bpf_probe_read(). The BPF verifier, which now understands and tracks BTF types natively, allows us to follow kernel pointers directly. You can see an example of these programs here: https://nakryiko.com/posts/bpf-core-reference-guide/#btf-enabled-bpf-program-types-with-direct-memory-reads.

Direct kernel memory access is really interesting in our specific case, since we extract a lot of information from the kernel in our probe. Unfortunately, like the old raw_tp, this type of program cannot be attached to specific syscall hooks like sys_enter_write or sys_exit_write. I didn't find any resources on this topic, so what I did was explore the libbpf code a little to understand where this new kind of BPF_PROG_TYPE_TRACING program can be attached. The hook points that I found are listed in the file here: hooks.txt
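
To give an idea of what that direct access looks like, here is a tiny sketch (assuming a vmlinux.h generated with bpftool; this is not taken from our probe) of a tp_btf program that follows task_struct pointers without bpf_probe_read():

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("tp_btf/sys_enter")
int BPF_PROG(show_direct_reads, struct pt_regs *regs, long id)
{
    struct task_struct *task = bpf_get_current_task_btf();

    /* Direct pointer chasing: the verifier tracks the BTF type of `task`,
     * so no bpf_probe_read() helper is needed to follow real_parent. */
    int parent_tgid = task->real_parent->tgid;

    bpf_printk("syscall %ld, parent tgid %d", id, parent_tgid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";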

To wrap up this brief discussion: I completely agree with you. I would like to avoid calling useless bpf programs and losing performance, but I also wouldn't want to give up these really interesting new features.

My hope was to minimize the overhead of the unnecessary calls by terminating the bpf program immediately. Reading your second point, I was very happy to hear that a minimal program adds just a little overhead. What is not clear to me is why, with a program like the one below, the overhead shows up again :tired_face: (this is the same snippet as in my previous comment).

SEC("raw_tp/sys_enter")
int catch_syscall_enter_event(struct pt_regs *regs, long id)
{
    /* Get the syscall id from the context. */
    uint32_t syscall_id = id;

    /* Check if we are on 32 bit architecture. */
    if(check_ia32()){ return 0; }

    /* Call the event-specific program in tail call (IF PRESENT, OTHERWISE TERMINATE HERE). */
    bpf_tail_call(ctx, &syscall_enter_tail_table, syscall_id);
}

Could the tail call be an expensive operation? If yes, we could further reduce the cost with a solution like this:


bool valid[SYSCALLS_NUMBER];

SEC("raw_tp/sys_enter")
int BPF_PROG(catch_syscall_enter_event, struct pt_regs *regs, long id)
{
    /* Get the syscall id from the context. */
    uint32_t syscall_id = id;

    /* Bound the index so the verifier accepts the array access. */
    if (syscall_id >= SYSCALLS_NUMBER)
    {
        return 0;
    }

    /* `valid` is a global variable, which can be accessed directly without any bpf helpers. */
    if (!valid[syscall_id])
    {
        return 0;
    }

    /* ... */
}

We cannot do much more than that :disappointed: Anyway, I hope that your further investigations will clarify the right way to go, and for that I thank you a lot! Using the simple consumer with only 17 syscalls should definitely let us understand the sys_enter and sys_exit penalty. If the overhead turns out not to be negligible, we could consider other trade-offs, or maybe there are some unusual ways to attach these programs to those tracepoints as well; I will investigate that further :smiley:

Stringy commented 2 years ago

Thanks for the detail @Andreagit97, I really appreciate the insight :)

I've completed some measurements this morning, and I can share some graphs and code snippets. It wasn't quite what I expected, but it's interesting nonetheless. All graphs below include baseline and kernel module measurements as outlined in my document; the kernel module is the unmodified Falco driver and is included for comparison. All the graphs are also just for the write syscall, because the number of data points I can collect is very high while being less dependent on I/O than read.
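
As a rough illustration of the kind of single-syscall micro-benchmark involved (a hypothetical sketch, not the exact harness described in the document), a loop like the one below hammers write() on /dev/null and reports the mean wall-clock cost per call:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long iterations = 10 * 1000 * 1000;
    char byte = 0;
    struct timespec start, end;

    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        write(fd, &byte, 1);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("avg write() cost: %.1f ns\n", elapsed_ns / iterations);

    close(fd);
    return 0;
}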

First I looked at a useless bpf driver that simply returned zero, having done nothing with any of the data. The overhead of this is significantly reduced as we would expect:

(graph: ebpf_no_work)

Next I looked at modifying the existing Falco driver in a fairly hacky way to filter out everything that wasn't a syscall that I was interested in:

BPF_PROBE("raw_syscalls/", sys_enter, sys_enter_args)
{
    if (bpf_in_ia32_syscall())
        return 0;

    id = bpf_syscall_get_nr(ctx);
    if (id < 0 || id >= SYSCALL_TABLE_SIZE)
        return 0;

    if (id != 43 && id != 288 && id != 81 && id != 80 && id != 3 && id != 56 && id != 42
        && id != 59 && id != 57 && id != 58 && id != 117 && id != 119 && id != 105 && id != 106
        && id != 48 && id != 41 
    ) {
        return 0;
    }

        // ... etc
}

This yielded a surprisingly high overhead (graph: write-epbf-early-filter-round3).

So then, to simplify things a little, I tried filtering out just reads and writes:

    if (bpf_in_ia32_syscall())
        return 0;

    id = bpf_syscall_get_nr(ctx);
    if (id < 0 || id >= SYSCALL_TABLE_SIZE)
        return 0;

    if (id < 2) return 0;   /* read is 0 and write is 1 on x86_64 */

Unfortunately, that produced a similar overhead (graph: write-early-ignore-rw).

I suppose the question at this point is: does this overhead matter enough to try to fix it? In our (RedHat ACS's) case I think the answer is probably yes, because on a busy enough machine the cumulative overhead is significant for what we're trying to capture, and we can build a custom front end to the driver that uses BPF_PROG_TYPE_TRACEPOINT programs and then tail calls into the fillers (a rough sketch follows below).

But perhaps more generally that's not the case, given the intention to use modern BTF tracepoints, and that the impact is at least somewhat understood.
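
For the record, here is a rough sketch of what such a front end could look like (illustrative only, not the actual ACS code; filler_table and FILLER_IDX_SYS_WRITE are made-up names): a per-syscall BPF_PROG_TYPE_TRACEPOINT entry point that immediately tail calls into the filler table, so the kernel-side machinery only runs for syscalls that have a front-end program attached. Tail calls require the fillers to be the same program type, i.e. also BPF_PROG_TYPE_TRACEPOINT.

#include <linux/types.h>
#include <bpf/bpf_helpers.h>

#define FILLER_IDX_SYS_WRITE 1   /* hypothetical index into the filler table */

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 512);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} filler_table SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_write")
int front_end_sys_enter_write(void *ctx)
{
    /* The kernel only calls this program for write(), so there is nothing
     * to filter: jump straight to the filler that builds the event. */
    bpf_tail_call(ctx, &filler_table, FILLER_IDX_SYS_WRITE);

    /* Reached only if no filler is registered at that index. */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";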

Andreagit97 commented 2 years ago

Hi @Stringy, thank you for all the data you collected; it will be fundamental to our future decisions. The only thing that seems really strange to me is that instrumenting the kernel with empty BPF programs causes almost no overhead, while minimal programs introduce a significant overhead... This could be due to the fact that this minimal bpf program calls two bpf helpers:

BPF_PROBE("raw_syscalls/", sys_enter, sys_enter_args)
{
    if (bpf_in_ia32_syscall())
        return 0;

    id = bpf_syscall_get_nr(ctx);
    if (id < 0 || id >= SYSCALL_TABLE_SIZE)
        return 0;

    if (id != 43 && id != 288 && id != 81 && id != 80 && id != 3 && id != 56 && id != 42
        && id != 59 && id != 57 && id != 58 && id != 117 && id != 119 && id != 105 && id != 106
        && id != 48 && id != 41 
    ) {
        return 0;
    }

}

I don't think a single if statement could produce all this overhead, so I suppose the real cause is the use of the bpf helpers. Anyway, I see two solutions for this issue:

  1. In the next few days, I will propose a design for a new BPF probe based on all the modern tracing technologies. With features like bpf global variables, we could reduce this initial overhead by removing all the bpf helpers, obtaining something like this:

    bool valid[SYSCALLS_NUMBER];

    SEC("tp_btf/sys_enter")
    int BPF_PROG(catch_syscall_enter_event, struct pt_regs *regs, long id)
    {
        /* `valid` is a global variable, which can be accessed directly without
         * any bpf helpers. It is an array of bool that marks whether the specific
         * syscall is of interest. The bounds check also keeps the verifier happy. */
        if (id < 0 || id >= SYSCALLS_NUMBER || !valid[id])
        {
            return 0;
        }

        /* ... */
    }

    This is the only way to reduce the overhead with modern technologies that do not allow hooking specific syscall tracepoints like sys_enter_write, or at least it's the only solution I see right now... Anyway, if the overhead is really due to the bpf helpers, the new technologies will significantly decrease the number of times they are called, so that is at least good news.

  2. For our current probe, I completely agree with you: we need some way to address this initial overhead, and a solution like the one you proposed seems like a great approach, in my opinion. We could selectively attach our bpf programs according to our use cases (see the sketch below).
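
    For instance, with libbpf something like the following sketch (hypothetical function name, not our current loader code) would skip loading, and therefore attaching, the programs a given use case doesn't need:

    #include <stdbool.h>
    #include <string.h>
    #include <bpf/libbpf.h>

    /* Mark only the programs whose names appear in `wanted` for loading;
     * everything else is never loaded nor attached, so it costs nothing. */
    static int load_selected(struct bpf_object *obj, const char **wanted, int n)
    {
        struct bpf_program *prog;

        bpf_object__for_each_program(prog, obj) {
            bool keep = false;

            for (int i = 0; i < n; i++) {
                if (strcmp(bpf_program__name(prog), wanted[i]) == 0)
                    keep = true;
            }
            bpf_program__set_autoload(prog, keep);
        }
        return bpf_object__load(obj);
    }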

This is a really hot topic @Stringy, thank you for highlighting it! :rocket:

FedeDP commented 2 years ago

Hi @Stringy! Thanks for these benchmarks! It would be interesting to understand the impact of the 2 helper calls, bpf_in_ia32_syscall and bpf_syscall_get_nr.
Can we try the same benchmark with the former (the ia32-related one) removed? That way we can better understand whether these calls have any impact at all, or whether it is just the branch predictor slowing things down.

Thank you very much!

First I looked at a useless bpf driver that simply returned zero, having done nothing with any of the data. The overhead of this is significantly reduced as we would expect:

Interesting that eBPF instrumentation alone has quite a bit of impact even when doing nothing.

Andreagit97 commented 2 years ago

I agree with @FedeDP: if we can understand the cost of the bpf helpers, we can find a way to avoid them in both the current and the new probe, removing this noisy initial overhead :disappointed:

Stringy commented 2 years ago

hey all, for those interested I merged our solution to this performance problem yesterday afternoon: RedHat ACS Collector Probe

The performance numbers are looking very good for our use case, but we're still very interested in @Andreagit97's excellent ongoing work on the modern eBPF probe!

Andreagit97 commented 2 years ago

hey all, for those interested I merged our solution to this performance problem yesterday afternoon: RedHat ACS Collector Probe

Hi @Stringy, congratulations! :partying_face: :tada: :tada: I'm really interested in it, and I will take a look ASAP.

The performance numbers are looking very good for our use case, but we're still very interested in @Andreagit97's excellent ongoing work on the modern eBPF probe!

That's really good news! Let's see what we are able to achieve with the modern probe :eyes: :crossed_fingers:

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

FedeDP commented 1 year ago

/remove-lifecycle stale


leogr commented 7 months ago

cc @falcosecurity/falco-website-maintainers we should think about how to document this
