elastic/ebpf

Implement variable-length ringbuf wire format #159

Closed: rhysre closed this 2 years ago

rhysre commented 2 years ago

This commit introduces variable-length fields into the ringbuf wire format.

Over time, the number of fields in our wire format with a large theoretical maximum size has grown (pathnames, argv, tty output, etc.).

Currently, every field in the wire format is fixed-size. This simplifies parsing of messages in userspace, but leads to huge memory waste, as the maximum possible buffer size needs to be allocated even when it's not used most of the time. For example, PATH_MAX on Linux is 4096, but the pathnames we pass up are going to be a fraction of that size 99% of the time. Terminal output is an even bigger culprit: a 4 KiB buffer is allocated for each tty_write event, even if the data being written to the tty is just one byte. For a high-frequency tty_write event, this can easily cause us to drop events.

This commit adds a few new structures that enable us to pass up these large fields in variable-length buffers. Fields that are always of a fixed size (or variable-length but so small as to be irrelevant from a memory-usage perspective, e.g. task->comm) remain as they were before, in a fixed-size part of the event. Variable-length fields are packed into a flexible array member at the end of the event struct, each preceded by a header describing the field's type and size.
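
As a rough illustration of that layout, here is a minimal sketch. The names ebpf_varlen_field, ebpf_varlen_fields_start, and vl_fields come from this PR; the individual struct members shown are assumptions for illustration, not the actual definitions:

    #include <stdint.h>

    /* Sketch: header preceding each variable-length field (member names
     * are illustrative, not the real definitions). */
    struct ebpf_varlen_field {
        uint32_t type; /* an EBPF_VL_FIELD_* discriminator */
        uint32_t size; /* number of bytes in data[] */
        char data[];   /* the variable-length payload itself */
    };

    /* Sketch: marks the start of the packed variable-length region. */
    struct ebpf_varlen_fields_start {
        uint32_t size; /* total bytes of packed fields that follow */
        struct ebpf_varlen_field fields[];
    };

    struct ebpf_file_delete_event {
        /* ... fixed-size fields (pids, task->comm, etc.) ... */
        struct ebpf_varlen_fields_start vl_fields; /* must be the last member */
    };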

An unfortunate necessity of this approach is that we cannot use the ringbuffer reserve/commit API for events that have variable-length fields, as we won't know the size of the event in advance. Instead, we build each event with its variable-length fields in a percpu map, and then output it using bpf_ringbuf_output. This should result in a small performance hit, as we're introducing an additional memory copy, but that should be more than made up for by the massive memory usage improvements. An alternative would have been to continue to use the reserve/commit API, passing up each event in fixed-size chunks, but this would just shift the extra copy to userspace and introduce a bunch of complexity there, as we'd need reconstruction logic in the Ebpf events library. In any case, there is no free lunch.
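
A minimal sketch of that scratch-buffer pattern follows; the map name, the buffer-size macro, and the body of get_event_buffer() are assumptions for illustration (the 2 MiB size matches the per-CPU buffer size discussed below):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define EVENT_BUFFER_SIZE (1 << 21) /* 2 MiB, per the discussion below */

    /* A single-slot per-CPU array used as scratch space for building
     * events whose final size isn't known up front. */
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __uint(value_size, EVENT_BUFFER_SIZE);
    } event_buffer SEC(".maps");

    static void *get_event_buffer(void)
    {
        __u32 zero = 0;
        return bpf_map_lookup_elem(&event_buffer, &zero);
    }

Once the event is fully assembled in the per-CPU buffer, a single bpf_ringbuf_output call copies exactly EVENT_SIZE(event) bytes into the ringbuffer; that copy is the small performance hit mentioned above.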

Logic and helpers for adding variable-length fields have been added to Helpers.h and are fairly well documented. New events that need variable-length fields should add an ebpf_varlen_fields_start struct named vl_fields as their last member. Variable-length fields can then be added to the event. For example, in the case of a file delete event:

    struct ebpf_file_delete_event *event = get_event_buffer();

    // ... Fill in fixed-length fields

    // Variable length fields
    ebpf_vl_fields__init(&event->vl_fields);
    struct ebpf_varlen_field *field;
    size_t size;

    // path
    ADD_VL_FIELD_OR_GOTO_EMIT(event, field, EBPF_VL_FIELD_PATH);
    size = ebpf_resolve_path_to_string(field->data, &p, task);
    ebpf_vl_field__set_size(&event->vl_fields, field, size);

emit:
    // Output the completed event (fixed part plus variable-length fields).
    bpf_ringbuf_output(&ringbuf, event, EVENT_SIZE(event), 0);

The emit label is required. The ADD_VL_FIELD_OR_GOTO_EMIT macro will jump to it if there is no more space remaining in the event buffer returned from get_event_buffer().
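
To illustrate that control flow, the macro presumably expands to something like the sketch below. This is not the real definition in Helpers.h, and the helper ebpf_vl_field__add is a hypothetical name:

    /* Hypothetical expansion, for illustration only. */
    #define ADD_VL_FIELD_OR_GOTO_EMIT(event, field, type)              \
        do {                                                           \
            field = ebpf_vl_field__add(&(event)->vl_fields, (type));   \
            if (field == NULL)                                         \
                goto emit; /* out of space in the event buffer */      \
        } while (0)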

The use of variable-length fields also allows us to get rid of a bit of cruft.

This commit also bumps the ringbuffer size to 4 MiB so that it's larger than the 2 MiB percpu event buffers.
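
With libbpf's BTF-style map definitions, that bump looks something like this (a sketch; the map name matches the usage in the example above, but the actual definition in the source may differ):

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 4 * 1024 * 1024); /* 4 MiB, per the bump above */
    } ringbuf SEC(".maps");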

stanek-michal commented 2 years ago

> Instead, we build each event with its variable-length fields in a percpu map, and then output it using bpf_ringbuf_output. This should result in a small performance hit, as we're introducing an additional memory copy

Any idea how we could measure the performance hit? On the eBPF side, can we use bpftool perf stats or something else to estimate it? I could perhaps test this end-to-end with Phoronix tests, but it might be that it won't show up end-to-end. And the performance hit would only be on the eBPF side, I suppose? Or userspace as well?

rhysre commented 2 years ago

> > Instead, we build each event with its variable-length fields in a percpu map, and then output it using bpf_ringbuf_output. This should result in a small performance hit, as we're introducing an additional memory copy
>
> Any idea how we could measure the performance hit? On the eBPF side, can we use bpftool perf stats or something else to estimate it? I could perhaps test this end-to-end with Phoronix tests, but it might be that it won't show up end-to-end.

If it's visible at all in benchmarks, you should be able to see it by creating an artificially high load in terms of the new memory copy (e.g. exec with a 20 kb argv in a tight loop for 30 seconds), graphing CPU usage over that period, and comparing to the old implementation. I didn't measure that because I'd be quite surprised if it was substantial, but I can get some numbers tomorrow.

Come to think of it, it would be nice to do the same thing and measure memory usage / dropped events as well, to verify this PR is doing its job (I've run a few informal tests on my laptop and it's shown a large decrease in dropped events under load, but I haven't saved the results or made them presentable).

Also, purely from personal experience, I can say that the Cmd probes used variable-length parameters for everything in their wire format, and I don't recall that ever leading to a performance issue (in BPF land or in userspace).

> And the performance hit would only be on the eBPF side, I suppose? Or userspace as well?

Theoretically, there should be a tiny perf hit in userspace from the fact that parsing the variable-length fields requires a bit more math than before (see FOR_EACH_VARLEN_FIELD), but that's undoubtedly inconsequential (one extra memory dereference and addition per variable-length parameter).
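
For illustration, the iteration FOR_EACH_VARLEN_FIELD implies is roughly the following, assuming the struct sketch given earlier and a pointer start to the event's ebpf_varlen_fields_start; handle_path is a hypothetical callback:

    /* Walk the packed variable-length region: read each header, then
     * advance past its payload (one extra dereference and addition
     * per field). */
    const char *end = (const char *)start->fields + start->size;
    struct ebpf_varlen_field *field = start->fields;

    while ((const char *)field < end) {
        switch (field->type) {
        case EBPF_VL_FIELD_PATH:
            handle_path(field->data, field->size); /* hypothetical callback */
            break;
        /* ... other field types ... */
        }
        field = (struct ebpf_varlen_field *)(field->data + field->size);
    }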

rhysre commented 2 years ago

OK, some quick and dirty performance numbers here.

In my Vagrant dev VM, I fired up EventsTrace as follows:

 sudo ./artifacts-x86_64/non-GPL/Events/EventsTrace/EventsTrace -a > /dev/null

Then fired up a loop spam-exec'ing ls with a very long argv:

 while true; do ls $(python3 -c 'print("a" * 10000)'); done

This results in both a huge number of execs with long argvs and a ton of data written to the terminal, so lots of exec and tty_write events with large variable-length fields. This should be about as stressful as you can get for a single-core test w.r.t. the newly added memory copy from bpf_ringbuf_output.

CPU time (both in-kernel and in-userspace) and memory usage were then measured for 60 seconds while the spam loop was running, with this script:

    echo 'time (s), %user, %sys, usedmem'

    time=0
    while true; do
        # system-wide %user and %sys from the summary line of top
        cpu=$(top -n 1 -b | head -3 | tail -1 | awk '{print $2 "," $4}')
        # used memory (in KiB) from free
        usedmem=$(free | head -2 | tail -1 | awk '{print $3}')
        echo $time,$cpu,$usedmem
        sleep 1
        time=$(echo $time+1 | bc)

        if [[ $time == "60" ]]; then
            break
        fi
    done

From a quick glance, it looks like time spent executing BPF programs in kernelspace isn't attributed to the process that loaded/attached those programs, hence the system-wide measurement with top instead of a per-process value. That should add some noise, but it means we do pick up the extra work done by the BPF program.

Graphs of results are as follows (old = fixed length, new = variable length):

[Chart: Variable-Length Fields vs. Fixed-Length: CPU Usage]

[Chart: Variable-Length Fields vs. Fixed-Length: Memory Usage]

CPU usage doesn't seem to be measurably different in kernelspace, which is where the extra copy is. It's hard to tell, but there might be a tiny bump in userspace CPU time from the extra parsing. In any case, the effect, if it exists, is very small.

Memory usage is ~15 MB higher, which makes sense. My VM has 5 cores, and there's an extra 2 MB of locked memory added per core from the per-CPU map (5 × 2 MB = 10 MB), plus another ~4 MB from the ringbuffer size bump, so roughly 14 MB total.

Given the caps I put on argv and tty_out (20 kb and 8 kb respectively), 2 MB for each percpu buffer is excessive, as it's impossible to hit that with our largest feasible event. Originally, I chose an arbitrary, fairly big number just so we didn't have to worry about overrunning it in the future, but I will drop it to 256 KiB, as that is still more than enough to accommodate our largest event and will reduce memory usage on systems with tons of cores.

stanek-michal commented 2 years ago

> OK, some quick and dirty performance numbers here.

Awesome, love the charts! Interesting that user CPU time seems to be a bit higher, but it's close enough that it could just be within the margin of error.

> Given the caps I put on argv and tty_out (20 kb and 8 kb respectively), 2 MB for each percpu buffer is excessive, as it's impossible to hit that with our largest feasible event. Originally, I chose an arbitrary, fairly big number just so we didn't have to worry about overrunning it in the future, but I will drop it to 256 KiB, as that is still more than enough to accommodate our largest event and will reduce memory usage on systems with tons of cores.

Cool. After all, EC2 has instances with over 440 cores (although those would almost certainly have much more RAM than 1 GB).

BTW, I just saw that on new kernels you can set kernel.bpf_stats_enabled=1 and get statistics like total runtime for each loaded eBPF program using bpftool.
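
For reference, that looks like:

 # enable BPF run-time statistics collection (kernel >= 5.1)
 sudo sysctl -w kernel.bpf_stats_enabled=1
 # with stats enabled, each program listed gains run_time_ns and run_cnt,
 # so run_time_ns / run_cnt gives the average cost per invocation
 sudo bpftool prog show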