aquasecurity / libbpfgo

eBPF library for Go. Powered by libbpf.
Apache License 2.0

Replace CGO in the critical path #42

Open itaysk opened 3 years ago

itaysk commented 3 years ago

To improve performance, we can bypass libbpf and cgo in the critical path (the libbpf callback).

itaysk commented 3 years ago

from #80

It is well known that cgo performs poorly when calling C code, and even worse when C calls back into Go (see, for example, https://about.sourcegraph.com/go/gophercon-2018-adventures-in-cgo-performance/). This is not a problem for most libbpfgo use cases, where we only need to load and attach a program or update a map, since these operations are infrequent. It may become a problem, however, when polling for events coming from the kernel through a perf or ring buffer, as these are far more frequent.

Suggestion: implement buffer polling (for both perf and ring buffers) in pure Go. These functions can be added as an alternative API alongside the existing functions, offering high performance where needed. Specifically, the libbpf functions we would need to reimplement are C.perf_buffer__poll() and C.ring_buffer__poll() (both are called by the PerfBuffer/RingBuffer poll() function).

yanivagman commented 3 years ago

After some local tests, it seems that the bottleneck in tracee is not cgo but the printer. Below is a pprof output where I used a very noisy event (sched_switch) and the gob printer. The conclusion is the same for other workloads (e.g. the default event set on an idle system) and other printers (table/json): the printer path always costs more than the cgo path.

[image: cgo_vs_printer (pprof output)]

yanivagman commented 3 years ago

With table printer:

[image: cgo_vs_printer2]

itaysk commented 3 years ago

What does this mean for this issue? Isn't it still something we should do?

yanivagman commented 3 years ago

I still need to find a way to compare a prototype I have of a pure Go implementation with the current cgo implementation. With pprof I can only see where the bottleneck is, but I can't quantitatively compare the two implementations. Any suggestions for how to do that?
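One possible way to get numbers rather than a profile is a micro-benchmark built on the standard `testing` package. The sketch below is only an illustration: `consumeCopy` and `consumeInPlace` are hypothetical stand-ins for the two code paths being compared and are not part of libbpfgo, and `measure` is a helper invented for this example. Since `import "C"` is not allowed in `_test.go` files, the harness is driven from a regular program through `testing.Benchmark`.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from optimizing the measured calls away.
var sink int

// consumeCopy and consumeInPlace are hypothetical stand-ins for the two
// event-consuming paths being compared (for example, the existing cgo-based
// poll loop and a pure Go prototype). Neither is part of libbpfgo.
func consumeCopy(sample []byte) int {
	buf := make([]byte, len(sample)) // extra copy per event
	copy(buf, sample)
	return len(buf) + int(buf[0])
}

func consumeInPlace(sample []byte) int {
	return len(sample) + int(sample[0])
}

// measure runs fn under the standard benchmark harness, which picks the
// iteration count itself, and returns the collected result.
func measure(fn func([]byte) int) testing.BenchmarkResult {
	sample := make([]byte, 256)
	sample[0] = 1
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink += fn(sample)
		}
	})
}

func main() {
	a := measure(consumeCopy)
	b := measure(consumeInPlace)
	fmt.Printf("copy:     %d ns/op, %d allocs/op\n", a.NsPerOp(), a.AllocsPerOp())
	fmt.Printf("in place: %d ns/op, %d allocs/op\n", b.NsPerOp(), b.AllocsPerOp())
}
```

A micro-benchmark like this isolates the per-event cost of each path but says nothing about whole-process CPU usage under a real event load, so it would complement pprof rather than replace it.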

yanivagman commented 3 years ago

The performance of C-to-Go calls has improved in recent Go versions: https://github.com/golang/go/issues/42469#issuecomment-747741061

If there is no strong evidence that this is still an issue for libbpfgo, we can probably close this one for now.

simar7 commented 3 years ago

> The performance of C-to-Go calls has improved in recent Go versions: golang/go#42469 (comment)
>
> If there is no strong evidence that this is still an issue for libbpfgo, we can probably close this one for now.

It's also worth noting that we pass pointers around in our cgo code, and as of Go 1.17 this can be done much more safely and efficiently with cgo.Handle: https://pkg.go.dev/runtime/cgo#Handle

guyarb commented 2 years ago

Hey @yanivagman, I saw issue #80 and wondered what your plan was there. By implementing the polling in pure Go, do you mean just calling epoll_wait from Go instead of cgo? Or did you mean reimplementing more of the functionality inside perf_buffer_poll? In my project I'm dealing with performance issues that are caused mainly by the perf_buffer_poll cgo implementation, and I'm trying to find a solution.

As I understand it, unless we can somehow trigger the perf callback in pure Go, there is still going to be massive CPU consumption, since C-to-Go calls are the most expensive kind of call in cgo.

yanivagman commented 2 years ago

> Hey @yanivagman, I saw issue #80 and wondered what your plan was there. By implementing the polling in pure Go, do you mean just calling epoll_wait from Go instead of cgo? Or did you mean reimplementing more of the functionality inside perf_buffer_poll? In my project I'm dealing with performance issues that are caused mainly by the perf_buffer_poll cgo implementation, and I'm trying to find a solution.
>
> As I understand it, unless we can somehow trigger the perf callback in pure Go, there is still going to be massive CPU consumption, since C-to-Go calls are the most expensive kind of call in cgo.

Hi @guyarb. To implement the polling in pure Go, the following changes are required:

  1. Change the InitPerfBuf function to no longer use C.init_perf_buf. Instead, it should create a perf buffer for each CPU in the system (using perf_event_open()) and mmap() each one into memory. The new fds can then be added to epoll.
  2. Change the Start function to call a pure Go polling function. This function should wait for events on the epoll fd and, for each ring buffer that received new events, read the data from it.
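The epoll part of step 2 can be sketched as follows, using only the standard `syscall` package on Linux. This is an illustration under stated assumptions, not libbpfgo code: `pollOnce` and `demo` are hypothetical names, a pipe stands in for the per-CPU perf_event_open() fds of step 1, and data is fetched with read() where a real perf buffer consumer would parse records out of the mmap()ed ring.

```go
package main

import (
	"fmt"
	"syscall"
)

// pollOnce waits up to timeoutMs for readable fds registered on epfd and
// passes the bytes available on each ready fd to handle.
func pollOnce(epfd, timeoutMs int, handle func(fd int, data []byte)) (int, error) {
	events := make([]syscall.EpollEvent, 16)
	var n int
	var err error
	for {
		n, err = syscall.EpollWait(epfd, events, timeoutMs)
		if err != syscall.EINTR { // retry if interrupted by a signal
			break
		}
	}
	if err != nil {
		return 0, err
	}
	buf := make([]byte, 4096)
	for i := 0; i < n; i++ {
		fd := int(events[i].Fd)
		// Stand-in for reading records from the mmap()ed perf ring.
		m, err := syscall.Read(fd, buf)
		if err != nil {
			return i, err
		}
		handle(fd, buf[:m])
	}
	return n, nil
}

// demo registers one pipe with epoll, writes an "event" and polls for it.
func demo() (string, error) {
	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		return "", err
	}
	defer syscall.Close(epfd)

	var p [2]int // p[0] plays the role of a perf event fd
	if err := syscall.Pipe2(p[:], syscall.O_CLOEXEC); err != nil {
		return "", err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(p[0])}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		return "", err
	}
	if _, err := syscall.Write(p[1], []byte("event")); err != nil {
		return "", err
	}

	var got string
	_, err = pollOnce(epfd, 1000, func(fd int, data []byte) { got = string(data) })
	return got, err
}

func main() {
	fmt.Println(demo())
}
```

With this shape the whole wait-and-dispatch loop stays in Go, so no C-to-Go callback is involved per event; the remaining work is the ring parsing that the read() call only stands in for here.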

A reference implementation is Cilium's ebpf library (which is written in pure Go). When I played with this code and tried to measure differences using pprof, I didn't see major improvements, as described above. It might be that pprof is not the right tool for this task.

In your project, how did you find that the performance issues were caused mainly by the cgo perf_buffer_poll call?