javierhonduco opened 1 year ago
This is something we have discussed historically indeed. There was an issue opened in Tracee for moving away from the CGO polling logic and implementing it in Go (inside libbpfgo).
Maybe the time has come? @yanivagman FYI
Yes, indeed we discussed this in the past (https://github.com/aquasecurity/libbpfgo/issues/42). As I wrote in the other issue, cgo has improved in recent versions of Go and I didn't see any particular improvement from moving to pure Go. That said, I didn't put too much effort into it back then, and it may be a good idea to explore it again if you see a performance impact related to cgo.
FWIW we use Go 1.20
In the profiler we develop, we use perf buffers to communicate events to userspace. Among other things, we use this to get notified of new processes that we need to generate information for. We use the default poll timeout of 300ms. While we might conditionally use ring buffers in the future, we have to support perf buffers for older kernels (<5.8).
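For reference, the consumer side looks roughly like the following minimal sketch (not our actual code: the object file name, the "events" map name, the page count, and the event handling are placeholders, and program attachment is omitted):

```go
package main

import (
	"fmt"

	bpf "github.com/aquasecurity/libbpfgo"
)

func main() {
	// Placeholder object file; attaching the BPF programs is omitted here.
	m, err := bpf.NewModuleFromFile("profiler.bpf.o")
	if err != nil {
		panic(err)
	}
	defer m.Close()

	if err := m.BPFLoadObject(); err != nil {
		panic(err)
	}

	eventsCh := make(chan []byte, 1024) // raw samples from the kernel
	lostCh := make(chan uint64)         // lost-sample counts

	// "events" is a placeholder PERF_EVENT_ARRAY map name; 64 is the
	// per-CPU buffer size in pages.
	pb, err := m.InitPerfBuf("events", eventsCh, lostCh, 64)
	if err != nil {
		panic(err)
	}

	// Start() spawns a goroutine that repeatedly calls perf_buffer__poll()
	// through cgo with the default 300ms timeout mentioned above.
	pb.Start()
	defer pb.Stop()

	for {
		select {
		case data := <-eventsCh:
			fmt.Printf("got a %d-byte event (e.g. a new process to handle)\n", len(data))
		case lost := <-lostCh:
			fmt.Printf("lost %d events\n", lost)
		}
	}
}
```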
While analysing the performance of our profiler, we noticed that almost 26% of the CPU cycles are spent polling the buffers. It's well known that crossing the Go-C boundary is not cheap (thanks to Go for not following C's ABI!!), which is already documented in this TODO:
https://github.com/aquasecurity/libbpfgo/blob/0aa339608d1efcf54a67be598b07d81c5746f0f5/libbpfgo.go#L1897
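To make the crossing cost concrete, here is a small standalone comparison (not libbpfgo code; the numbers depend on the machine and Go version). Every perf_buffer__poll() call is a Go→C transition, and, if I read libbpfgo right, each delivered sample also comes back through a C→Go callback before it reaches the channel, so these fixed per-call costs add up under load:

```go
// A rough, self-contained sketch to illustrate the fixed cost of a cgo
// transition compared to a plain Go call. It is not a rigorous benchmark;
// use testing.B for real measurements.
package main

/*
static int c_noop(void) { return 0; }
*/
import "C"

import (
	"fmt"
	"time"
)

//go:noinline
func goNoop() int { return 0 }

func main() {
	const n = 1_000_000

	start := time.Now()
	for i := 0; i < n; i++ {
		_ = goNoop()
	}
	goPerCall := time.Since(start) / n

	start = time.Now()
	for i := 0; i < n; i++ {
		_ = C.c_noop() // each iteration pays the Go->C crossing cost
	}
	cgoPerCall := time.Since(start) / n

	fmt.Printf("pure Go call: %v/call\n", goPerCall)
	fmt.Printf("cgo call:     %v/call\n", cgoPerCall)
}
```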
Would be curious to know if this is something you've also experienced in Tracee. It would be fantastic to see if implementing this in Go would help here. I think it would!
In the meantime, I've opened https://github.com/aquasecurity/libbpfgo/pull/309 to make the poll timeout configurable. That's something we needed even if the overhead were lower, and it can also help folks who want to reduce overhead and are willing to accept a higher chance of lost events and higher latency.
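For completeness, a hedged sketch of what using that knob could look like on the consumer side. The Poll(timeoutMs) method below is an assumption modeled on libbpfgo's existing RingBuffer.Poll API, not necessarily the exact API the PR adds, so please check what actually lands:

```go
package profiler

import (
	bpf "github.com/aquasecurity/libbpfgo"
)

// startEvents sets up the (placeholder) "events" perf buffer and polls it
// with a caller-chosen timeout instead of the 300ms default. A larger
// timeout means fewer wakeups, and therefore fewer Go<->C crossings per
// second, in exchange for up to timeoutMs of extra delivery latency and a
// higher chance of lost samples if the per-CPU buffers fill up between polls.
func startEvents(m *bpf.Module, timeoutMs int) (*bpf.PerfBuffer, chan []byte, chan uint64, error) {
	eventsCh := make(chan []byte, 1024)
	lostCh := make(chan uint64)

	pb, err := m.InitPerfBuf("events", eventsCh, lostCh, 64)
	if err != nil {
		return nil, nil, nil, err
	}

	// Assumed API: Poll starts the polling goroutine with the given timeout,
	// mirroring RingBuffer.Poll; with the stock API, pb.Start() polls every
	// 300ms.
	pb.Poll(timeoutMs)
	return pb, eventsCh, lostCh, nil
}
```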