aquasecurity / tracee

Linux Runtime Security and Forensics using eBPF
https://aquasecurity.github.io/tracee/latest
Apache License 2.0

Add metrics for event travel time in the pipeline #3247

Open yanivagman opened 1 year ago

yanivagman commented 1 year ago

We should be able to measure how long an event takes to travel through the pipeline, from the decode stage to the sink stage. The metric should be reported per event type, so we can tell which events spend the most time in the pipeline.

itaysk commented 1 year ago

We had https://github.com/aquasecurity/tracee/issues/887 to discuss this. The issue is closed, but I don't think we've fully implemented what it suggests (did we?).

yanivagman commented 1 year ago

> We had #887 to discuss this. The issue is closed, but I don't think we've fully implemented what it suggests (did we?).

That issue was about exporting the data to Prometheus, which was indeed implemented. Most of the metrics you mentioned there are already implemented as well. This issue is about adding a new metric for event travel time in the pipeline, so I opened a new issue for it.

itaysk commented 1 year ago

I linked there because 1) it sounds similar to the tracee_detections_latency_seconds metric (which I'm not sure we implemented) and 2) we should consider making it a histogram, as suggested in that issue.

(Side note: it would be nice to document which metrics we expose in https://aquasecurity.github.io/tracee/v0.15/docs/integrating/prometheus/)

yanivagman commented 1 year ago

> I linked there because 1) it sounds similar to the tracee_detections_latency_seconds metric (which I'm not sure we implemented) and 2) we should consider making it a histogram, as suggested in that issue.

  1. We didn't implement the detection latency metric. The original issue was not about which metrics we should expose, but about how we expose them (which was indeed implemented, so the issue could be closed). New metric implementations deserve their own issues, as done here.
  2. Event travel time in the pipeline is more generic than detection latency, since it applies to any Tracee event and not only to detections.
  3. A histogram can be a good option. We should probably have one histogram per event type rather than a single global histogram for all events.