cilium / tetragon

eBPF-based Security Observability and Runtime Enforcement
https://tetragon.io
Apache License 2.0
3.57k stars 352 forks source link

ProcessCache: Remove stale entries #2953

Open kevsecurity opened 1 week ago

kevsecurity commented 1 week ago

Add stale entry removal functionality to the process cache. This won't affect normal operations, but in the event that a custom ref count has been placed on an entry and not removed (for whatever reason, maybe logic error, maybe missed event, etc), remove entries where the process and all its children have exited at least 10 minutes ago. Record a metric when entries are removed.

netlify[bot] commented 1 week ago

Deploy Preview for tetragon ready!

Name Link
Latest commit c2255d9b2ce97b7e23506928bac8bf7c7b772a31
Latest deploy log https://app.netlify.com/sites/tetragon/deploys/66fbe1b9aec2e20008dfd2b9
Deploy Preview https://deploy-preview-2953--tetragon.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

mtardy commented 1 week ago

Record a metric when entries are removed.

Thanks, I was thinking about adding this. Thinking about this I also realize that it would make issues difficult to debug in the future because we'll struggle to see the actual events that were flushed, should we add an option to append those events to a (circular) debug file?

kkourt commented 1 week ago

Record a metric when entries are removed.

Thanks, I was thinking about adding this. Thinking about this I also realize that it would make issues difficult to debug in the future because we'll struggle to see the actual events that were flushed, should we add an option to append those events to a (circular) debug file?

I agree. If we are doing this, it would also be nice to have a switch to configure the parameters that right now are hardcoded.

mtardy commented 1 week ago

Record a metric when entries are removed.

Thanks, I was thinking about adding this. Thinking about this I also realize that it would make issues difficult to debug in the future because we'll struggle to see the actual events that were flushed, should we add an option to append those events to a (circular) debug file?

I agree. If we are doing this, it would also be nice to have a switch to configure the parameters that right now are hardcoded.

we had this discussion offline and maybe having a switch to disable this could be enough to investigate and try to reproduce the cases where we see the metric goes out of the roof.

it would also be nice to have a switch to configure the parameters that right now are hardcoded.

maybe you can just use the special GC interval value to 0 to disable it entirely so that you have a two in one option :D

kevsecurity commented 1 week ago

I agree. If we are doing this, it would also be nice to have a switch to configure the parameters that right now are hardcoded.

And also make the existing intervalGC value configurable? It's been hard-coded for ever.

kevsecurity commented 1 day ago

Converted to draft. We may not need the second commit.