QnnOkabayashi / tracing-forest

65 stars 13 forks source link

Long running span detection / warning #6

Open Firstyear opened 2 years ago

Firstyear commented 2 years ago

In non-release builds, if a span is held open for more than 5 seconds, it should log a warning. You should be able to check this by setting a SystemTime at the time the span is opened, and the next time a write occurs to the span, checking that time difference.

This way you can detect and warn about potential span-leaks for unbounded memory growth (think socket listener).

QnnOkabayashi commented 2 years ago

Can you expand on what scenarios this could occur in (i.e. socket listeners)? I want to better understand why this would ever happen.

I'm thinking this is more application specific. However, it could be implemented as an opt-in feature using a span's fields. For example, we could have something like

let span = info_span!(warn_after = "5s", "socket-listener");
Firstyear commented 2 years ago

So consider some code like this:

https://github.com/kanidm/kanidm/blob/master/kanidmd/score/src/ldaps.rs#L142

If we attached a span to this tcp acceptor, because the infinite loop exists it will never close the span and so it will continue to build a tree up that won't be flushed till server shutdown.

Similar for a connected client, https://github.com/kanidm/kanidm/blob/master/kanidmd/score/src/ldaps.rs#L47 this will exist for as long as a client is connected, which could be a very long time depending on various factors. I've seen operationally, ldap clients connect and remain connected for hours at a time and issue thousands if not millions of operations. (IE mail routers).

So there are a few issues here:

Some ways to potentially go about this:

What do you think?

QnnOkabayashi commented 2 years ago
  • If the span has existed for more than 60seconds, then log a crit level warning immediately. You can detect this by checking a time window between "time of span creatation" and "time of log message being written". Could be a debug build only thing. And also log the warning every time it's detected, be noisy as heck :)

If I'm understanding this correctly, you would need to be very careful that the generated warning log isn't triggering another warning to be emitted to avoid recursion.

Also, it seems that this behavior is more Kanidm specific, and should be implemented as it's own Layer.