confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io

Add instrumentation logging to KSQL #4301

Open big-andy-coates opened 4 years ago

big-andy-coates commented 4 years ago

In the past I've added instrumentation logging to applications I've had to monitor. This can be invaluable for ops teams and during incident investigations.

Instrumentation logs (my name for them; not sure if there's a standard one) are computer-friendly structured logs that capture information that is useful either when building dashboards or when investigating incidents, i.e. it's about logging information about interesting events. Some examples in KSQL would be entity lifecycle events, e.g. a query being started or stopped.

The set of events can be built up over time. New investigations may highlight missing events, or missing data on existing events; tasks to add them can be raised as part of the investigation.

The instrumentation logs are written so that they are easy for log monitoring software to parse, e.g. Splunk, the Kibana/Elastic stack, Logscape, etc.

I've previously written libraries to help build such log lines programmatically, but we can start with just manual log lines if we want. Whichever approach we use, it should be super easy to add new instrumentation or extend existing events.

Each event should have a set of data associated with it. Some of this is pretty standard, e.g. the timestamp, the log level, etc., but we can add more. The different lifecycle events for the same entity will likely share some common fields and have some that are event-specific.

An example log line may look something like:

2020-01-12T23:01:11 - SomeThreadId - entityType: Query, entity: CTAS_Blah_blah, event: started, took: 123 ms, sql: <<<CREATE TABLE AS SELECT * FROM BLAH;>>>

It's best to keep anything free-form at the end of the message, e.g. the SQL, and wrap it in some kind of delimiter to make it easier to parse.
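For illustration, here's a minimal sketch of what a helper for building such lines might look like. All class and method names here are hypothetical; nothing like this exists in KSQL today:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper for building instrumentation log lines. Structured
// key/value fields come first; a single free-form field (e.g. SQL) goes
// last, wrapped in <<< >>> delimiters so parsers can extract it reliably.
public final class InstrumentationEvent {

  private static final Logger LOG = LoggerFactory.getLogger("instrumentation");

  // LinkedHashMap preserves insertion order, so fields appear in a stable order.
  private final Map<String, String> fields = new LinkedHashMap<>();
  private String freeForm = null;

  public static InstrumentationEvent of(
      final String entityType, final String entity, final String event) {
    final InstrumentationEvent e = new InstrumentationEvent();
    e.fields.put("entityType", entityType);
    e.fields.put("entity", entity);
    e.fields.put("event", event);
    return e;
  }

  public InstrumentationEvent with(final String key, final Object value) {
    fields.put(key, String.valueOf(value));
    return this;
  }

  public InstrumentationEvent withFreeForm(final String key, final String value) {
    freeForm = key + ": <<<" + value + ">>>";
    return this;
  }

  public void log() {
    final StringBuilder sb = new StringBuilder();
    fields.forEach((k, v) -> sb.append(k).append(": ").append(v).append(", "));
    if (freeForm != null) {
      sb.append(freeForm);
    } else if (sb.length() >= 2) {
      sb.setLength(sb.length() - 2); // drop the trailing ", "
    }
    // The timestamp, thread, and level come from the logging framework's layout.
    LOG.info(sb.toString());
  }
}
```

The example line above would then be produced by something like:

```java
InstrumentationEvent.of("Query", "CTAS_Blah_blah", "started")
    .with("took", "123 ms")
    .withFreeForm("sql", "CREATE TABLE AS SELECT * FROM BLAH;")
    .log();
```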

We could even look to add instrumentation that logs the number of messages each stream thread / query is processing, or the size of the data being processed. Potentially, this could be done at multiple points in the topology. Sure, there is an overhead to this, so it should definitely be opt-in, but it can be very powerful.
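As a rough sketch of what the opt-in throughput instrumentation could look like (the flag, names, and interval are all made up, not an existing KSQL config):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical opt-in throughput instrumentation: per-query message counters,
// flushed to the instrumentation log on a fixed interval.
public final class ThroughputInstrumentation {

  private static final Logger LOG =
      LoggerFactory.getLogger("instrumentation.throughput");

  private final boolean enabled;
  private final Map<String, LongAdder> countsByQuery = new ConcurrentHashMap<>();

  public ThroughputInstrumentation(final boolean enabled, final long intervalSeconds) {
    this.enabled = enabled;
    if (!enabled) {
      return; // opt-in: nothing is scheduled and onMessage() is a no-op
    }
    final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        this::flush, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  // Called from the hot path; LongAdder keeps contention low.
  public void onMessage(final String queryId) {
    if (!enabled) {
      return;
    }
    countsByQuery.computeIfAbsent(queryId, id -> new LongAdder()).increment();
  }

  private void flush() {
    countsByQuery.forEach((queryId, count) ->
        LOG.info("entityType: Query, entity: {}, event: throughput, messages: {}",
            queryId, count.sumThenReset()));
  }
}
```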

Users can pull this info into Splunk / Kibana / etc. and build powerful dashboards, e.g. showing the number of queries running, their SQL, their throughput, lag, etc. Companies running many clusters can have high-level dashboards showing overall metrics: number of clusters, cluster sizes, cluster lifecycle events, and so on.

We could even have example Kibana dashboards people can start from.

I've normally written instrumentation logs to a separate log file, to keep them away from the noise of the application logs, which can be full of junk.
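Routing to a separate file is just logging configuration. Assuming a Log4j 1.x properties setup (the appender name, file path, and `ksql.log.dir` property below are made up for illustration), it might look like:

```properties
# Send the 'instrumentation' logger hierarchy to a dedicated file, and stop
# it propagating to the root logger / main application log.
log4j.logger.instrumentation=INFO, instrumentationFile
log4j.additivity.instrumentation=false

log4j.appender.instrumentationFile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.instrumentationFile.File=${ksql.log.dir}/ksql-instrumentation.log
log4j.appender.instrumentationFile.layout=org.apache.log4j.PatternLayout
# timestamp - thread - message, matching the example line above
log4j.appender.instrumentationFile.layout.ConversionPattern=%d{ISO8601} - %t - %m%n
```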

I found this so useful at past companies that I even blogged about it: https://datalorax.wordpress.com/2014/06/20/the-truth-about-your-logs/ and https://datalorax.wordpress.com/2014/06/26/more-bam-for-your-buck/ ;)

rodesai commented 4 years ago

@big-andy-coates the structured logging stuff we added for the processing logger should come in handy here: https://github.com/confluentinc/common/pull/173