confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io

Add instrumentation logging to KSQL #4301

Open big-andy-coates opened 4 years ago

big-andy-coates commented 4 years ago

In the past I've added instrumentation logging to applications I've had to monitor. This can be invaluable for ops teams and during incident investigations.

Instrumentation logs (my name for them; not sure if there's a standard one) are computer-friendly structured logs that capture information that is useful either when building dashboards or when investigating incidents, i.e. it's about logging information about interesting events. Some examples in KSQL would be entity lifecycle events, e.g. a query being started or stopped.

The set of events can be built up over time. New investigations may highlight missing events, or missing data on existing events; tasks to add them can be raised as part of the investigation.

The instrumentation logs are written so that they are easy for log monitoring software to parse, e.g. Splunk, the Kibana/Elastic stack, Logscape, etc.

I've previously written libraries to help build such log lines programmatically, but we can start with just manual log lines if we want. Whichever approach we use, it should be super easy to add new instrumentation or extend existing events.

Each event should have a set of data associated with it. Some of this is pretty standard, e.g. the timestamp, the log level, etc., but we can add more. The different lifecycle events for the same entity will likely share some common fields and have some that are event-specific.

An example log line may look something like:

2020-01-12T23:01:11 - SomeThreadId - entityType: Query, entity: CTAS_Blah_blah, event: started, took: 123 ms, sql: <<<CREATE TABLE AS SELECT * FROM BLAH;>>>

It's best to keep anything free-form at the end of the message, e.g. the SQL, and wrap it in some kind of delimiter to make it easier to parse.
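For illustration, here's a minimal sketch of what a helper for building such lines might look like. All class and method names here are hypothetical; nothing like this exists in KSQL today:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper for building instrumentation log lines. Structured
// key/value fields come first; a single free-form field (e.g. SQL) goes
// last, wrapped in <<< >>> delimiters so parsers can extract it reliably.
public final class InstrumentationEvent {

  private static final Logger LOG = LoggerFactory.getLogger("instrumentation");

  // LinkedHashMap preserves insertion order, so fields appear in a stable order.
  private final Map<String, String> fields = new LinkedHashMap<>();
  private String freeForm = null;

  public static InstrumentationEvent of(
      final String entityType, final String entity, final String event) {
    final InstrumentationEvent e = new InstrumentationEvent();
    e.fields.put("entityType", entityType);
    e.fields.put("entity", entity);
    e.fields.put("event", event);
    return e;
  }

  public InstrumentationEvent with(final String key, final Object value) {
    fields.put(key, String.valueOf(value));
    return this;
  }

  public InstrumentationEvent withFreeForm(final String key, final String value) {
    freeForm = key + ": <<<" + value + ">>>";
    return this;
  }

  public void log() {
    final StringBuilder sb = new StringBuilder();
    fields.forEach((k, v) -> sb.append(k).append(": ").append(v).append(", "));
    if (freeForm != null) {
      sb.append(freeForm);
    } else if (sb.length() >= 2) {
      sb.setLength(sb.length() - 2); // drop the trailing ", "
    }
    // The timestamp, thread, and level come from the logging framework's layout.
    LOG.info(sb.toString());
  }
}
```

The example line above would then be produced by something like:

```java
InstrumentationEvent.of("Query", "CTAS_Blah_blah", "started")
    .with("took", "123 ms")
    .withFreeForm("sql", "CREATE TABLE AS SELECT * FROM BLAH;")
    .log();
```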

We could even look to add instrumentation that logs the number of messages each stream thread / query is processing, or the size of the data being processed. Potentially, this could be done at multiple points in the topology. Sure, there is an overhead to this, so it should definitely be opt-in, but it can be very powerful.
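As a rough sketch of what the opt-in throughput instrumentation could look like (the flag, names, and interval are all made up, not an existing KSQL config):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical opt-in throughput instrumentation: per-query message counters,
// flushed to the instrumentation log on a fixed interval.
public final class ThroughputInstrumentation {

  private static final Logger LOG =
      LoggerFactory.getLogger("instrumentation.throughput");

  private final boolean enabled;
  private final Map<String, LongAdder> countsByQuery = new ConcurrentHashMap<>();

  public ThroughputInstrumentation(final boolean enabled, final long intervalSeconds) {
    this.enabled = enabled;
    if (!enabled) {
      return; // opt-in: nothing is scheduled and onMessage() is a no-op
    }
    final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        this::flush, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  // Called from the hot path; LongAdder keeps contention low.
  public void onMessage(final String queryId) {
    if (!enabled) {
      return;
    }
    countsByQuery.computeIfAbsent(queryId, id -> new LongAdder()).increment();
  }

  private void flush() {
    countsByQuery.forEach((queryId, count) ->
        LOG.info("entityType: Query, entity: {}, event: throughput, messages: {}",
            queryId, count.sumThenReset()));
  }
}
```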

Users can pull this info into Splunk / Kibana / etc. and build powerful dashboards, e.g. showing the number of queries running, their SQL, their throughput, lag, etc. Companies running many clusters can have high-level dashboards showing overall metrics: number of clusters, cluster sizes, cluster lifecycle events, and so on.

We could even have example Kibana dashboards people can start from.

I've normally written instrumentation logs to a separate log file, to keep them away from the noise of the application logs, which can be full of junk.
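Routing to a separate file is just logging configuration. Assuming a Log4j 1.x properties setup (the appender name, file path, and `ksql.log.dir` property below are made up for illustration), it might look like:

```properties
# Send the 'instrumentation' logger hierarchy to a dedicated file, and stop
# it propagating to the root logger / main application log.
log4j.logger.instrumentation=INFO, instrumentationFile
log4j.additivity.instrumentation=false

log4j.appender.instrumentationFile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.instrumentationFile.File=${ksql.log.dir}/ksql-instrumentation.log
log4j.appender.instrumentationFile.layout=org.apache.log4j.PatternLayout
# timestamp - thread - message, matching the example line above
log4j.appender.instrumentationFile.layout.ConversionPattern=%d{ISO8601} - %t - %m%n
```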

I found this so useful at past companies that I even blogged about it: https://datalorax.wordpress.com/2014/06/20/the-truth-about-your-logs/ and https://datalorax.wordpress.com/2014/06/26/more-bam-for-your-buck/ ;)

rodesai commented 4 years ago

@big-andy-coates the structured logging stuff we added for the processing logger should come in handy here: https://github.com/confluentinc/common/pull/173