
Databricks observability demo

This demo illustrates the collection of metrics, traces and logs from Databricks using OpenTelemetry.

It showcases an automated deployment of a solution comprising Azure Databricks, sample jobs, and telemetry collection into Azure Monitor.

Features

This demo provides the following features:

- Automatic collection of Spark logs, metrics and traces into Azure Monitor through the Application Insights Java agent.
- Sample Python notebooks showing how to emit custom logs, spans and metrics, including from structured streaming queries.
- Sample Databricks jobs that generate telemetry, including one that fails intermittently to produce "interesting" logs.

The demo is automated and can be deployed using Terraform with just two commands.

Getting Started

Prerequisites

You need an Azure subscription and a local installation of Terraform; the Terraform AzureRM provider typically authenticates through the Azure CLI.

Note: you can also use Azure Cloud Shell to avoid having to install software locally.

Installation

Run:

terraform init
terraform apply

⚠️ This sets up a two-node cluster and recurring jobs that run every minute, so the cluster never shuts down automatically. This will incur high costs if you forget to tear down the resources!

If transient deployment errors are reported, run the terraform apply command again.

Destroying the solution

Run:

terraform destroy

Spark Logs and Metrics

Spark Logs and Metrics are collected automatically by the JVM agent.

In the Azure Portal, open the deployed Application Insights resource. Open the Logs pane.

Run the sample queries below to visualize different metrics and logs.

Note that there might be a lag of a few minutes before the data appears.

Tasks

customMetrics
| where name endswith 'Tasks'
| render timechart

Memory

customMetrics
| where name startswith 'spark'
| where name contains 'Memory'
| project-rename memory_bytes = value
| render timechart

customMetrics
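// Label each series as 'driver' or the worker's container IP, using Databricks environment info captured as custom dimensions.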
| extend ip = iif(tobool(customDimensions["DB_IS_DRIVER"]), "driver", customDimensions["DB_CONTAINER_IP"])
| where name in ('spark.driver.ExecutorMetrics.OnHeapUnifiedMemory', 'spark.worker.ExecutorMetrics.OnHeapUnifiedMemory')
| project timestamp, ip, heap_memory_bytes = value
| render timechart

Scheduler message processing time

customMetrics
| where name contains "messageProcessingTime"
| project-rename messageProcessingTime_ms = value
| where not(name contains "count")
| render timechart

Structured streaming

customMetrics
| where name startswith 'spark.streaming.'
| render timechart

Note these are high-level metrics for all streaming queries. For capturing detailed metrics, see Custom streaming metrics below.

Logs

traces

Python and Java log correlation

By using Spark's integration with Mapped Diagnostic Context (MDC), as demonstrated in sample-telemetry-notebook.py, some Java logs can be correlated with their corresponding Python root span.

// Get trace ID via an arbitrary root span.
let trace_id = dependencies
| where name == "process trips"
| project operation_Id
| limit 1;
// Fetch the logs from both Python and Java through manual correlation.
traces
| where (operation_Id in (trace_id)) or (customDimensions["mdc.pyspark_trace_id"] in (trace_id))
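
On the Python side, the correlation works because Spark forwards local properties prefixed with mdc. to the MDC of the tasks it runs (Spark 3.1 and later). A minimal sketch of the idea, not the exact notebook code:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)  # assumes the tracer provider is already configured

with tracer.start_as_current_span("process trips") as span:
    # operation_Id in Application Insights is the 32-character lowercase hex trace ID.
    trace_id = format(span.get_span_context().trace_id, "032x")
    # Spark copies "mdc."-prefixed local properties into executor-side MDC,
    # so Java task logs carry the Python root span's trace ID.
    spark.sparkContext.setLocalProperty("mdc.pyspark_trace_id", trace_id)
    spark.read.table("trips").count()  # illustrative workload; `spark` is the notebook's SparkSession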

JVM Traces

Traces are collected automatically, making it possible to trace distributed requests to services such as Azure Storage, SQL Server, and other storage backends.

Application Map

In Application Insights, open the Application Map pane.

Python Telemetry

Unlike JVM telemetry, Python code is not instrumented automatically and requires explicit instrumentation code. The telemetry notebooks illustrate how this can be achieved.

Custom logs and spans

The notebook sample-telemetry-notebook contains code to capture custom logs and spans.

The notebook is wrapped in a sample-telemetry-caller notebook to ensure the end of the root span is recorded.
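
The general shape of such instrumentation, sketched here with the azure-monitor-opentelemetry distro and the standard OpenTelemetry API (a sketch only; apart from the root span name "process trips", the details below are illustrative and the actual notebooks may configure the exporter differently):

import logging

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Send OpenTelemetry spans and Python log records to Application Insights.
# The connection string comes from the deployed Application Insights resource.
configure_azure_monitor(connection_string="<connection string>")

tracer = trace.get_tracer(__name__)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Internal spans surface as Dependency telemetry; logs emitted inside them are
# correlated with the span through operation_Id.
with tracer.start_as_current_span("process trips"):
    logger.info("Processing trips")
    with tracer.start_as_current_span("save"):
        logger.info("Saving results")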

In Application Insights, open the Transaction search pane. In the Event types filter, select Dependency. In the Place search terms here box, type process. In the Results pane, select any result with Name: process trips.

Open the Traces & events pane for the transaction at the bottom of the screen.

Custom metrics

The notebook sample-telemetry-notebook also contains code to capture custom metrics.
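
A sketch of the pattern with the OpenTelemetry metrics API, assuming the meter provider has already been configured as in the previous sketch; using the notebook path as the meter name is an assumption about how the /shared/sample-telemetry-notebook namespace arises:

import time

from opentelemetry import metrics

# Assumed: the meter name matches the notebook path, which is what appears as
# the metric namespace in Application Insights.
meter = metrics.get_meter("/shared/sample-telemetry-notebook")
save_duration = meter.create_histogram("save_duration", unit="s",
                                       description="Time spent saving the processed data")

start = time.monotonic()
# ... save the data ...
save_duration.record(time.monotonic() - start)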

In Application Insights, open the Metrics pane. In the Metric Namespace filter, select /shared/sample-telemetry-notebook. In the Metric filter, select save_duration.

Custom streaming metrics

The notebook sample-streaming-notebook contains code to capture custom metrics from a streaming query.
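
One way to achieve this (a sketch of the idea, not necessarily the notebook's exact mechanism) is to attach observed metrics to the streaming DataFrame and forward each micro-batch's values to OpenTelemetry from a StreamingQueryListener; this assumes a runtime with Python listener support (Spark 3.4+) and a configured meter provider:

from pyspark.sql.functions import avg
from pyspark.sql.streaming import StreamingQueryListener
from opentelemetry import metrics

meter = metrics.get_meter("/shared/sample-streaming-notebook")  # assumed meter name
avg_value = meter.create_histogram("avg_value")

class ForwardObservedMetrics(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # Observed metrics computed for the micro-batch are exposed on the progress event.
        observed = event.progress.observedMetrics.get("metrics")
        if observed is not None and observed["avg_value"] is not None:
            avg_value.record(observed["avg_value"])

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(ForwardObservedMetrics())

# Compute avg_value over each micro-batch of an illustrative rate source.
stream = spark.readStream.format("rate").load().observe("metrics", avg("value").alias("avg_value"))
query = stream.writeStream.format("noop").start()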

In Application Insights, open the Metrics pane. In the Metric Namespace filter, select /shared/sample-streaming-notebook. In the Metric filter, select avg_value.

About the solution

Overview

The solution deploys Azure Databricks connected to Azure Application Insights for monitoring via the Spark JMX Sink. One Databricks job runs periodically and is set up to fail about 50% of the time, to provide "interesting" logs. Other jobs are also set up to demonstrate different types of telemetry.

The cluster is configured to use an external Hive metastore in Azure SQL Database.

Init script

The solution contains a cluster node initialization script that generates a configuration file for the agent, based on templates in the solution.

Spark JMX MBeans on executor nodes are prefixed with a configurable namespace name followed by the executor ID, which is a different number on every worker node. The Azure Monitor agent allows using an object name pattern when defining JMX MBeans to monitor, although this feature was undocumented as of April 2023 (a documentation update was submitted).

For example, the JMX metrics MBean pattern metrics:name=spark.*.executor.threadpool.startedTasks,type=gauges would match each of the following MBeans on a cluster with 3 worker nodes:

metrics:name=spark.0.executor.threadpool.startedTasks,type=gauges
metrics:name=spark.1.executor.threadpool.startedTasks,type=gauges
metrics:name=spark.2.executor.threadpool.startedTasks,type=gauges

All MBeans are set up to report the same Application Insights metric name:

spark.worker.executor.threadpool.startedTasks

Each executor agent reports its metric under the common Application Insights metric name, so that the values can be tallied up.
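
For illustration, a single jmxMetrics entry in the generated applicationinsights.json along these lines (a sketch of the idea, not the generated file verbatim) maps the wildcard object name pattern to that common metric name:

{
  "jmxMetrics": [
    {
      "name": "spark.worker.executor.threadpool.startedTasks",
      "objectName": "metrics:name=spark.*.executor.threadpool.startedTasks,type=gauges",
      "attribute": "Value"
    }
  ]
}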

The configuration for the applicationinsights.json files was initially generated with this notebook to collect MBean information from each cluster node.

Known limitations

Main contributors