Spark-Dashboard is a monitoring tool that collects Apache Spark metrics and displays them on a customizable Grafana dashboard for real-time performance tracking and optimization.
Main author and contact: Luca.Canali@cern.ch
This section outlines an integrated monitoring pipeline for Apache Spark built from open-source components: Spark emits metrics via its Graphite sink to a listener on port 2003, the metrics are stored in VictoriaMetrics, and Grafana visualizes them.
Note: spark-dashboard v1 (the original implementation) uses InfluxDB as the time-series database, see also the spark-dashboard v1 architecture
This quickstart guide outlines three methods for deploying Spark Dashboard:
If you opt to deploy using a container image, follow these steps:
The provided container image has been built and configured to run VictoriaMetrics and Grafana
docker run -p 3000:3000 -p 2003:2003 -d lucacanali/spark-dashboard
podman run -p 3000:3000 -p 2003:2003 -d lucacanali/spark-dashboard
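To check that the container is up before wiring Spark to it, you can probe Grafana's health API and the Graphite ingestion port; a minimal sketch, assuming the default port mappings above:

```bash
# Grafana exposes a health endpoint; expect a small JSON reply with "database": "ok"
curl -s http://localhost:3000/api/health

# Verify that the Graphite ingestion port accepts connections
# (flags vary slightly between netcat implementations)
nc -zv localhost 2003
```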
You need to configure Spark to send the metrics to the desired Graphite endpoint and add the related configuration.
You can do this by editing the file `metrics.properties`, located in `$SPARK_CONF_DIR`, as follows:
# Add this to metrics.properties
*.sink.graphite.host=localhost
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=lucatest
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Additional configuration that you should pass as command-line options (or add to `spark-defaults.conf`):
--conf spark.metrics.staticSources.enabled=true
--conf spark.metrics.appStatusSource.enabled=true
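For example, a minimal launch using the metrics.properties approach could look like this (a sketch only; the `SPARK_CONF_DIR` path is a placeholder):

```bash
# Assumes the metrics.properties shown above is in $SPARK_CONF_DIR
export SPARK_CONF_DIR=/path/to/spark/conf
bin/spark-shell \
  --conf spark.metrics.staticSources.enabled=true \
  --conf spark.metrics.appStatusSource.enabled=true
```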
Instead of using metrics.properties, you may prefer to use Spark configuration options directly. It's a matter of convenience and depends on your use case. This is an example of how to do it:
# VictoriaMetrics Graphite endpoint, point to the host where the VictoriaMetrics container is running
VICTORIAMETRICS_ENDPOINT=`hostname`
# The same options work with spark-submit or pyspark
bin/spark-shell \
--conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
--conf "spark.metrics.conf.*.sink.graphite.host"=$VICTORIAMETRICS_ENDPOINT \
--conf "spark.metrics.conf.*.sink.graphite.port"=2003 \
--conf "spark.metrics.conf.*.sink.graphite.period"=10 \
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds \
--conf "spark.metrics.conf.*.sink.graphite.prefix"="lucatest" \
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource" \
--conf "spark.metrics.staticSources.enabled"=true \
--conf "spark.metrics.appStatusSource.enabled"=true
Optional configuration if you want to collect and display "Tree Process Memory Details":
--conf spark.executor.processTreeMetrics.enabled=true
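To verify that metrics are flowing, you can push a test datapoint to the Graphite endpoint by hand and, if you also published port 8428 (VictoriaMetrics' HTTP port), list the ingested series names. This is only a sketch; the metric name `test.metric` is a made-up example:

```bash
# Send one datapoint in Graphite plaintext protocol: "<name> <value> <unix_timestamp>"
echo "test.metric 1 $(date +%s)" | nc -w 1 localhost 2003

# If the container was started with -p 8428:8428, VictoriaMetrics' Prometheus-compatible
# API can list the metric names it has ingested
curl -s "http://localhost:8428/api/v1/label/__name__/values"
```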
The dashboard provides visualization of the collected metrics:
How to use: point your browser to `http://localhost:3000` (edit `localhost` to point to your Grafana host, as relevant).
Notes:
An extended Spark dashboard pipeline is available to collect and visualize OS and storage data. This utilizes Spark Plugins to collect the extended metrics. The metrics are collected and stored in the same VictoriaMetrics database as the Spark metrics.
The extended Spark dashboard has three additional groups of graphs compared to the "standard" Spark Dashboard, covering the OS (cgroup), HDFS, and cloud filesystem metrics collected by the Spark plugins.
Configuration:
--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.3
--conf spark.plugins=ch.cern.HDFSMetrics,ch.cern.CgroupMetrics,ch.cern.CloudFSMetrics
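Putting it together, a sketch of a launch that enables the plugins on top of the Graphite sink configuration shown earlier (it reuses the `VICTORIAMETRICS_ENDPOINT` variable and the package version from the examples in this guide):

```bash
bin/spark-shell \
  --packages ch.cern.sparkmeasure:spark-plugins_2.12:0.3 \
  --conf spark.plugins=ch.cern.HDFSMetrics,ch.cern.CgroupMetrics,ch.cern.CloudFSMetrics \
  --conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
  --conf "spark.metrics.conf.*.sink.graphite.host"=$VICTORIAMETRICS_ENDPOINT \
  --conf "spark.metrics.conf.*.sink.graphite.port"=2003 \
  --conf "spark.metrics.conf.*.sink.graphite.period"=10 \
  --conf "spark.metrics.conf.*.sink.graphite.unit"=seconds \
  --conf "spark.metrics.conf.*.sink.graphite.prefix"="lucatest" \
  --conf spark.metrics.staticSources.enabled=true \
  --conf spark.metrics.appStatusSource.enabled=true
```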
Use the extended dashboard when visualizing these metrics in Grafana.
Download and expand the TPCDS test dataset, then run a short TPCDS workload to generate metrics:
wget https://sparkdltrigger.web.cern.ch/sparkdltrigger/TPCDS/tpcds_10.zip
unzip -q tpcds_10.zip
tpcds_pyspark_run.py -d tpcds_10 -n 1 -r 1 --queries q1,q2
docker run -p 2003:2003 -p 3000:3000 -d lucacanali/spark-dashboard
TPCDS_PYSPARK=$(which tpcds_pyspark_run.py)
spark-submit --master local[*] \
--conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
--conf "spark.metrics.conf.*.sink.graphite.host"="localhost" \
--conf "spark.metrics.conf.*.sink.graphite.port"=2003 \
--conf "spark.metrics.conf.*.sink.graphite.period"=10 \
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds \
--conf "spark.metrics.conf.*.sink.graphite.prefix"="lucatest" \
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource" \
--conf "spark.metrics.staticSources.enabled"=true \
--conf "spark.metrics.appStatusSource.enabled"=true \
--conf spark.driver.memory=4g \
--conf spark.log.level=error \
--packages ch.cern.sparkmeasure:spark-measure_2.12:0.24 \
$TPCDS_PYSPARK -d tpcds_10
#### Running TPCDS on a Spark cluster
- Example of running TPCDS on a YARN Spark cluster, monitored with the Spark dashboard:
TPCDS_PYSPARK=$(which tpcds_pyspark_run.py)
spark-submit --master yarn --conf spark.log.level=error --conf spark.executor.cores=8 --conf spark.executor.memory=64g \
--conf spark.driver.memory=16g --conf spark.driver.extraClassPath=tpcds_pyspark/spark-measure_2.12-0.24.jar \
--conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=32 --conf spark.sql.shuffle.partitions=512 \
$TPCDS_PYSPARK -d hdfs://
- Example of running TPCDS on a Kubernetes cluster with S3 storage, monitor this with the extended dashboard using Spark plugins:
TPCDS_PYSPARK=$(which tpcds_pyspark_run.py)
spark-submit --master k8s://https://xxx.xxx.xxx.xxx:6443 --conf spark.kubernetes.container.image=
---
## Old implementation (v1)
### How to run the Spark dashboard V1 on a container
This is the original implementation of the tool using InfluxDB and Grafana
**1. Start the container**
The provided container image has been built and configured to run InfluxDB and Grafana
- `docker run -p 3000:3000 -p 2003:2003 -d lucacanali/spark-dashboard:v01`
- Note: port 2003 is for Graphite ingestion, port 3000 is for Grafana
- More options, including on how to persist InfluxDB data across restarts at: [Spark dashboard in a container](dockerfiles)
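For example, a hedged sketch of persisting InfluxDB data with a host-mounted volume (the in-container data path `/var/lib/influxdb` is an assumption based on the InfluxDB 1.x default; check the dockerfiles documentation for the exact path used by the image):

```bash
# Mount a host directory so InfluxDB data survives container restarts
docker run -p 3000:3000 -p 2003:2003 \
  -v /path/on/host/influxdb_data:/var/lib/influxdb \
  -d lucacanali/spark-dashboard:v01
```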
**2. Spark configuration**
See above
**3. Visualize the metrics using a Grafana dashboard**
- Point your browser to `http://hostname:3000` (edit `hostname` as relevant)
- See details above
---
### How to run the dashboard V1 on Kubernetes using Helm
If you choose to run on Kubernetes, these are the steps:
1. The Helm chart takes care of configuring and running InfluxDB and Grafana:
- Quickstart: `helm install spark-dashboard https://github.com/cerndb/spark-dashboard/raw/master/charts/spark-dashboard-0.3.0.tgz`
- Details: [charts](charts)
2. Spark configuration:
- Configure `metrics.properties` as detailed above.
- Use `INFLUXDB_ENDPOINT=spark-dashboard-influx.default.svc.cluster.local` as the InfluxDB endpoint in
the Spark configuration.
3. Grafana visualization with Helm:
- The Grafana dashboard is reachable at port 3000 of the spark-dashboard-service.
- See service details: `kubectl get service spark-dashboard-grafana`
- When using NodePort and an internal cluster IP address, this is how you can port forward to the service from
the local machine: `kubectl port-forward service/spark-dashboard-grafana 3000:3000`
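For example, to find the NodePort assigned to Grafana when the service is of type NodePort (a sketch; the jsonpath expression assumes the first port entry is the Grafana web port):

```bash
kubectl get service spark-dashboard-grafana -o jsonpath='{.spec.ports[0].nodePort}'
```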
More info at [Spark dashboard on Kubernetes](charts/README.md)
---
## Advanced configurations and notes
### Graph annotations: display query/job/stage start and end times
Optionally, you can add annotation instrumentation to the performance dashboard v1.
Annotations provide additional info on start and end times for queries, jobs and stages.
To activate annotations, add the following additional configuration, needed for collecting and writing
extra performance data:
INFLUXDB_HTTP_ENDPOINT="http://`hostname`:8086"
--packages ch.cern.sparkmeasure:spark-measure_2.12:0.24 \
--conf spark.sparkmeasure.influxdbURL=$INFLUXDB_HTTP_ENDPOINT \
--conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSink \
### Notes
- More details on how this works and alternative configurations at [Spark Dashboard](https://github.com/LucaCanali/Miscellaneous/tree/master/Spark_Dashboard)
- The dashboard can be used when running Spark on a cluster (Kubernetes, YARN, Standalone) or in local mode.
- When using Spark in local mode, use Spark version 3.1 or higher, see [SPARK-31711](https://issues.apache.org/jira/browse/SPARK-31711)
### Docker / Podman
- Telegraf will use port 2003 (graphite endpoint) and port 8428 (VictoriaMetrics source) of your machine/VM.
- For dashboard v1: InfluxDB will use port 2003 (graphite endpoint), and port 8086 (http endpoint) of
your machine/VM (when running using `--network=host`).
- Note: the endpoints need to be available on the node where you started the container and
reachable by Spark executors and driver (mind the firewall).
### Helm
- Find the InfluxDB endpoint IP with `kubectl get service spark-dashboard-influx`.
- Optionally, resolve the DNS name with `nslookup` of such IP.
For example, the InfluxDB service host name of a test installation is: `spark-dashboard-influx.default.svc.cluster.local`
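One way to check the in-cluster DNS name is to resolve it from a throwaway pod; a sketch using busybox (the pod name and image are just examples):

```bash
kubectl run -it --rm dns-check --image=busybox --restart=Never -- nslookup spark-dashboard-influx
```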
### Customizing and adding new dashboards
- This implementation comes with some example dashboards. Note that only a subset of the
metrics values logged into VictoriaMetrics are visualized in the provided dashboard.
- For a full list of the available metrics see the [documentation of Spark metrics system](https://github.com/apache/spark/blob/master/docs/monitoring.md#metrics).
- New dashboards can be added by putting them in the relevant `grafana_dashboards` folder and re-building the container image
(or re-packaging the helm chart).
- On Helm: running `helm upgrade` is enough to upload the new dashboard as a ConfigMap and make it available to Grafana.
- Automatically persisting manual edits is not supported at this time.
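For example, a hedged sketch of adding a dashboard and redeploying with Helm (the chart path and release name follow the quickstart above; the dashboard file name is a placeholder):

```bash
# Copy the new dashboard JSON next to the existing example dashboards (path is an assumption)
cp my_dashboard.json charts/spark-dashboard/grafana_dashboards/
# Re-package the chart and upgrade the release; the dashboard is shipped to Grafana as a ConfigMap
helm package charts/spark-dashboard
helm upgrade spark-dashboard ./spark-dashboard-*.tgz
```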