dataflint / spark

Performance Observability for Apache Spark
Apache License 2.0


Spark Performance Made Simple

[![Maven Package](https://maven-badges.herokuapp.com/maven-central/io.dataflint/spark_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/io.dataflint/spark_2.12) [![Slack](https://img.shields.io/badge/Slack-Join%20Us-purple)](https://join.slack.com/t/dataflint/shared_invite/zt-28sr3r3pf-Td_mLx~0Ss6D1t0EJb8CNA) [![Test Status](https://github.com/dataflint/spark/actions/workflows/ci.yml/badge.svg)](https://github.com/dataflint/spark/actions/workflows/ci.yml) [![Docs](https://img.shields.io/badge/Docs-Read%20the%20Docs-blue)](https://dataflint.gitbook.io/dataflint-for-spark/) ![License](https://img.shields.io/badge/License-Apache%202.0-orange)

If you enjoy DataFlint, please give us a ⭐️ and join our [slack community](https://join.slack.com/t/dataflint/shared_invite/zt-28sr3r3pf-Td_mLx~0Ss6D1t0EJb8CNA) for feature requests, support, and more!

What is DataFlint?

DataFlint is a modern, user-friendly enhancement for Apache Spark that simplifies performance monitoring and debugging. It adds an intuitive tab to the existing Spark Web UI, transforming a powerful but often overwhelming interface into something easy to navigate and understand.

Why DataFlint?

With DataFlint, you spend less time deciphering the Spark Web UI and more time deriving value from your data. Make big data work better for you, regardless of your role or experience level with Spark.

Usage

After installation, you will see a "DataFlint" tab in the Spark Web UI. Click on it to start using DataFlint.



Features

See Our Features for more information

Installation

Scala

Install DataFlint via sbt:

```scala
libraryDependencies += "io.dataflint" %% "spark" % "0.2.6"
```

Then instruct Spark to load the DataFlint plugin:

```scala
val spark = SparkSession
    .builder()
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    ...
    .getOrCreate()
```

PySpark

Add these two configs to your PySpark session builder:

```python
builder = pyspark.sql.SparkSession.builder \
    ... \
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.6") \
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
    ...
```
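Put together, a complete session setup might look like the following sketch. The app name and local master are illustrative assumptions; only the package coordinates and plugin class come from the configs above:

```python
# Hypothetical end-to-end PySpark setup with DataFlint enabled.
# Requires a working Spark installation (e.g. pip install pyspark).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dataflint-demo")   # illustrative app name
    .master("local[*]")          # illustrative: run Spark locally
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.6")
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    .getOrCreate()
)

# Once the session is up, the DataFlint tab should appear in the Spark Web UI.
```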

Spark Submit

Alternatively, install DataFlint with no code changes as a Spark Ivy package by adding these two lines to your spark-submit command:

```shell
spark-submit \
--packages io.dataflint:spark_2.12:0.2.6 \
--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
...
```

Additional installation options

How it Works

DataFlint is installed as a plugin on the Spark driver and history server.

The plugin exposes additional HTTP resources with metrics not available in the Spark UI, and a modern SPA web app that fetches data from Spark without needing to refresh the page.

For more information, see the how it works docs
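As a concrete illustration of the above, a hedged sketch: it assumes a local PySpark session, and uses `uiWebUrl` (a standard `SparkContext` property) to show where the driver's Web UI, and hence the DataFlint tab, is served:

```python
from pyspark.sql import SparkSession

# Start a local session with the DataFlint plugin loaded
# (package coordinates from the installation section above).
spark = (
    SparkSession.builder
    .master("local[*]")  # illustrative: local driver
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.6")
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    .getOrCreate()
)

# The plugin serves its extra metrics and SPA alongside the regular Spark UI;
# printing the UI address shows where to look for the DataFlint tab.
print(spark.sparkContext.uiWebUrl)
```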

Medium Articles

Compatibility Matrix

DataFlint requires Spark version 3.2 and up, and supports both Scala versions 2.12 and 2.13.

| Spark Platform | DataFlint Realtime | DataFlint History Server |
| --- | --- | --- |
| Local | | |
| Standalone | | |
| Kubernetes Spark Operator | | |
| EMR | | |
| Dataproc | | |
| HDInsights | | |
| Databricks | | |

For more information, see the supported versions docs