Gelerion / spark-sketches

Integrating probabilistic algorithms into Spark using DataSketches
MIT License

Overview

The goal of this project is to seamlessly integrate probabilistic algorithms, in the form of DataSketches, into Spark. This is an excellent fit for both real-time analytics and incremental updates of historical data, as long as the results can tolerate slight inaccuracies, which come with mathematically proven error bounds.

The project currently supports two types of sketches:

Highlights

Get it!

Maven

<dependency>
    <groupId>com.gelerion.spark.sketches</groupId>
    <artifactId>spark-sketches</artifactId>
    <version>1.0.0</version>
</dependency>
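
For sbt, the equivalent dependency should look like this (assuming the artifact is published without a Scala-version suffix, as the Maven coordinates above suggest):

libraryDependencies += "com.gelerion.spark.sketches" % "spark-sketches" % "1.0.0"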

Version compatibility matrix

Project version   Scala   Spark
1.0.0             2.12    3.3.x -> 3.x.x
0.9.0             2.12    3.0.x -> 3.2.x
0.8.2             2.11    2.4.3

Use It!

In SQL

First, register the functions. This only needs to be done once.

import org.apache.spark.sql.registrar.SketchFunctionsRegistrar

SketchFunctionsRegistrar.registerFunctions(spark)
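
For context, a minimal setup might look like the following sketch; the SparkSession construction is ordinary Spark boilerplate, and only the registration call comes from this project:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.registrar.SketchFunctionsRegistrar

val spark = SparkSession.builder()
  .appName("sketches-example")
  .master("local[*]")
  .getOrCreate()

// makes theta_sketch_build, theta_sketch_merge, theta_sketch_get_estimate
// available to Spark SQL queries
SketchFunctionsRegistrar.registerFunctions(spark)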

Aggregating

The sketch is produced as raw bytes. This lets us save it to a file and reconstruct the sketch after loading it. Alternatively, these files can be fed to a real-time analytics layer such as Druid.

SELECT dimension_1, dimension_2, theta_sketch_build(metric) AS sketch
FROM table
GROUP BY dimension_1, dimension_2
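
Since the sketch column is plain binary data, the aggregated result can be persisted like any other DataFrame. A hedged example, where the source table and output path are hypothetical:

val sketches = spark.sql("""
  SELECT dimension_1, dimension_2, theta_sketch_build(metric) AS sketch
  FROM table
  GROUP BY dimension_1, dimension_2
""")

// the sketch survives a write/read round trip as raw bytes
sketches.write.mode("overwrite").parquet("/tmp/daily_sketches")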

Merging sketches

One of the most powerful properties of sketches is that they are mergeable. The merge function takes another sketch as its input.

SELECT dimension_1, theta_sketch_merge(sketch) AS sketch
FROM table
GROUP BY dimension_1
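
This makes rollups cheap: sketches persisted earlier can be merged into a coarser grouping without revisiting the raw data. A sketch of the idea, reusing the hypothetical path from the example above:

val daily = spark.read.parquet("/tmp/daily_sketches")
daily.createOrReplaceTempView("daily_sketches")

// collapse per-(dimension_1, dimension_2) sketches into per-dimension_1 sketches
spark.sql("""
  SELECT dimension_1, theta_sketch_merge(sketch) AS sketch
  FROM daily_sketches
  GROUP BY dimension_1
""")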

Getting results

SELECT theta_sketch_get_estimate(sketch) FROM table
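
The three steps also compose within a single query, assuming the functions nest like ordinary Spark SQL expressions (table and column names are hypothetical):

spark.sql("""
  SELECT dimension_1,
         theta_sketch_get_estimate(theta_sketch_merge(sketch)) AS approx_distinct
  FROM daily_sketches
  GROUP BY dimension_1
""")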

In Scala

First, import the functions.

import org.apache.spark.sql.functions_ex._

Aggregating

df.groupBy($"dimension")
  .agg(theta_sketch_build($"metric"))

Merging sketches

df.groupBy($"dimension")
  .agg(theta_sketch_merge($"sketch"))

Getting results

df.select(theta_sketch_get_estimate($"sketch"))
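
Putting the Scala API together, here is a minimal end-to-end sketch with toy data (assuming theta_sketch_build accepts string input; the data and column names are illustrative):

import spark.implicits._
import org.apache.spark.sql.functions_ex._

val events = Seq(
  ("us", "u1"), ("us", "u2"), ("us", "u1"),
  ("eu", "u3"), ("eu", "u4")
).toDF("country", "user_id")

// build one sketch per country, then read off the per-country distinct-user estimate
val estimates = events
  .groupBy($"country")
  .agg(theta_sketch_build($"user_id").as("sketch"))
  .select($"country", theta_sketch_get_estimate($"sketch").as("approx_users"))

estimates.show()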

Available functions