The goal of this project is to seamlessly integrate probabilistic algorithms, in the form of the DataSketches library, into Spark. Sketches are an excellent fit for both real-time analytics and incremental historical data updates, as long as the results can tolerate slight inaccuracies, which come with mathematically proven error bounds.
The project currently supports two types of sketches; the Theta sketch is used in the examples below. Key properties:

- Convenient: no `.toRdd` casts or error-prone methods such as `map`, `mapPartition`, etc.
- Works with both the SQL and DataFrame/Dataset APIs.
- Implemented as native Spark aggregation functions via `TypedImperativeAggregate`.
To use the library, add the Maven dependency:

```xml
<dependency>
    <groupId>com.gelerion.spark.sketches</groupId>
    <artifactId>spark-sketches</artifactId>
    <version>1.0.0</version>
</dependency>
```
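If you build with sbt instead of Maven, the equivalent dependency should look like the following (assuming the artifact is published without a Scala-version suffix, as the Maven coordinates above suggest):

```scala
// Hedged sbt equivalent of the Maven coordinates above
libraryDependencies += "com.gelerion.spark.sketches" % "spark-sketches" % "1.0.0"
```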
| Project version | Scala | Spark |
|---|---|---|
| 1.0.0 | 2.12 | 3.3.x -> 3.x.x |
| 0.9.0 | 2.12 | 3.0.x -> 3.2.x |
| 0.8.2 | 2.11 | 2.4.3 |
First, register the functions. This only needs to be done once per `SparkSession`.

```scala
import org.apache.spark.sql.registrar.SketchFunctionsRegistrar

SketchFunctionsRegistrar.registerFunctions(spark)
```
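For context, a minimal end-to-end setup might look like this; the application name is illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.registrar.SketchFunctionsRegistrar

val spark = SparkSession.builder()
  .appName("sketches-example") // hypothetical app name
  .getOrCreate()

// Register once; the sketch functions are then visible to every
// SQL query executed through this session.
SketchFunctionsRegistrar.registerFunctions(spark)
```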
The sketch is produced as raw bytes. This makes it possible to save a sketch to a file and recreate it after loading, or to feed the files into a real-time analytics layer such as Druid.
```sql
SELECT dimension_1, dimension_2, theta_sketch_build(metric)
FROM table
GROUP BY dimension_1, dimension_2
```
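Because the sketch column is plain binary data, it can be written out and read back like any other column. A minimal sketch of that round trip, assuming a hypothetical Parquet path and that the functions are already registered:

```scala
// Build per-day sketches and persist the raw sketch bytes
// (the table name and output path are hypothetical).
spark.sql("""
  SELECT dimension_1, dimension_2, theta_sketch_build(metric) AS sketch
  FROM table
  GROUP BY dimension_1, dimension_2
""").write.parquet("/tmp/sketches/day=2024-01-01")

// Later: load the bytes back; the column can be merged or estimated directly.
val sketches = spark.read.parquet("/tmp/sketches")
sketches.createOrReplaceTempView("daily_sketches")
```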
One of the most powerful features of sketches is that they are mergeable: the `theta_sketch_merge` aggregate function takes previously built sketches as its input.
```sql
SELECT dimension_1, theta_sketch_merge(sketch)
FROM table
GROUP BY dimension_1
```
Finally, extract the distinct-count estimate from a sketch:

```sql
SELECT theta_sketch_get_estimate(sketch) FROM table
```
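Mergeability is what makes incremental historical updates cheap: per-day sketches can be rolled up into a longer-range estimate without rescanning the raw events. A hedged example, reusing the hypothetical `daily_sketches` view from above:

```scala
// Roll daily sketches up to one estimate per dimension; only the
// pre-aggregated sketch bytes are read, never the raw events.
val rollup = spark.sql("""
  SELECT dimension_1,
         theta_sketch_get_estimate(theta_sketch_merge(sketch)) AS distinct_count
  FROM daily_sketches
  GROUP BY dimension_1
""")
```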
The same functions are available through the DataFrame/Dataset API. First, import them.

```scala
import org.apache.spark.sql.functions_ex._
```
df.groupBy($"dimension")
.agg(theta_sketch_build($"metric"))
df.groupBy($"dimension")
.agg(theta_sketch_merge($"sketch"))
df.select(theta_sketch_get_estimate($"sketch"))
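Putting the pieces together, here is a sketch of a complete build-merge-estimate pipeline; the `events` DataFrame and column names are illustrative:

```scala
import org.apache.spark.sql.functions_ex._
import spark.implicits._ // enables the $"..." column syntax

// 1. Build one sketch per dimension value from the raw events.
val daily = events
  .groupBy($"dimension")
  .agg(theta_sketch_build($"metric").as("sketch"))

// 2. Merge sketches (e.g. across days) and extract the estimate.
val estimates = daily
  .groupBy($"dimension")
  .agg(theta_sketch_merge($"sketch").as("merged"))
  .select($"dimension", theta_sketch_get_estimate($"merged").as("distinct_count"))
```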