Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0

Issue 331: Add Compute Histogram utility method #332

Closed: osopardo1 closed this 3 months ago

osopardo1 commented 3 months ago

Description

Fixes #331

Type of change

New Feature, no breaking change.

Adds a simple API for computing the histogram of a column. Usage:

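import io.qbeast.spark.utils.QbeastUtils

// df (the DataFrame to write) and targetPath (the output location) are assumed to be defined.
// Compute the histogram for the "brand" column and wrap it in the column stats JSON.
val brandStats = QbeastUtils.computeHistogramForColumn(df, "brand", 50)
val statsStr = s"""{"brand_histogram":$brandStats}"""

// Index "brand" with the histogram transformation, passing the precomputed stats.
(df
  .write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "brand:histogram")
  .option("columnStats", statsStr)
  .save(targetPath))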
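
For readers unfamiliar with the idea, the following is a minimal, Spark-free sketch of what a histogram over a string column could look like. It is not the code in this PR. It assumes that the third argument of computeHistogramForColumn is a bin count and that the result is a sorted array of bin boundaries. The object HistogramSketch, its helper names, and the exact output format are hypothetical.

```scala
// Illustrative sketch only: NOT the qbeast-spark implementation.
object HistogramSketch {

  // Hypothetical helper: sort the distinct values, split them into at most
  // numBins contiguous groups, and keep the smallest value of each group.
  def binStarts(values: Seq[String], numBins: Int): Seq[String] = {
    require(numBins > 0, "numBins must be positive")
    val sorted = values.distinct.sorted
    if (sorted.isEmpty) Seq.empty
    else {
      // Ceiling division, so that no more than numBins groups are produced.
      val groupSize = math.max(1, (sorted.size + numBins - 1) / numBins)
      sorted.grouped(groupSize).map(_.head).toList
    }
  }

  // Hypothetical helper: render the bin starts as a JSON-style array.
  // Values are not escaped, so this assumes they contain no quotes.
  def toJsonArray(bins: Seq[String]): String =
    bins.map(b => "\"" + b + "\"").mkString("[", ",", "]")

  // Same wrapping as in the usage snippet: "<column>_histogram" maps to the array.
  def columnStats(column: String, histogram: String): String =
    s"""{"${column}_histogram":$histogram}"""
}
```

Under these assumptions, a column with many distinct brands is reduced to at most numBins boundary values, which keeps the columnStats string small regardless of the column's cardinality.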


How Has This Been Tested? (Optional)

Tests can be found in io.qbeast.spark.utils.QbeastUtilsTest.
