Implement groupby by multiple columns in Spark DataFrame API

OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.

https://pipelinedp.io/

Apache License 2.0


Implement groupby by multiple columns in Spark DataFrame API #501

Closed dvadym closed 8 months ago

dvadym commented 8 months ago

The DataFrame API allows performing DP aggregations on DataFrames (currently only Spark DataFrames are supported). In the DataFrame API, privacy_key, partition_key and values are specified by column names. Currently partition_key (i.e. the group-by key) can be specified by only one column. This PR makes it possible to use multiple columns as partition_key.
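To illustrate the idea of a multi-column partition_key, here is a minimal plain-Python sketch. It does not use PipelineDP or Spark, and it adds no noise, so it is not differentially private. The helper names (`composite_key`, `group_sum`) and the column names (`user_id`, `country`, `day`, `spent`) are hypothetical and chosen only for this example. It shows just one thing: the values of several columns can be combined into a single tuple that serves as the group-by key.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Mapping, Sequence, Tuple


def composite_key(row: Mapping[str, Hashable],
                  columns: Sequence[str]) -> Tuple[Hashable, ...]:
    """Builds a single partition key from the values of several columns."""
    return tuple(row[column] for column in columns)


def group_sum(rows: Iterable[Mapping[str, Hashable]],
              partition_columns: Sequence[str],
              value_column: str) -> Dict[Tuple[Hashable, ...], float]:
    """Sums value_column per composite key (plain sum, no DP noise)."""
    totals: Dict[Tuple[Hashable, ...], float] = defaultdict(float)
    for row in rows:
        totals[composite_key(row, partition_columns)] += row[value_column]
    return dict(totals)


# Hypothetical data: user_id would play the role of privacy_key; it is not
# used in this non-private sketch.
rows = [
    {"user_id": 1, "country": "US", "day": "Mon", "spent": 10.0},
    {"user_id": 2, "country": "US", "day": "Mon", "spent": 5.0},
    {"user_id": 1, "country": "US", "day": "Tue", "spent": 2.0},
    {"user_id": 3, "country": "DE", "day": "Mon", "spent": 7.0},
]

# Group by two columns at once: (country, day).
result = group_sum(rows, ["country", "day"], "spent")
```

With a single-column partition_key, the same helper would be called with a one-element list such as `["country"]`, which is why treating the key as a tuple of column values generalizes the one-column case.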