Deriving Spark DataFrame schema on converting from RDD to DataFrame

OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.

https://pipelinedp.io/

Apache License 2.0


Deriving Spark DataFrame schema on converting from RDD to DataFrame #508

Closed dvadym closed 7 months ago

dvadym commented 7 months ago

If the output is not empty, the schema can be deduced automatically. If the output is empty (e.g. because all partitions are dropped by partition selection), there is nothing to deduce the schema from, and the conversion failed with an "Empty RDD" error. Setting the schema explicitly fixes that.
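For illustration, here is a minimal sketch of the idea, not the actual PipelineDP change. It assumes the output column names and types are known up front. The helper names `build_schema_ddl` and `rdd_to_dataframe`, and the columns `region`, `visit_count` and `spend_sum`, are hypothetical. The only Spark call used is `SparkSession.createDataFrame`, which accepts a DDL-formatted string as its `schema` argument.

```python
def build_schema_ddl(key_columns, metric_columns):
    """Build a Spark DDL schema string such as 'region string, visit_count double'.

    key_columns: list of (name, spark_type) pairs for the partition key columns.
    metric_columns: names of the aggregated metrics, assumed here to be doubles.
    Column names are assumed to be plain identifiers (no quoting is applied).
    """
    fields = [f"{name} {dtype}" for name, dtype in key_columns]
    fields += [f"{name} double" for name in metric_columns]
    return ", ".join(fields)


def rdd_to_dataframe(spark, rdd, schema_ddl):
    """Convert an RDD of tuples to a DataFrame using an explicit schema.

    Because the schema is passed in, Spark does not need to inspect the
    rows to infer it, so an empty RDD is expected to convert as well.
    """
    return spark.createDataFrame(rdd, schema=schema_ddl)


# Hypothetical output layout: one partition key column and two metrics.
schema_ddl = build_schema_ddl([("region", "string")],
                              ["visit_count", "spend_sum"])
```

With a live `SparkSession`, calling `rdd_to_dataframe(spark, spark.sparkContext.emptyRDD(), schema_ddl)` should yield an empty DataFrame that still has the three columns, instead of failing during schema inference.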