OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Spark 2.X.X support? #234

Open SemyonSinchenko opened 2 years ago

SemyonSinchenko commented 2 years ago

Question

Does PipelineDP support the 2.X.X versions of Apache Spark?

Further Information

I see a pyspark 3.2.0 dependency in pyproject.toml. But real enterprise and on-premise clusters typically run version 2.X.X. Is any Spark version other than 3.2.0 supported?

Additional Context

It would be good to see a list of supported Spark/Beam versions, but I couldn't find one. Maybe there is one? If so, could you please send me a link? Thank you!

dvadym commented 2 years ago

We haven't tested on 2.X yet, though I think it should be easy to add support for it (it might even work with 2.X out of the box). That's because PipelineDP needs only some basic RDD APIs, like map, reduceByKey, join, etc. (other Spark APIs such as DataFrames are not supported yet). You can see all the Spark APIs used in the SparkRDDBackend class. If you have any feedback on using Spark, please let me know. Also, if you test it with Spark 2.*, please let me know the results.
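For anyone who wants to try this on a 2.X cluster, a quick smoke test of the RDD calls mentioned above might look like the sketch below. The function name `rdd_smoke_test` and the sample data are hypothetical and not part of PipelineDP. It only checks that map, reduceByKey, and join behave as expected on a given SparkContext, not that PipelineDP as a whole works on that version.

```python
from operator import add


def rdd_smoke_test(sc):
    """Run map, reduceByKey and join on a tiny dataset; return the sorted result.

    `sc` is expected to be a pyspark.SparkContext. The function name and the
    sample data are illustrative only and are not part of PipelineDP.
    """
    visits = sc.parallelize([("alice", 1), ("bob", 2), ("alice", 3)])
    regions = sc.parallelize([("alice", "EU"), ("bob", "US")])

    # map: transform each value while keeping the key
    doubled = visits.map(lambda kv: (kv[0], kv[1] * 2))
    # reduceByKey: sum the values for each key
    totals = doubled.reduceByKey(add)
    # join: combine the per-key totals with a second keyed dataset
    joined = totals.join(regions)
    return sorted(joined.collect())


# Usage on a machine with pyspark installed:
#   from pyspark import SparkContext
#   sc = SparkContext.getOrCreate()
#   print(sc.version, rdd_smoke_test(sc))
```

If the call returns one `(key, (total, region))` pair per key, the basic RDD operations are working on that Spark version.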

In the next release, we will remove the restriction to 3.2.0.

SemyonSinchenko commented 2 years ago

Thanks a lot for such a fast answer. I'll write a comment here about my tests on Spark 2.3.0.