Intel-bigdata / Spark-PMoF

Spark Shuffle Optimization with RDMA+AEP
Apache License 2.0

Merge with upstream Spark? #59

Closed: Tagar closed this issue 4 years ago

Tagar commented 4 years ago

Not sure if this was discussed, but is it possible to merge this work with upstream Spark? Or is the plan to continue maintaining Spark-PMoF as a separate project?

Thank you

tanghaodong25 commented 4 years ago

@Tagar We tried to merge the persistent-memory-based shuffle manager upstream; please find the patch here: https://github.com/apache/spark/pull/24322. There's no way to build native code in Spark, so we'll maintain Spark-PMoF as an external package for Spark.

Tagar commented 4 years ago

@tanghaodong25 understood - thanks.

Perhaps the native code could be published as a separate package/dependency outside of Spark.

For example, core PySpark nowadays has a hard dependency on pyarrow (https://github.com/apache/spark/blob/master/python/setup.py#L221), which itself has native libraries.

Another precedent is the Intel MKL library, gfortran, etc., which can be installed to speed up Spark ML with native libraries.