fugue-project / fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
https://fugue-tutorials.readthedocs.io/
Apache License 2.0

[BUG] Cloudpickle is unnecessarily a hard dependency of Spark backend #478

Closed goodwanghan closed 1 year ago

goodwanghan commented 1 year ago

This change: https://github.com/fugue-project/fugue/commit/adf38849113c8936a3ed3fa138921c4a34f230b1#diff-f6002ac0db7dcceed0d45e617d3c625008cff4d1807cf64a4702821cd2aa5d17R3 introduced a hard dependency on cloudpickle in the Fugue Spark backend. However, as of pyspark 3.4.0, cloudpickle is no longer a dependency of pyspark.

So when users install fugue[spark] in a Spark 3.4 environment, Fugue may complain that Spark objects are not recognized. This happens because, without cloudpickle installed, fugue_spark cannot be imported correctly.
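To check whether an environment is affected, something like the following may help. It is only a diagnostic sketch: `has_module` is a hypothetical helper name, not part of Fugue, and it only reports whether the packages can be found, not whether fugue_spark itself imports cleanly.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# Report which of the relevant packages are present in this environment.
for name in ("pyspark", "cloudpickle"):
    status = "installed" if has_module(name) else "missing"
    print(f"{name}: {status}")
```

If pyspark is reported as installed and cloudpickle as missing, the environment matches the situation described above.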

The solution is to replace cloudpickle with the standard-library pickle for the specific usage inside fugue_spark.
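As a rough illustration of the idea (not the actual fugue_spark code), the sketch below round-trips a plain object using only the standard library. The helper names and the sample payload are hypothetical, and it assumes the objects involved are ordinary picklable values.

```python
import pickle
from typing import Any


def serialize_payload(obj: Any) -> bytes:
    """Hypothetical helper: serialize using only the standard library."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


def deserialize_payload(blob: bytes) -> Any:
    """Hypothetical helper: inverse of serialize_payload."""
    return pickle.loads(blob)


# Made-up sample data; any plain, picklable object works the same way.
payload = {"schema": "a:int,b:str", "partition_keys": ["a"]}
restored = deserialize_payload(serialize_payload(payload))
```

One caveat: unlike cloudpickle, the standard pickle module cannot serialize lambdas or locally defined functions, so this swap only works where the pickled objects are plain data or importable, module-level definitions.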