VizierDB / vizier-scala

The Vizier kernel-free notebook programming environment

Eliminate direct dependencies on pyspark #220

Open okennedy opened 2 years ago

okennedy commented 2 years ago

Per #58, we're going to want to create multiple environments, but pyspark is a really heavyweight dependency, so we should remove it if possible. Specific places where it shows up:

The last point is going to be the major problem: pyspark bundles its own copy of cloudpickle, and cloudpickle's output is version-specific (Spark won't be able to read a function serialized by a different cloudpickle version). A few ideas:

The latter is probably what makes the most sense.
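
To make the version-mismatch concern above concrete, here is a minimal sketch of tagging a serialized function with the version of the serializer that produced it, so that a mismatch fails loudly instead of producing bytes Spark can't read. This is only an illustration, not anything Vizier does today: `TaggedPayload`, `dump_tagged`, and `load_tagged` are hypothetical names, and the stdlib `pickle` stands in for cloudpickle so the snippet runs without extra packages. In practice the version string would presumably come from whichever cloudpickle copy is actually in use.

```python
import pickle
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedPayload:
    """Serialized bytes plus the version of the serializer that produced them."""
    serializer_version: str
    data: bytes


def dump_tagged(obj, serializer_version: str) -> TaggedPayload:
    # Assumption: stdlib pickle is a stand-in for cloudpickle here.
    # The caller supplies the version of the serializer it is using.
    return TaggedPayload(serializer_version, pickle.dumps(obj))


def load_tagged(payload: TaggedPayload, expected_version: str):
    # Refuse to deserialize output written by a different serializer version,
    # since that output is not guaranteed to be readable.
    if payload.serializer_version != expected_version:
        raise ValueError(
            f"payload was written by serializer {payload.serializer_version}, "
            f"but the reader expects {expected_version}"
        )
    return pickle.loads(payload.data)
```

The check does not remove the incompatibility; it only turns a confusing deserialization failure on the Spark side into an explicit error at the boundary.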

okennedy commented 1 year ago

Unfortunately, for the time being, we seem to be stuck with this. Python UDF support requires us to include type annotations from pyspark. It seems silly to force a download of a fully redundant ~300 MB copy of Spark, but removing it is going to require more effort than we can spare at the moment. Deferring.
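
If this is picked up again, one possible shape for an optional dependency is to probe for the package and import it only when a feature actually needs it. The sketch below is an assumption about how that could look, not a description of the current code: `module_available` and `require_module` are hypothetical helpers, and it does not address the UDF type-annotation requirement described above.

```python
import importlib
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be imported here."""
    # find_spec returns None for a missing top-level module without importing it.
    return importlib.util.find_spec(name) is not None


def require_module(name: str, feature: str):
    """Import `name` on demand, with a clear error if it is not installed."""
    if not module_available(name):
        raise ImportError(f"{feature} requires the optional package '{name}'")
    return importlib.import_module(name)
```

A caller would then do something like `pyspark = require_module("pyspark", "Python UDF support")` inside the UDF code path, so environments that never touch UDFs would not need the download.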