Open okennedy opened 2 years ago
Unfortunately, for the time being, we seem to be stuck with this. Python UDF support requires us to include type annotations from pyspark. It seems silly to force a download of a fully redundant ~300 MB copy of spark, but removing it is going to require more effort than we can spare at the moment. Deferring.
Per #58 we're going to want to create multiple environments... but pyspark is a really heavyweight dependency. We're going to want to remove it if possible. Specific places where it shows up:
- `client.py`:`export_module` annotates functions with pyspark type annotations (#219 should eliminate this need)
- `client.py`:`get_data_frame` uses pyspark's arrow collector to connect python cells to pandas dataframes. We might be able to get away with a direct dependency on `arrow` itself. (Possibly something to fix along with #92)
- `info.vizierdb.commands.python.PythonUDFBuilder`:`pyspark.cloudpickle` is used to serialize python functions for use with Spark.

The last point is going to be the major problem since, unfortunately, `pyspark` has its own version of `cloudpickle`, and `cloudpickle` generates version-specific output (spark won't be able to read a function serialized by a different `cloudpickle` version). A few ideas:

- … `system` python from the cell execution python. Invoke cloudpickle with the system python (i.e., don't run it in a venv).

The latter is probably what makes the most sense.
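
For illustration only, here is a rough sketch of what "invoke cloudpickle with the system python" could look like: the function source is handed to a separate interpreter, which defines it and pickles it with whatever pickler module that interpreter has installed. The helper name `pickle_with_interpreter`, its parameters, and the idea of shipping source over stdin are all assumptions made for this sketch, not how `PythonUDFBuilder` currently works.

```python
import subprocess
import sys

# Program executed by the chosen interpreter (hypothetical helper).
# argv[1] = pickler module name, argv[2] = name of the function to pickle.
# The function source arrives on stdin and is defined in __main__, which is
# where cloudpickle serializes functions by value.
_HELPER = """
import importlib, sys
pickler = importlib.import_module(sys.argv[1])
exec(sys.stdin.buffer.read().decode("utf-8"), globals())
sys.stdout.buffer.write(pickler.dumps(globals()[sys.argv[2]]))
"""


def pickle_with_interpreter(source, func_name,
                            python=sys.executable,
                            pickler="pyspark.cloudpickle"):
    """Serialize the function `func_name` defined by `source` using `python`.

    `python` defaults to the current interpreter; to escape a cell venv,
    pass the path of the system interpreter instead (deployment-specific,
    so it is left to the caller here).
    """
    result = subprocess.run(
        [python, "-c", _HELPER, pickler, func_name],
        input=source.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the helper fails
    )
    return result.stdout
```

The point of the design is that the bytes are produced by the same `cloudpickle` copy that Spark will later use to read them, regardless of what is installed in the cell's environment. A call might look like `pickle_with_interpreter(src, "my_udf", python="/usr/bin/python3")`, where the interpreter path is just an example.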