VizierDB / web-ui

Web User Interface
Apache License 2.0

Allow SQL cells to access exported python functions as UDFs #225

Closed okennedy closed 3 years ago

okennedy commented 4 years ago

Recent versions of the Vizier-API allow Python functions to be exported for import and use in later cells. Currently, exported functions can only be used from Python cells. It would be nifty if exported Python functions could also be invoked from a SQL cell as UDFs. Given that Spark supports Python function execution, this should be feasible, but some study will be required to figure out exactly how to do it.

okennedy commented 4 years ago

Privatization (several of the relevant Spark classes are package-private) may be making this a little more difficult than it needs to be, so I'm going to start by cataloguing what I know.

SparkSQL has a PythonUDF expression, which seems to take in some basic details: name, type details, arguments, etc. The UDF references a PythonFunction object which, unfortunately, is private to the spark package.

The PythonFunction object is defined as:

Function registration happens through UDFRegistration.registerPython, which is shorthand for the normal UDF registration (spark.udf.functionRegistry.createOrReplaceTempFunction(name, Seq[Expression] => Expression)) and UserDefinedPythonFunction, a utility class that creates PythonUDFs.

Assuming we can reverse engineer PythonFunction, this should be relatively straightforward.

okennedy commented 4 years ago

From the Vizier side, web-api-async uses Mimir's Blob storage to store Python UDFs. Here's an example of what a function would look like:

@vizierdb.export_module_decorator
def apply_foo(a):
    return a + 1
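To register such a function for SQL, something has to work out which names the stored plaintext exports. A minimal sketch of one way to do that with the standard `ast` module follows; the helper name `exported_function_names` is hypothetical and not part of Vizier, and it assumes the decorator is always spelled `vizierdb.export_module_decorator` as in the example above.

```python
import ast

SOURCE = '''
@vizierdb.export_module_decorator
def apply_foo(a):
    return a + 1
'''

def exported_function_names(source):
    """Return the names of top-level functions decorated with
    vizierdb.export_module_decorator (hypothetical helper)."""
    names = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.FunctionDef):
            continue
        for dec in node.decorator_list:
            # Match the attribute access `vizierdb.export_module_decorator`.
            if (isinstance(dec, ast.Attribute)
                    and dec.attr == "export_module_decorator"
                    and isinstance(dec.value, ast.Name)
                    and dec.value.id == "vizierdb"):
                names.append(node.name)
    return names

print(exported_function_names(SOURCE))  # ['apply_foo']
```

Parsing rather than executing the source means the names can be discovered without running any user code.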

okennedy commented 4 years ago

A few more notes.

Python execution seems to start with PythonRunner, which passes pythonExec (the path to the Python binary) to PythonWorkerFactory. This, in turn, starts up the Python worker module (by default, pyspark.worker), which connects via a py4j gateway.

In the worker, the actual execution happens here, where the function reads a UDF through pickle (it actually seems to read a pickled instance of the UDF wrapper class).

More precisely, it looks like python UDFs are getting serialized here, using CloudPickleSerializer, which is the head of a long chain of dependencies to CloudPickler.

In short, what we need to do to get a PythonFunction is call cloudpickle.dumps() on the function.
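As background on why a by-value serializer such as CloudPickle matters here, the sketch below uses only the standard `pickle` module to show that ordinary pickling stores a function as a reference (module plus name) and rejects functions it cannot look up by name. The helper `pickles_by_reference` is a hypothetical name used for illustration.

```python
import math
import pickle

# An importable function is pickled as a reference: the payload holds
# the module and attribute names, not the function body.
payload = pickle.dumps(math.sqrt)
print(b"math" in payload and b"sqrt" in payload)
print(pickle.loads(payload) is math.sqrt)

def pickles_by_reference(fn):
    """Return True if the standard pickler accepts fn (hypothetical helper)."""
    try:
        pickle.dumps(fn)
        return True
    except (pickle.PicklingError, AttributeError):
        return False

# A lambda has no importable name, so the standard pickler refuses it.
print(pickles_by_reference(lambda a: a + 1))
```

A worker process that receives only a reference would need the defining module on its import path, which is not the case for code typed into a notebook cell.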

okennedy commented 4 years ago

Since Vizier dumps out Python code in plaintext, we'd need to go through python anyway to pickle the result. Perhaps we don't actually want to do this...

One approach might be to just run cloudpickle.dumps() once to get a "magic" serialized function:


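def invoke_vizier_udf(*args):
  # The first argument is the plaintext source of the exported function;
  # the remaining arguments are passed through to it.
  fn = args[0]
  args = args[1:]
  # Stand-in for the vizierdb client: captures the decorated function.
  class VizierUDFWrapper:
    def __init__(self):
      self.fn = None
    def export_module_decorator(self, fn):
      self.fn = fn
      return fn
  vizierdb = VizierUDFWrapper()
  # Run the source with the stand-in bound to the name `vizierdb`.
  exec(fn, {"vizierdb": vizierdb})
  return vizierdb.fn(*args)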
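A wrapper like this would run exec on the source for every row. One possible variation, sketched here as an assumption rather than anything decided in this thread, caches the captured function per source string so the source is executed only once per worker. The names `_ExportCapture`, `compile_vizier_udf`, and `invoke_cached_vizier_udf` are hypothetical.

```python
from functools import lru_cache

class _ExportCapture:
    """Stand-in for the vizierdb client; remembers the decorated function."""
    def __init__(self):
        self.fn = None

    def export_module_decorator(self, fn):
        self.fn = fn
        return fn

@lru_cache(maxsize=128)
def compile_vizier_udf(source):
    """Execute the exported source once and return the captured function."""
    capture = _ExportCapture()
    exec(source, {"vizierdb": capture})
    if capture.fn is None:
        raise ValueError("source did not export a function")
    return capture.fn

def invoke_cached_vizier_udf(source, *args):
    # Repeated calls with the same source reuse the cached function.
    return compile_vizier_udf(source)(*args)

SOURCE = (
    "@vizierdb.export_module_decorator\n"
    "def apply_foo(a):\n"
    "    return a + 1\n"
)

print(invoke_cached_vizier_udf(SOURCE, 41))  # 42
```

Because the cache is keyed on the source text, editing the exported function in Vizier produces a new key and therefore a fresh compile.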

okennedy commented 3 years ago

image