dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[ADAP-667] [Feature] python model support via livy #821

Closed ssabdb closed 8 months ago

ssabdb commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

The `method: thrift` way of connecting to Spark SQL can never support Python models, because it is designed for JDBC clients like beeline.

The proposal is to support dbt Python models using a connection to a Livy server, which is designed to provide a RESTful API around Spark contexts that can be hosted remotely or locally.

An implementation would add a `method: livy` connection method, allowing both SQL and Python models to be submitted via Livy.
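
For context, here is a minimal sketch of the Livy REST calls such a `method: livy` connection would wrap, assuming a Livy server at localhost:8998 (the endpoints are Livy's documented API; the helper names are illustrative, not part of dbt-spark):

```python
# Minimal sketch of driving a Livy session over REST.
import time
import requests

LIVY = "http://localhost:8998"

def wait_until(url, done_states):
    """Poll a Livy resource until it reaches one of the given states."""
    while True:
        body = requests.get(url).json()
        if body["state"] in done_states:
            return body
        time.sleep(1)

# 1. Create an interactive PySpark session (one remote Spark context).
sid = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()["id"]
wait_until(f"{LIVY}/sessions/{sid}", {"idle"})

# 2. Submit a SQL statement; Livy >= 0.5 accepts a per-statement kind.
stmt = requests.post(
    f"{LIVY}/sessions/{sid}/statements",
    json={"kind": "sql", "code": "SELECT 1 AS probe"},
).json()
result = wait_until(f"{LIVY}/sessions/{sid}/statements/{stmt['id']}", {"available"})
print(result["output"])

# 3. Tear down the remote Spark context.
requests.delete(f"{LIVY}/sessions/{sid}")
```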

Describe alternatives you've considered

Spark-session support might be a way of implementing this, though it would be quite heavyweight and would suffer from the existing limitations of spark-session: the very complex configuration required to cover the dozens of potential Spark deployment scenarios is difficult to capture in dbt configuration.

I do not know whether HiveServer2 could be adapted to accept PySpark code, but I doubt it would be easy, and it was originally built to support Hive, which is a SQL dialect. Internally, dbt-spark uses PyHive to connect to HS2, which suffers from the same limitation.

Who will this benefit?

Anyone using Apache-based or other "on-prem" Spark deployments, e.g. Hadoop or standalone Spark clusters, or Amazon EMR (which supports Livy), or even as an alternative to the BigQuery implementation, which currently relies on Google's jobs API.

It would also provide a relatively vendor-agnostic interface to other Spark implementations, which would allow support for some of the more obscure cloud platforms with Hadoop offerings.

Are you interested in contributing this feature?

Yes

Anything else?

There is already an implementation of dbt-spark by Cloudera, dbt-spark-livy, which is licensed under the Apache License. It is particularly informative to examine livysession in that repository, which currently only submits to a SQL session.

An implementation would extend this capability and merge it into dbt-spark, allowing both SQL and Python models to be executed via Livy/dbt, as sketched below.
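
To make the extension concrete, here is a hypothetical sketch of how a merged implementation might route model code to Livy statement kinds (`LivySession` and `post_statement` are illustrative stand-ins, not existing dbt-spark or dbt-spark-livy classes):

```python
# Hypothetical routing layer: SQL models keep the existing behaviour,
# Python models become "pyspark" statements on the same Livy session.
class LivySession:
    def __init__(self, post_statement):
        # post_statement: a callable that POSTs to /sessions/{id}/statements
        # and blocks until the statement's output is available.
        self._post = post_statement

    def execute_sql(self, sql):
        # SQL models: what Cloudera's dbt-spark-livy already does.
        return self._post({"kind": "sql", "code": sql})

    def execute_python(self, compiled_code):
        # Python models: the new capability; the compiled model body runs
        # in the remote PySpark driver, where `spark` is predefined.
        return self._post({"kind": "pyspark", "code": compiled_code})
```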

It's surprising there doesn't seem to have been a PR to pull this implementation into this repository.

cristianodarocha commented 1 year ago

That would be amazing. I am interested in contributing to this feature.

dataders commented 1 year ago

@ssabdb any chance you could give a rundown of the pros & cons of supporting Livy vs Spark Connect (#814)?

ssabdb commented 1 year ago

@dataders, interesting, I wasn't aware of this development.

Obvious thoughts from a high-level scan of the design docs and the Spark Jira ticket:

It's not clear to me that Spark Connect can remotely execute arbitrary Python code; rather, it allows a local Python process to drive the Spark DataFrame API remotely through a lightweight client. This means, for example, that plain Python, such as the re module, numpy, or scikit-learn, would not be available on the remote side.

Spark Connect is not meant to be the generic interface for everything that Spark can do, but provides access to an opinionated subset of Spark features (design docs)

My understanding of the intent behind dbt Python models is not to support Spark specifically, but rather to enable the execution of Python in a remote runtime. An abstraction supporting only the Spark DataFrame API will, by necessity, only support calls to the Spark DataFrame API, and any calls not made against a Spark context will be executed inside the dbt process.
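
To illustrate where that boundary falls, a short sketch, assuming pyspark >= 3.4 with the connect extras and a Spark Connect server at sc://host:15002:

```python
import re
from pyspark.sql import SparkSession

# DataFrame API calls are serialized as query plans and executed remotely:
spark = SparkSession.builder.remote("sc://host:15002").getOrCreate()
rows = spark.range(10).filter("id % 2 = 0").collect()

# ...but this loop runs in the local client process; `re` (or numpy,
# scikit-learn) is never shipped to the remote runtime:
evens = [r.id for r in rows if re.match(r"\d+$", str(r.id))]
```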

However, this API does support SQL and could be used to submit SQL if required. Note though:

Spark Connect is not meant to replace the HiveServer2 interface or SQL (design docs)

The UDF implementation supports Python UDFs by spinning up Python sidecar processes as required. This means that, outside of UDFs, there doesn't seem to be support for remote Python execution.
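
For example, a Python UDF is the one place user Python does run remotely, shipped to Python worker sidecars next to the executors (a sketch, reusing the `spark` session from the snippet above):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def parity(n):
    # Runs in a Python worker process alongside the executor, not in dbt.
    return "even" if n % 2 == 0 else "odd"

spark.range(10).select(parity("id")).show()
```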

Apache Livy, on the other hand, is a remote Spark context management tool. For PySpark sessions, a Python driver process is started alongside the Spark JVM process, and that driver is available to execute arbitrary Python code.
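
Continuing the Livy sketch from earlier in the thread (same hypothetical LIVY/sid/wait_until helpers), arbitrary Python, imports included, runs in that remote driver:

```python
import requests

code = """
import re  # plain Python is available in the Livy driver process
evens = [i for i in range(10) if i % 2 == 0]
spark.createDataFrame([(i,) for i in evens], ["id"]).count()
"""
stmt = requests.post(
    f"{LIVY}/sessions/{sid}/statements",
    json={"kind": "pyspark", "code": code},
).json()
print(wait_until(f"{LIVY}/sessions/{sid}/statements/{stmt['id']}", {"available"})["output"])
```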

I do wonder, though, whether there are other remote code execution clients for Python that could themselves use Spark. It feels like support for grid computing tools, e.g. Slurm, might allow a more generic Python (or other) runtime which could then launch Spark if required.

But, to stay consistent with the other platform implementations, I'd say Livy is probably the best fit, though the project is a bit dormant. Apache Toree might work too, but it is intended for notebooks.

github-actions[bot] commented 9 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 8 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.