dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[ADAP-658] [Feature] Spark Connect as connection method #814

Open timvw opened 1 year ago

timvw commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

I would like to be able to use dbt-spark via the Spark Connect API.

Describe alternatives you've considered

We could decide not to support this

Who will this benefit?

All users that have a Spark Connect endpoint available

Are you interested in contributing this feature?

Yes

Anything else?

https://spark.apache.org/docs/latest/spark-connect-overview.html
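For context, the client side of Spark Connect is lightweight: a session is opened against a `sc://host:port` connection string over gRPC, with no local JVM. A small sketch below builds such a connection string and shows (commented, since it needs a running server) how a PySpark client session would be obtained, assuming pyspark>=3.4 and the default Connect port 15002:

```python
from typing import Optional

def spark_connect_url(host: str, port: int = 15002,
                      params: Optional[dict] = None) -> str:
    """Build a Spark Connect connection string, e.g. sc://host:15002."""
    url = f"sc://{host}:{port}"
    if params:  # optional ;key=value parameters, e.g. an auth token
        url += "/;" + ";".join(f"{k}={v}" for k, v in params.items())
    return url

# With a running Spark Connect server (and `pip install "pyspark[connect]"`),
# a client session is obtained without starting a local JVM:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(spark_connect_url("localhost")).getOrCreate()
#   spark.sql("SELECT 1 AS id").show()
```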

dataders commented 1 year ago

@timvw I agree this could unlock quite a bit for us over time. 👁️ @fokko do you know much about this new feature?

Fokko commented 1 year ago

@dataders Thanks for pinging me. I worked with Databricks' Spark connect quite a bit, and it is great to see that it is now part of Spark Open Source. I think it makes a lot of sense to add this.

ssabdb commented 1 year ago

@Fokko - I would be interested in your take on my interpretation of spark-connect's suitability in #821.

I have no experience with Spark Connect, but if the objective is to support the execution of SQL from dbt, I can see how this would work.

I'm not sure it would support Python models as presently implemented, but this is perhaps not the intent of this issue.

timvw commented 1 year ago

Closed (as the possibility to connect via Livy seems more favorable for now)

vakarisbk commented 1 year ago

Hi! I would like to reopen this discussion as I have made a PR #899 introducing support for Spark Connect SQL models (well probably should have done this before the PR, but water under the bridge now :) ).

I believe it makes sense to introduce support for Spark Connect SQL models because it unlocks an additional way of using dbt with open source Spark without many code changes on the dbt side (the implementation is based on the existing Spark Session code). Currently the only way to run dbt with open source Spark in production is a Thrift connection, so adding at least one alternative would open up dbt to more users.

Livy was also discussed as an alternative in issue #821. Livy would work well for SQL models, but the open source Livy project is pretty much dead. Some cloud providers (AWS EMR, Google Cloud Dataproc, maybe others) still expose Livy-compatible APIs, so users on those platforms would benefit from dbt Livy support. There is also a fairly new open source project called Lighter, which aims to replace Livy and exposes a Livy-compatible API.

But I don't think the question should be Spark Connect OR Livy. I think we can support both, especially since supporting Spark Connect would probably not require a lot of additional effort, since the implementation is highly tied to Spark Session, which dbt already supports.

I would like to hear what dbt and the community think about introducing Spark connect SQL models and whether it's worth supporting this feature.
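For illustration, a Spark Connect target in `profiles.yml` might look something like the sketch below. To be clear, the method name and fields here are hypothetical, not what PR #899 actually implements:

```yaml
# Hypothetical sketch only -- the method name and fields are assumptions.
my_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: connect               # assumed new connection method name
      host: spark-connect.internal  # Spark Connect endpoint (made-up host)
      port: 15002                   # default Spark Connect gRPC port
      schema: analytics
```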

vakarisbk commented 1 year ago

Regarding Python models:

Livy would be much better suited for dbt Python models as it would stick to dbt philosophy of generating the code locally and then shipping it somewhere else to execute. And it would support running arbitrary Python code remotely, not just a subset of APIs that are supported by Spark Connect, but again the open source Livy project is pretty much dead.

Spark Connect, on the other hand, is a fairly good alternative. It is limited in that it only supports the DataFrame API and, as of the latest Spark release, the pandas API on Spark and PyTorch, but maybe that's enough for most use cases? And there are always UDFs, which are also executed remotely AFAIK. It also allows easier local development, as spinning up a local Spark Connect cluster is very easy.
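As a concrete illustration of staying within that limit, a dbt Python model that uses only DataFrame operations would work unchanged over Spark Connect. The model and column names below are invented for the example; only the `def model(dbt, session)` signature is dbt's documented shape:

```python
# models/orders_enriched.py -- illustrative dbt Python model (names invented).
# By using only DataFrame operations (no RDDs, no direct SparkContext access),
# the model stays within the API surface Spark Connect can serve: the
# transformations are sent to the server as a logical plan, identical to a
# classic SparkSession.
def model(dbt, session):
    orders = dbt.ref("orders")  # dbt hands back a Spark DataFrame
    return orders.withColumnRenamed("amt", "amount")
```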

I think it would make sense to split the discussion of Python models on Spark Connect into a separate issue if anyone wants to continue discussing it.

ben-schreiber commented 9 months ago

@vakarisbk I agree 100%. I would also add two points:

  1. Since the SparkSession used for executing SQL with Spark Connect is exactly the one we would use to execute Python, the additional work needed to support dbt Python models on Spark Connect as well is low-hanging fruit.
  2. Based on what I've seen (and you mentioned), Livy is an older technology that is dying, and Thrift supports only SQL. Additionally, Spark Connect seems to be the incoming generation of technology for remotely connecting to a Spark application.

ssabdb commented 8 months ago

I proposed #821 and agree with the recommendation to split this into two separate sets of requirements: one for Spark Connect as a method to support SQL, and one for a means (Spark Connect or otherwise) to implement Python dbt models in OSS Spark.

This ticket focuses on using Spark Connect as an alternative to the Thrift server method; it only supports SQL, but would still bring advantages.

I've not tried it (but might if I get around to it), and it may well be possible to do this without any changes at all, just by setting

export SPARK_REMOTE="sc://localhost"

However, that would bring SQL support only, though it would improve on the current basic Spark session implementation.

@ben-schreiber to be clear, I think there would be a limitation of spark connect which is highlighted by @vakarisbk

Livy would be much better suited for dbt Python models as it would stick to dbt philosophy of generating the code locally and then shipping it somewhere else to execute. And it would support running arbitrary Python code remotely, not just a subset of APIs that are supported by Spark Connect

Or to put it another way, Spark Connect cannot run arbitrary Python remotely: AFAIK there's no way to access an available Python interpreter on the server, and no requirement for one to be available. That's different from the approach taken by the other connectors, which have all the relevant bits of Python executed on the remote server. Quite possibly that's an acceptable limitation, but a potentially confusing one: packages would only be installed locally, for example, even though the configuration reads as though the installation were remote.
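To make that boundary concrete: under Spark Connect, only function bodies shipped as UDFs execute on the cluster, while ordinary module-level Python runs in the local client process. A minimal sketch (the endpoint URL and names are hypothetical, and the commented part needs a live server):

```python
def normalize(name: str) -> str:
    # Called directly, this executes in the LOCAL client process, using
    # whatever packages are installed locally.
    return name.strip().lower()

# Wrapped as a UDF, the same body is pickled and executed on the REMOTE
# executors, so its dependencies must exist on the cluster instead:
#   from pyspark.sql import SparkSession
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
#   df = spark.createDataFrame([(" Alice ",)], ["name"])
#   df.withColumn("clean", udf(normalize, StringType())("name")).show()
```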

I do share the concerns around Livy's aliveness as well.

ben-schreiber commented 8 months ago

@ssabdb Agreed that there is a limitation; I think this is the key point:

Quite possibly that's an acceptable limitation but a potentially confusing one

Additionally, since there are numerous ways to connect to and use Spark, I'm not sure a "one size fits all" approach to Python DBT models for OSS Spark is the correct one. In any event, let's leave the Python model discussion for a dedicated issue (#415 ?)