Open timvw opened 1 year ago
@timvw I agree this could unlock quite a bit for us over time. 👁️ @fokko do you know much about this new feature?
@dataders Thanks for pinging me. I worked with Databricks' Spark connect quite a bit, and it is great to see that it is now part of Spark Open Source. I think it makes a lot of sense to add this.
@Fokko - I would be interested in your take on my interpretation of spark-connect's suitability in #821?
I have no experience with spark connect, but if the objective is to support the execution of SQL from DBT I can see how this would work.
I'm not sure it would support python models as presently implemented but this is perhaps not the intent of this issue.
Closed (as the possibility to connect via Livy seems more favorable for now)
Hi! I would like to reopen this discussion as I have made a PR #899 introducing support for Spark Connect SQL models (well probably should have done this before the PR, but water under the bridge now :) ).
I believe it makes sense to introduce support for Spark Connect SQL models because it unlocks an additional way of using dbt with open source Spark without many code changes on the dbt side (the implementation is based on the existing Spark Session code). Currently the only way to run dbt with open source Spark in production is a Thrift connection, so adding at least one alternative would open dbt up to more users.
Livy was also discussed as an alternative in issue #821. Livy would work well for SQL models, but the open source Livy project is pretty much dead. Some cloud providers (AWS EMR, Google Cloud Dataproc, maybe others) still expose Livy-compatible APIs, so users on those platforms would benefit from dbt Livy support. There is also a fairly new open source project called Lighter, which aims to replace Livy and exposes a Livy-compatible API.
But I don't think the question should be Spark Connect OR Livy. I think we can support both, especially since supporting Spark Connect would probably not require a lot of additional effort: the implementation is closely tied to the Spark Session code that dbt already supports.
I would like to hear what dbt and the community think about introducing Spark connect SQL models and whether it's worth supporting this feature.
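For anyone who hasn't used it, here is roughly what executing SQL over Spark Connect looks like in plain PySpark. This is a sketch only, not the code in #899; it assumes pyspark >= 3.4 with the connect extras installed (`pip install "pyspark[connect]"`) and a server on the default port, and `sc://localhost:15002` is just a placeholder endpoint:

```python
from pyspark.sql import SparkSession

# Build a Spark Connect (remote) session instead of an in-process one.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# dbt SQL models ultimately compile down to SQL statements executed like this,
# which is why the existing session-based code path maps onto Spark Connect.
spark.sql("CREATE OR REPLACE TEMP VIEW demo AS SELECT 1 AS id")
spark.sql("SELECT * FROM demo").show()
```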
Regarding Python models:
Livy would be much better suited for dbt Python models as it would stick to the dbt philosophy of generating the code locally and then shipping it somewhere else to execute. It would also support running arbitrary Python code remotely, not just the subset of APIs supported by Spark Connect, but again, the open source Livy project is pretty much dead.
Spark Connect, on the other hand, is a fairly good alternative. It is limited in that it only supports the DataFrame API and, as of the latest Spark release, the pandas API on Spark and PyTorch, but maybe that's enough for most use cases? And there are always UDFs, which are also executed remotely AFAIK. It also allows easier local development, as spinning up a local Spark Connect server is very easy.
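For context, a dbt Python model on Spark is just a function that receives a session and returns a DataFrame, so a model that sticks to the DataFrame API could in principle run over a Spark Connect session unchanged. A sketch only, not a claim about how dbt-spark would implement this; `stg_orders` is a hypothetical upstream model:

```python
def model(dbt, session):
    # dbt Python model shape: `session` would be a remote Spark Connect session here.
    dbt.config(materialized="table")

    orders = dbt.ref("stg_orders")  # hypothetical upstream model

    # Pure DataFrame API: the query plan is sent to the server and executed remotely.
    return orders.groupBy("customer_id").count()
```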
I think it would make sense to split the discussion on Python models over Spark Connect into a separate issue if anyone wants to continue discussing it.
@vakarisbk I agree 100%. I would also add two points:
I proposed #821 and agree with the recommendation to split them into two separate sets of requirements: one for Spark Connect as a method to support SQL, and one for a means (Spark Connect or otherwise) to implement Python dbt models in OSS Spark.
This ticket focuses on using Spark Connect as an alternative to the Thrift server method; even though it only supports SQL, it would still bring advantages.
I've not tried it yet (though I might if I get around to it), but it may well be possible to do this without any changes at all just by setting:
export SPARK_REMOTE="sc://localhost"
However, that would bring SQL support only; it would still improve on the current basic Spark session implementation.
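To make the idea concrete, a minimal sketch (untested, as noted above) assuming pyspark >= 3.4 with the connect extras: the Spark Connect docs describe picking up the remote endpoint from the SPARK_REMOTE environment variable, which is the same builder call the existing session method already makes. The endpoint below is a placeholder.

```python
from pyspark.sql import SparkSession

# Assumes the environment was prepared beforehand, e.g.:
#   export SPARK_REMOTE="sc://localhost:15002"
# With that variable set, recent PySpark should return a Spark Connect session
# from this call, with no code changes to the session method itself.
spark = SparkSession.builder.getOrCreate()

spark.sql("SELECT 1 AS ok").show()  # SQL executes on the remote server
```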
@ben-schreiber to be clear, I think there would be a limitation of Spark Connect, which is highlighted by @vakarisbk:
Livy would be much better suited for dbt Python models as it would stick to the dbt philosophy of generating the code locally and then shipping it somewhere else to execute. It would also support running arbitrary Python code remotely, not just the subset of APIs supported by Spark Connect
Or to put it another way, Spark Connect cannot run arbitrary Python remotely - AFAIK, there's no way to access an available Python interpreter, and no requirement for one to be available. That's different from the approach taken by the other connectors, which have all the relevant bits of Python executed on the remote server. Quite possibly that's an acceptable limitation, but a potentially confusing one - packages would only be installed locally, for example, while the configuration implies they are installed remotely.
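To illustrate the distinction, a hedged sketch (assuming pyspark >= 3.4 with the connect extras; the endpoint is a placeholder): a UDF body is serialized and executed on the Spark Connect server, while any other Python in the script only ever runs in the local client process.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

@udf(returnType=IntegerType())
def doubled(x):
    # This function body is shipped to and executed on the remote cluster,
    # so any package it imports must exist on the server side.
    return x * 2

df = spark.range(5).withColumn("twice", doubled("id"))
df.show()

# This comprehension, by contrast, runs purely in the local client process;
# packages it needs must be installed locally, not on the cluster.
local_result = [row.id ** 2 for row in df.collect()]
```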
I do share the concerns around Livy's aliveness as well.
@ssabdb Agreed that there is a limitation; I think this is the key point:
Quite possibly that's an acceptable limitation but a potentially confusing one
Additionally, since there are numerous ways to connect to and use Spark, I'm not sure a "one size fits all" approach to Python dbt models for OSS Spark is the correct one. In any event, let's leave the Python model discussion for a dedicated issue (#415?)
Is this your first time submitting a feature request?
Describe the feature
I would like to be able to use dbt (Spark) via the Spark Connect API
Describe alternatives you've considered
We could decide not to support this
Who will this benefit?
All users that have a Spark Connect endpoint available
Are you interested in contributing this feature?
Yes
Anything else?
https://spark.apache.org/docs/latest/spark-connect-overview.html