Add support for Spark Connect (SQL models)

vakarisbk commented 12 months ago

partially resolves #814 docs dbt-labs/docs.getdbt.com/#

Problem

dbt-spark has limited options for integrating with open-source Spark. Currently, the only way to run dbt against open-source Spark in production is through a Thrift connection. However, a Thrift connection isn't suitable for all use cases. For instance, it doesn't support Thrift over HTTP. Also, the PyHive project, which the thrift method relies on, is unsupported (at least according to its GitHub page).

Solution

This PR introduces support for Spark Connect (for SQL models only).

Checklist

[x] I have read the contributing guide and understand what's expected of me
[x] I have run this code in development and it appears to resolve the stated issue
[x] This PR includes tests, or tests are not required/relevant for this PR
[ ] This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

How to test locally?

Follow the instructions in the Spark documentation to download a Spark distribution: https://spark.apache.org/docs/latest/spark-connect-overview.html

Start the Spark Connect server with the Hive metastore enabled:

./start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0 --conf spark.sql.catalogImplementation=hive
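
Before pointing dbt at the server, it can help to confirm that something is actually listening. The snippet below is a minimal sketch, not part of this PR: `is_port_open` is a hypothetical helper name, and it assumes the server uses port 15002 (the port used in the profile in this description). It only checks that a TCP connection can be opened; it does not verify that the listener speaks the Spark Connect protocol.

```python
import socket


def is_port_open(host: str, port: int = 15002, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for local testing. A True result only means that
    some process accepts connections on that port.
    """
    try:
        # create_connection raises OSError (refused, timeout, DNS failure).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open("localhost")` after running the start script should return True once the server has finished starting up.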

Add the Spark Connect configuration to your profiles.yml:

spark_connect:
  outputs:
    dev:
      host: localhost
      method: connect
      port: 15002
      schema: default
      type: spark
  target: dev
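
To check the same host and port outside of dbt, a plain PySpark client can be used. This is a rough sketch under stated assumptions, not the adapter code from this PR: `spark_connect_url` and `smoke_test` are hypothetical helper names, and the smoke test assumes PySpark 3.4 or later installed with the connect extras (`pip install "pyspark[connect]"`). The `sc://host:port` form is the Spark Connect connection string, and `SparkSession.builder.remote` is the PySpark entry point for it.

```python
def spark_connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect connection string from profile-style values."""
    return f"sc://{host}:{port}"


def smoke_test(url: str) -> int:
    """Run a trivial query against a running Spark Connect server.

    Imports pyspark lazily so that building the URL does not require it.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(url).getOrCreate()
    try:
        # Collect a single row and read the aliased column back.
        return spark.sql("SELECT 1 AS ok").collect()[0]["ok"]
    finally:
        spark.stop()
```

With the server from the previous step running, `smoke_test(spark_connect_url("localhost", 15002))` should complete without raising; if it fails, the problem is likely in the server setup rather than in the dbt profile.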

Known issues: #901

cla-bot[bot] commented 12 months ago

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Vakaris. This is most likely caused by a git client misconfiguration; please make sure to:

Check if your git client is configured with an email to sign commits: git config --list | grep email
If not, set it up using: git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

vakarisbk commented 7 months ago

Seeing as there is some recent activity on issue #814, and knowing that at least a couple of people are actively using this fork, I've updated it. I'm looking forward to any feedback on the implementation, as well as on the likelihood of this PR getting merged.

dbt-labs / dbt-spark