databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Connection behind the proxy #373

Closed mpzrtauio closed 1 year ago

mpzrtauio commented 1 year ago

Describe the bug

I am trying to connect to Azure Databricks from behind a proxy, and the connection fails. I set the http_proxy, https_proxy, HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables, then tested with dbt --debug debug:

dbt --debug debug 
22:44:53  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f56b600fa60>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f56b41caca0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f56b41ca490>]}
22:44:53  Running with dbt=1.5.1
22:44:53  running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'log_cache_events': 'False', 'write_json': 'True', 'partial_parse': 'True', 'cache_selected_only': 'False', 'warn_error': 'None', 'version_check': 'True', 'fail_fast': 'False', 'log_path': '/tmp/repo/dbt/logs', 'debug': 'True', 'profiles_dir': '/tmp/repo/dbt', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'log_format': 'default', 'introspect': 'True', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'static_parser': 'True', 'target_path': 'None', 'send_anonymous_usage_stats': 'True'}
22:44:53  dbt version: 1.5.1
22:44:53  python version: 3.8.16
22:44:53  python path: /usr/local/bin/python
22:44:53  os info: Linux-5.18.13-200.fc36.x86_64-x86_64-with-glibc2.2.5
22:44:53  Using profiles.yml file at /tmp/repo/dbt/profiles.yml
22:44:53  Using dbt_project.yml file at /tmp/repo/dbt/dbt_project.yml
22:44:53  Configuration:
22:44:54    profiles.yml file [OK found and valid]
22:44:54    dbt_project.yml file [OK found and valid]
22:44:54  Required dependencies:
22:44:54  Executing "git --help"
22:44:54  STDOUT: "b"usage: git [--version] [--help] [-C <path>] [-c <name>=<value>]\n           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]\n           [-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]\n           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]\n           <command> [<args>]\n\nThese are common Git commands used in various situations:\n\nstart a working area (see also: git help tutorial)\n   clone             Clone a repository into a new directory\n   init              Create an empty Git repository or reinitialize an existing one\n\nwork on the current change (see also: git help everyday)\n   add               Add file contents to the index\n   mv                Move or rename a file, a directory, or a symlink\n   restore           Restore working tree files\n   rm                Remove files from the working tree and from the index\n   sparse-checkout   Initialize and modify the sparse-checkout\n\nexamine the history and state (see also: git help revisions)\n   bisect            Use binary search to find the commit that introduced a bug\n   diff              Show changes between commits, commit and working tree, etc\n   grep              Print lines matching a pattern\n   log               Show commit logs\n   show              Show various types of objects\n   status            Show the working tree status\n\ngrow, mark and tweak your common history\n   branch            List, create, or delete branches\n   commit            Record changes to the repository\n   merge             Join two or more development histories together\n   rebase            Reapply commits on top of another base tip\n   reset             Reset current HEAD to the specified state\n   switch            Switch branches\n   tag               Create, list, delete or verify a tag object signed with GPG\n\ncollaborate (see also: git help workflows)\n   fetch             Download objects and refs from another repository\n   
pull              Fetch from and integrate with another repository or a local branch\n   push              Update remote refs along with associated objects\n\n'git help -a' and 'git help -g' list available subcommands and some\nconcept guides. See 'git help <command>' or 'git help <concept>'\nto read about a specific subcommand or concept.\nSee 'git help git' for an overview of the system.\n""
22:44:54  STDERR: "b''"
22:44:54   - git [OK found]

22:44:54  Connection:
22:44:54    host: adb-***.3.azuredatabricks.net
22:44:54    http_path: sql/protocolv1/o/***/***
22:44:54    schema: abc
22:44:54  Acquiring new databricks connection 'debug'
22:44:54  Using databricks connection "debug"
22:44:54  On debug: select 1 as id
22:44:54  Opening a new connection, currently in state init
22:44:55  Databricks adapter: <class 'databricks.sql.exc.RequestError'>: Error during request to server
22:44:55  Databricks adapter: attempt: 1/30
22:44:55  Databricks adapter: bounded-retry-delay: None
22:44:55  Databricks adapter: elapsed-seconds: 0.8554244041442871/900.0
22:44:55  Databricks adapter: error-message: 
22:44:55  Databricks adapter: http-code: 403
22:44:55  Databricks adapter: method: OpenSession
22:44:55  Databricks adapter: no-retry-reason: non-retryable error
22:44:55  Databricks adapter: original-exception: 
22:44:55  Databricks adapter: query-id: None
22:44:55  Databricks adapter: session-id: None
22:44:55  Databricks adapter: Error while running:
select 1 as id
22:44:55  Databricks adapter: Database Error
  Error during request to server
22:44:55  On debug: No close available on handle
22:44:55    Connection test: [ERROR]

22:44:55  1 check failed:
22:44:55  dbt was unable to connect to the specified database.
The database returned the following error:

  >Runtime Error
  Database Error
    Error during request to server

Check your database credentials and try again. For more information, visit:
https://docs.getdbt.com/docs/configure-your-profile

22:44:55  Command `dbt debug` failed at 22:44:55.582692 after 2.35 seconds
22:44:55  Connection 'debug' was properly closed.
22:44:55  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f56b600fa60>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f56981ac880>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f56981b0b50>]}
22:44:55  Flushing usage events
22:44:55  Error sending anonymous usage statistics. Disabling tracking for this execution. If you wish to permanently disable tracking, see: https://docs.getdbt.com/reference/global-configs#send-anonymous-usage-stats.

I used exactly the same configuration on another VM that is not behind a proxy, and everything worked there.

Steps To Reproduce

On a VM that is behind a proxy, configure the Databricks connection in profiles.yml and run the dbt --debug debug command.
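
Before reproducing, it can help to confirm which proxy variables the Python process actually sees. The following is a minimal sketch using only the standard library; the function name `collect_proxy_settings` is hypothetical and is not part of dbt or databricks-sql-connector.

```python
import os
from typing import Dict, Mapping

# Base names of the variables mentioned in this report; both the lower-case
# and upper-case spellings are checked.
PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def collect_proxy_settings(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the non-empty proxy-related variables found in `environ`."""
    found: Dict[str, str] = {}
    for name in PROXY_VARS:
        for candidate in (name, name.upper()):
            value = environ.get(candidate)
            if value:
                found[candidate] = value
    return found


if __name__ == "__main__":
    settings = collect_proxy_settings()
    if not settings:
        print("No proxy variables are visible to this process.")
    for key, value in settings.items():
        print(f"{key}={value}")
```

Running this with the same user and shell that runs dbt shows whether the variables reach the process at all, which separates an environment problem from a connector problem.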

Expected behavior

The 'dbt --debug debug' command should produce the following logs, as it does on the VM that is not behind a proxy:

22:53:42  Acquiring new databricks connection 'debug'
22:53:42  Using databricks connection "debug"
22:53:42  On debug: select 1 as id
22:53:42  Opening a new connection, currently in state init 
22:57:19  SQL status: OK in 216.2899932861328 seconds 
22:57:19  On debug: Close 
22:57:19    Connection test: [OK connection ok] 

Screenshots and log output

It can be found above.

System information

The output of dbt --version:

Core:
  - installed: 1.5.1
  - latest:    1.5.1 - Up to date!

Plugins:
  - postgres:   1.5.1 - Up to date!
  - databricks: 1.5.4 - Up to date!
  - spark:      1.5.0 - Up to date!

The operating system you're using (uname -a):

Linux airflow2-worker-1 5.18.13-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jul 22 14:03:36 UTC 2022 x86_64 GNU/Linux

The output of python --version:

Python 3.8.16

williamxnguyen commented 1 year ago

Having the same issue.

mpzrtauio commented 1 year ago

I found a workaround: downgrading the databricks-sql-connector package makes access to Databricks from behind the proxy work, in the following combination:

susodapop commented 1 year ago

Thanks for reporting this. This regression is caused by databricks-sql-connector as of 2.6.0. For that release we rewrote the Thrift HTTP handler to use a connection pool from urllib3. That rewrite improved connector performance by about 50% but also meant rewriting our proxy support.
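
To illustrate what proxy support has to decide for each request, here is a simplified, standard-library-only sketch. It is not the connector's actual code: the function `choose_proxy` is hypothetical, and NO_PROXY entries are matched only as exact hosts or domain suffixes.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit


def choose_proxy(url: str, environ: Mapping[str, str]) -> Optional[str]:
    """Return the proxy URL to use for `url`, or None for a direct connection."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Hosts listed in NO_PROXY bypass the proxy entirely.
    no_proxy = environ.get("no_proxy") or environ.get("NO_PROXY") or ""
    for entry in (item.strip().lower() for item in no_proxy.split(",")):
        if not entry:
            continue
        if entry == "*":
            return None
        suffix = entry.lstrip(".")
        if host == suffix or host.endswith("." + suffix):
            return None

    # Otherwise pick the proxy configured for the URL scheme, if any.
    key = f"{parts.scheme}_proxy"
    return environ.get(key) or environ.get(key.upper())
```

An HTTP client built on a connection pool needs to make this choice when it creates the pool, because a proxied pool and a direct pool are set up differently.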

You can downgrade to an older version of dbt-databricks to make proxies work again, but of course we'd like to fix the proxy support so you can take advantage of the performance improvements in newer databricks-sql-connector releases.

To debug this, we need to see the actual log messages from databricks-sql-connector, which dbt-databricks doesn't currently surface in any meaningful way. To proceed with this fix, we need to follow these steps:

  1. Merge this PR, which captures the databricks-sql-connector log messages: https://github.com/databricks/dbt-databricks/pull/364
  2. Release a new patch release of dbt-databricks that incorporates this change
  3. Attempt to connect behind a proxy and capture the newly available logs
  4. Use those logs to patch databricks-sql-connector
  5. Update dbt-databricks to use the newer patch of databricks-sql-connector
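
For anyone calling the connector directly from Python rather than through dbt, step 3 can be approximated with the standard logging module. This is a sketch under one assumption: that the connector's loggers are named after its modules and therefore sit under `databricks.sql`. The helper name and the handler marker attribute are hypothetical.

```python
import logging
import sys


def enable_connector_debug_logging(logger_name: str = "databricks.sql") -> logging.Logger:
    """Send DEBUG-level records from `logger_name` and its children to stderr."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)

    # Attach the handler only once, so repeated calls do not duplicate output.
    if not any(getattr(h, "_proxy_debug", False) for h in logger.handlers):
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        handler._proxy_debug = True  # marker attribute, used only by this helper
        logger.addHandler(handler)
    return logger
```

Call `enable_connector_debug_logging()` before opening a connection in the same process; it has no effect on a separate dbt invocation.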

pkupidura commented 1 year ago

Hey @susodapop, I've noticed that databricks-sql-connector==2.8.0 fixed the issue with proxies (as mentioned in the release notes).

Is there a chance of a dbt-databricks==1.6.0 release with the fixed connector?

benc-db commented 1 year ago

@pkupidura dbt-databricks 1.6.3 updates to databricks-sql-connector 2.9.3 👍
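
To check whether an environment falls in the affected range, a small sketch like the following may help. The version bounds are taken from this thread only (regression as of 2.6.0, reported fixed in 2.8.0); the helper names are hypothetical and the version parsing is deliberately simplistic.

```python
from importlib import metadata
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading numeric parts of a version string, padded to three parts."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def affected_by_proxy_regression(connector_version: str) -> bool:
    """True if the version lies in the range this thread reports as broken."""
    return (2, 6, 0) <= parse_version(connector_version) < (2, 8, 0)


if __name__ == "__main__":
    try:
        installed = metadata.version("databricks-sql-connector")
    except metadata.PackageNotFoundError:
        print("databricks-sql-connector is not installed")
    else:
        status = "affected" if affected_by_proxy_regression(installed) else "not affected"
        print(f"databricks-sql-connector {installed}: {status}")
```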