**Closed** by wklimowicz 1 month ago
After some iteration I've made some tweaks, and also got a working proof of concept for a Shiny app on Posit Connect with a Databricks connection.

Local setup requires 3 environment variables:
```
DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID=/sql/1.0/warehouses/abc123...
DATABRICKS_TOKEN=dapi123abc...
```
These can be set either in "Edit environment variables for your account" on Windows, or in the global `.Renviron` using `usethis::edit_r_environ()`. We could adapt the snippet in the original post if we wanted to add that to the dfeR package.
Running on Posit Connect requires 4 environment variables, created in the Posit Connect Vars pane:
```
DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID=/sql/1.0/warehouses/abc123...
DATABRICKS_CLIENT_ID=<azure app client id>
DATABRICKS_CLIENT_SECRET=<admin created databricks token>
```
The Client ID can be generated by creating an App in App Registrations in the Azure Portal. Put in a service ticket to get a `DATABRICKS_CLIENT_SECRET` created by an admin. Note that the App needs database/schema-level permissions, as well as warehouse permissions.
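Before wiring the service principal into Posit Connect, one way to sanity-check that the admin-created credentials actually work is the Databricks CLI, using the same `databricks catalogs list` call as the pipeline YAML below. A sketch, assuming the CLI is installed and the placeholder values are filled in:

```shell
# The Databricks CLI picks these up from the environment (OAuth M2M auth):
export DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
export DATABRICKS_CLIENT_ID=<azure app client id>
export DATABRICKS_CLIENT_SECRET=<admin created databricks token>

# Lists catalogs if authentication and permissions are correct
databricks catalogs list
```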
Whether running remotely or locally, the code stays the same, and it should work across people using different hosts, SQL warehouses, personal access tokens, etc.
```r
library(tidyverse)
library(odbc)

con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID")
)

odbcListObjects(con)
```
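Once connected, `con` behaves like any DBI connection, so the same query code also works everywhere. A minimal smoke test (the catalog/schema/table names below are placeholders, not real objects):

```r
# A trivial round-trip query to confirm the warehouse responds
DBI::dbGetQuery(con, "SELECT 1 AS ok")

# Query a real table (three-part name is a placeholder)
DBI::dbGetQuery(con, "SELECT * FROM my_catalog.my_schema.my_table LIMIT 5")
```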
---

I want to suggest a change in how we advise people to set up an ODBC connection to Databricks. The current guidance on local RStudio-to-Databricks setup has two big drawbacks:
## Alternative (Now Official) Method using Environment Variables
A recent addition to the `odbc` package means it's now possible to get a Databricks ODBC connection set up with three environment variables; an extra variable gets a Spark personal cluster working too. This method is supported by the official guidance on the Posit Solutions website, which recommends using the `odbc::databricks()` function alongside environment variables.

Environment variables can be set in many ways, but the simplest is probably to use `usethis::edit_r_environ()` and the global `.Renviron`.

The naming of these variables is either enforced or standard, meaning we should enforce them to be the same for everyone:

- `DATABRICKS_TOKEN` and `DATABRICKS_HOST` are enforced by R packages (`odbc::databricks()`)
- `DATABRICKS_CLUSTER_ID` follows the naming convention here: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/r/
- `DATABRICKS_SQL_WAREHOUSE_ID` follows the naming convention here: https://docs.databricks.com/en/dev-tools/sql-execution-tutorial.html

With these 4 variables, the R code would look like this. Importantly, it wouldn't have to change between different people, or even remote environments like an Azure DevOps Pipeline or Posit Connect, as long as the environment variables are set correctly for each person and machine.
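The connection code is the same snippet as in the comment at the top of this thread; everything machine-specific comes from the environment, so nothing is hard-coded:

```r
library(tidyverse)
library(odbc)

# Host and credentials are read from DATABRICKS_* environment variables
con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID")
)

odbcListObjects(con)
```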
This means we can completely omit the manual ODBC connection setup using "ODBC Data Sources" in Windows, and just ask people to note down the relevant information and paste it into the `.Renviron` file.

The other key benefit will be easier setup for Shiny apps and DevOps pipelines, because all the setup is done via environment variables, which can be added in the YAML or in the pipeline secrets. Here's an example DevOps Pipelines YAML:
### DevOps Pipeline YAML
Note: This hasn't been tested successfully yet since it needs a Client ID and Client Secret created by admins in the Azure Portal.

```yaml
trigger:
  branches:
    include:
      - "main"

pool:
  vmImage: ubuntu-latest

container:
  image: rocker/geospatial:latest # Most packages and speeds up setup

variables:
  _R_CHECK_FORCE_SUGGESTS_: 'FALSE'
  MAKEFLAGS: -j 2
  CI: TRUE # This makes `testthat::skip_on_ci()` work.
  # DATABRICKS_TOKEN: defined in secrets
  # DATABRICKS_CLIENT_SECRET: defined in secrets
  DATABRICKS_HOST: adb-5037484389568426.6.azuredatabricks.net
  DATABRICKS_CLUSTER_ID: abc12345
  DATABRICKS_SQL_WAREHOUSE_ID: abc12345
  DATABRICKS_CLIENT_ID: abc12345

steps:
  - bash: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh
    displayName: Setup databricks cli
  - bash: databricks catalogs list
    displayName: Test databricks cli
```

If this change is agreed, we can also help analysts with creating the environment variables with a function (for example in `dfeR`):

### Automate Environment Variable Setup
```r
setup_databricks <- function(scope = c("user", "project")) {
  # Copied from usethis::edit_r_environ
  path <- usethis:::scoped_path_r(scope, ".Renviron", envvar = "R_ENVIRON_USER")

  usethis::ui_info("Copy and fill out the following:")
  cat("\n")

  lines_to_write <- "DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID="
}
```

I'd welcome any thoughts on this change, since it's a fairly major one in how we advise people to set up a connection.
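Alongside a setup helper, a lightweight check could warn analysts before they attempt to connect. This is a hypothetical companion sketch (the function name `check_databricks_env` is not an existing dfeR function), using the variable names discussed above:

```r
# Sketch: warn if any of the standard Databricks environment variables are
# unset; returns TRUE/FALSE invisibly so it can be used programmatically.
check_databricks_env <- function(vars = c("DATABRICKS_HOST",
                                          "DATABRICKS_TOKEN",
                                          "DATABRICKS_SQL_WAREHOUSE_ID")) {
  values <- Sys.getenv(vars)
  missing <- vars[values == ""]
  if (length(missing) > 0) {
    warning("Missing environment variables: ",
            paste(missing, collapse = ", "))
  }
  invisible(length(missing) == 0)
}
```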