dfe-analytical-services / analysts-guide

A static website to hold guidance, helpful links and code examples for analysts working in DfE.
https://dfe-analytical-services.github.io/analysts-guide/
MIT License

Changing RStudio/Databricks connection guidance #57

Closed wklimowicz closed 1 month ago

wklimowicz commented 4 months ago

I want to suggest a change in how we advise people to set up an ODBC connection to Databricks. The current guidance on local RStudio to Databricks setup has two big drawbacks:

  1. Since it doesn't enforce a naming convention, it's likely that people in the same team/unit/division will name their connections differently, and code won't run across computers easily. The example code leaves this choice up to the analyst:
con <- DBI::dbConnect(
  odbc::odbc(),
  "INSERT_ODBC_DRIVER_NAME", # name of the ODBC driver - you chose this when you installed
  catalog = "INSERT_CATALOG"
)
  2. Related to the above, setting up a connection in a remote environment (e.g. DevOps pipelines) will be tricky, because we won't be able to add an ODBC connection in the same way we can locally on Windows. We need a way of setting up ODBC that can also be done via something like a YAML pipeline.

Alternative (Now Official) Method using Environment Variables

A recent addition to the odbc package means it's now possible to set up a Databricks ODBC connection with three environment variables, and a fourth variable gets a Spark personal cluster connection working too. This method is supported by the official guidance on the Posit Solutions website, which recommends using the odbc::databricks() function alongside environment variables.

Environment variables can be set in many ways, but the simplest is probably to use usethis::edit_r_environ() and the global .Renviron:

DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID=<sql-warehouse-id>
DATABRICKS_TOKEN=<personal-access-token>
DATABRICKS_CLUSTER_ID=<personal-cluster-id>
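
After restarting R, it's easy to sanity-check that the variables have been picked up (a quick sketch, using the naming convention above):

```r
# Check the Databricks environment variables are visible to R
vars <- c(
  "DATABRICKS_HOST",
  "DATABRICKS_SQL_WAREHOUSE_ID",
  "DATABRICKS_TOKEN",
  "DATABRICKS_CLUSTER_ID"
)
Sys.getenv(vars) != ""  # TRUE for each variable that has been set
```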

The naming of these variables is either enforced by the packages or is the accepted standard, so we should standardise on exactly these names for everyone.

With these 4 variables, the R code would look like this. Importantly, it wouldn't have to change between different people, or even remote environments like an Azure DevOps Pipeline or Posit Connect, as long as the environment variables are set correctly for each person and machine.

# Connect to ODBC SQL Warehouse
library(DBI)
library(odbc)

con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = paste0(
      "/sql/1.0/warehouses/",
      Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID")
  )
)

odbcListObjects(con)

# Connect to Personal Compute
library(dplyr)
library(dbplyr)
library(sparklyr)

sc <- spark_connect(
  cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID"),
  method = "databricks_connect"
)

tbl(sc, in_catalog("catalog_10_gold", "information_schema", "columns")) |>
  colnames()
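
Once the warehouse connection is open, querying works the same way for everyone, e.g. with dplyr/dbplyr. A minimal sketch (the schema and table names here are made up; only `catalog_10_gold` comes from the example above):

```r
library(dplyr)
library(dbplyr)

# Query a table through the warehouse connection from above
tbl(con, in_catalog("catalog_10_gold", "some_schema", "some_table")) |>
  head(10) |>
  collect()
```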

This means we can completely omit the manual ODBC connection setup using "ODBC Data Sources" in Windows, and just ask people to note down the relevant information and paste it into the .Renviron file.

The other key benefit will be easier setup for Shiny apps and DevOps pipelines, because all the setup is done via environment variables, which can just be added in the YAML or in the pipeline secrets. Here's an example Azure DevOps pipeline YAML:

DevOps Pipeline YAML

Note: this hasn't been tested successfully yet, since it needs a Client ID and Client Secret created by admins in the Azure Portal.

```yaml
trigger:
  branches:
    include:
      - "main"

pool:
  vmImage: ubuntu-latest

container:
  image: rocker/geospatial:latest # Most packages and speeds up setup

variables:
  _R_CHECK_FORCE_SUGGESTS_: 'FALSE'
  MAKEFLAGS: -j 2
  CI: TRUE # This makes `testthat::skip_on_ci()` work.
  # DATABRICKS_TOKEN: defined in secrets
  # DATABRICKS_CLIENT_SECRET: defined in secrets
  DATABRICKS_HOST: adb-5037484389568426.6.azuredatabricks.net
  DATABRICKS_CLUSTER_ID: abc12345
  DATABRICKS_SQL_WAREHOUSE_ID: abc12345
  DATABRICKS_CLIENT_ID: abc12345

steps:
  - bash: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh
    displayName: Setup databricks cli
  - bash: databricks catalogs list
    displayName: Test databricks cli
```

If this change is agreed, we can also help analysts create the environment variables with a function (for example in dfeR):

Automate Environment Variable Setup

```r
setup_databricks <- function(scope = c("user", "project")) {
  # Copied from usethis::edit_r_environ
  path <- usethis:::scoped_path_r(scope, ".Renviron", envvar = "R_ENVIRON_USER")

  usethis::ui_info("Copy and fill out the following:")
  cat("\n")

  lines_to_write <- "DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID=
DATABRICKS_TOKEN=
DATABRICKS_CLUSTER_ID=
"

  usethis::ui_code_block(lines_to_write)

  # For 99% of people, this file won't exist -- automate the paste step
  if (!file.exists(path)) {
    fileConn <- file(path)
    writeLines(lines_to_write, fileConn)
    close(fileConn)
  }

  cat("\n")
  usethis::ui_todo("Fill out the environment variables, save, and restart R for changes to take effect")
  usethis::edit_file(path)

  invisible(path)
}
```

I'd welcome any thoughts on this change, since it's a fairly major shift in how we advise people to set up a connection.

wklimowicz commented 2 months ago

After some iteration I've made some tweaks, and also got a working proof of concept for a Shiny app on Posit Connect with a Databricks connection.

Local Setup

Local setup requires 3 environment variables:

DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID=/sql/1.0/warehouses/abc123...
DATABRICKS_TOKEN=dapi123abc...

These can be set either in "Edit environment variables for your account" in Windows, or in the global .Renviron using usethis::edit_r_environ(). We could adapt the snippet in the original post if we wanted to add that to the dfeR package, as sketched below.
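
For example, something along these lines could live in dfeR (a rough sketch only; the function name is hypothetical, and it just trims the snippet from the original post down to the three local variables):

```r
setup_databricks_local <- function() {
  # Hypothetical dfeR helper: writes a template of the three local
  # Databricks variables into the user-level .Renviron
  path <- file.path(Sys.getenv("HOME"), ".Renviron")

  template <- "DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID=
DATABRICKS_TOKEN=
"

  usethis::ui_info("Copy and fill out the following:")
  usethis::ui_code_block(template)

  # Only write the template if the file doesn't already exist
  if (!file.exists(path)) {
    writeLines(template, path)
  }

  usethis::ui_todo("Fill out the variables, save, and restart R for changes to take effect")
  usethis::edit_file(path)
  invisible(path)
}
```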

Posit Connect Databricks

This requires 4 environment variables, created in the Posit Connect Vars pane:

DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID=/sql/1.0/warehouses/abc123...
DATABRICKS_CLIENT_ID=<azure app client id>
DATABRICKS_CLIENT_SECRET=<admin created databricks token>

The Client ID can be generated by creating an App in App Registrations on the Azure Portal. Put in a service ticket to get a DATABRICKS_CLIENT_SECRET created by an admin. Note the App needs to be given database/schema-level permissions, and also warehouse permissions; a sketch of the sort of grants involved is below.
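
For reference, the database/schema permissions amount to Unity Catalog grants run by someone with admin rights; the warehouse "Can use" permission is set separately in the Databricks UI. A rough sketch only (the catalog/schema names and service principal ID are placeholders, and `admin_con` is an admin's own connection):

```r
# Run by an admin against their own Databricks connection (admin_con)
# <application-id> is the Azure App's client ID, i.e. the service principal
DBI::dbExecute(admin_con, "GRANT USE CATALOG ON CATALOG catalog_40_copper_tad TO `<application-id>`")
DBI::dbExecute(admin_con, "GRANT USE SCHEMA ON SCHEMA catalog_40_copper_tad.swfc TO `<application-id>`")
DBI::dbExecute(admin_con, "GRANT SELECT ON SCHEMA catalog_40_copper_tad.swfc TO `<application-id>`")
```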

R Connection Code

Whether running remotely or locally, the code stays the same. This should work across people using different hosts, SQL warehouses, personal access tokens, etc.

library(tidyverse)
library(odbc)

con <- DBI::dbConnect(
    odbc::databricks(),
    httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID")
)

odbcListObjects(con)
Using Personal Compute Cluster Locally

This only works locally; setting it up on Posit Connect would require it to have `pysparklyr`.

```bash
DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_CLUSTER_ID=
DATABRICKS_TOKEN=
```

```r
library(tidyverse)
library(sparklyr)
library(dbplyr)

sc <- spark_connect(
  cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID"),
  method = "databricks_connect"
)

contract <- tbl(sc, in_catalog("catalog_40_copper_tad", "swfc", "contract"))
```
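
For the Shiny app proof of concept on Posit Connect mentioned above, the pattern inside the app is the same: connect once at startup using the environment variables from the Vars pane. A minimal sketch (not the actual proof-of-concept app; the table names are the same placeholders as above):

```r
library(shiny)
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)

# Connection is made once when the app starts; on Posit Connect the
# DATABRICKS_* variables come from the Vars pane
con <- dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID")
)
onStop(function() dbDisconnect(con))

ui <- fluidPage(
  titlePanel("Databricks connection demo"),
  tableOutput("preview")
)

server <- function(input, output, session) {
  output$preview <- renderTable({
    tbl(con, in_catalog("catalog_40_copper_tad", "swfc", "contract")) |>
      head(10) |>
      collect()
  })
}

shinyApp(ui, server)
```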