dfe-analytical-services / analysts-guide

A static website to hold guidance, helpful links and code examples for analysts working in DfE.
https://dfe-analytical-services.github.io/analysts-guide/
MIT License

Updates to RStudio/Databricks connection guidance #78

Closed jen-machin closed 2 months ago

jen-machin commented 3 months ago

Overview of changes

#57 raised suggested changes to the existing guidance on setting up ODBC connections between RStudio and Databricks. The proposed approach avoids inconsistent naming conventions for connections and allows ODBC connections to be made in remote environments (e.g. DevOps pipelines).

Detailed description of changes

The guidance was originally drafted by @dfe-nt (thanks Nick!) but has been heavily modified to account for the new changes. The main updates are:

It is anticipated that eventually the .Renviron changes will form part of dfeR, but we wanted to get at least some guidance onto the guide for now, and can modify it to suit once changes have been made in dfeR.
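
For context, the pattern this describes is roughly the sketch below: connection details live in environment variables (in .Renviron locally, or as pipeline variables in DevOps), and the R code only ever reads them with Sys.getenv(), so the same script runs unchanged in both places. This is purely illustrative and the exact variable names are debated further down this thread.

# Illustrative .Renviron entries (placeholder values, not real credentials):
# DATABRICKS_HOST=adb-xxxxxxxxxxxxxxxx.x.azuredatabricks.net
# DATABRICKS_TOKEN=<personal access token>
# DATABRICKS_SQL_PATH=/sql/1.0/warehouses/<warehouse-id>

library(odbc)

# odbc::databricks() can pick up DATABRICKS_HOST and DATABRICKS_TOKEN from the
# environment itself, so only the HTTP path of the compute needs to be passed
con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_PATH")
)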

Areas to focus on when reviewing

Checklist before requesting a review

dfe-nt commented 3 months ago

Guidance in its current form doesn't work. The environment variable you need to define to actually connect (the HTTP path) isn't mentioned here, and passing the cluster ID to odbc::databricks() will result in an error.

Similarly, it currently states that sparklyr is needed to use a personal cluster - this is incorrect. The exact same method works for both.

I would recommend avoiding any mention of sparklyr on this page as it's a completely different method that warrants its own guidance. In addition, I'd want some more testing of installing sparklyr on 'fresh' laptops, as when I tried it, it took a month of faffing around in Python virtual environment hell on calls with Posit to get it working. This was when the library and its dependencies were newer, so this may have improved, but I would caution against casually throwing out "use sparklyr" as the solution for using personal clusters.

I've done some adjustments in a new branch called databricks_new. This brings the guidance for SQL warehouses and personal clusters into a single page, and has been tested by myself and my GSS intern (who has a 'fresh' laptop), so we can be confident it's correct and working.
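
For anyone following along, a minimal sketch of the distinction being made here, with environment variable names borrowed from later in the thread (treat them as assumptions):

library(odbc)

# Does NOT work: odbc::databricks() expects the compute's HTTP path, so passing
# a bare cluster ID as httpPath produces a connection error.
# con <- DBI::dbConnect(odbc::databricks(), httpPath = Sys.getenv("DATABRICKS_CLUSTER_ID"))

# Works: pass the full HTTP path of the cluster (or SQL warehouse) instead.
con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_CLUSTER_PATH")
)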

wklimowicz commented 3 months ago

For what it's worth, when I tried sparklyr and Python recently the process of creating virtual environments was very smooth, but I agree it's a different method and could be separate. Most people will only need the odbc method.

As an aside, do we understand why exactly someone would use a personal cluster over an SQL warehouse when connecting locally with R? The only use case I could find is that you can interact with Unity Catalog volumes using personal clusters, but can't with odbc. Speed and functionality seem to be the same otherwise, although personal compute takes much longer to start up compared to a SQL warehouse.

If that's the case (volumes are the only difference), we could just point people to that explicitly, saying "if you need to interact with Unity Catalog volumes, use sparklyr + personal compute".
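
For reference, the sparklyr + personal compute route being suggested would look very roughly like the sketch below. This assumes sparklyr's Databricks Connect support (which needs the pysparklyr Python backend installed), environment variable names used later in this thread, and a made-up volume path.

library(sparklyr)

# Connect to a personal cluster via Databricks Connect; DATABRICKS_HOST and
# DATABRICKS_TOKEN are read from the environment by the connection method
sc <- spark_connect(
  cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID"),
  method = "databricks_connect"
)

# Files in a Unity Catalog volume can then be read through Spark
# (hypothetical catalog/schema/volume names)
dat <- spark_read_csv(
  sc,
  name = "my_volume_file",
  path = "/Volumes/my_catalog/my_schema/my_volume/my_file.csv"
)

spark_disconnect(sc)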

dfe-nt commented 3 months ago

> For what it's worth, when I tried sparklyr and Python recently the process of creating virtual environments was very smooth, but I agree it's a different method and could be separate. Most people will only need the odbc method.
>
> As an aside, do we understand why exactly someone would use a personal cluster over an SQL warehouse when connecting locally with R? The only use case I could find is that you can interact with Unity Catalog volumes using personal clusters, but can't with odbc. Speed and functionality seem to be the same otherwise, although personal compute takes much longer to start up compared to a SQL warehouse.
>
> If that's the case (volumes are the only difference), we could just point people to that explicitly, saying "if you need to interact with Unity Catalog volumes, use sparklyr + personal compute".

Volumes is one use case. The other is that I don't think an analyst can create their own SQL warehouse (I'm not sure because I have DB admin permissions so I can, but I think originally this wasn't the case), whereas they can get up and running with a personal cluster immediately under their own steam.

Another is just that personal compute is more versatile, so if you do have a need to use Databricks in the platform (workflows for RAP, notebooks for documentation) you might as well use one more versatile compute resource rather than spinning up several different kinds, to keep costs to the Department down.

jen-machin commented 3 months ago

> Guidance in its current form doesn't work. The environment variable you need to define to actually connect (the HTTP path) isn't mentioned here, and passing the cluster ID to odbc::databricks() will result in an error.
>
> Similarly, it currently states that sparklyr is needed to use a personal cluster - this is incorrect. The exact same method works for both.
>
> I would recommend avoiding any mention of sparklyr on this page as it's a completely different method that warrants its own guidance. In addition, I'd want some more testing of installing sparklyr on 'fresh' laptops, as when I tried it, it took a month of faffing around in Python virtual environment hell on calls with Posit to get it working. This was when the library and its dependencies were newer, so this may have improved, but I would caution against casually throwing out "use sparklyr" as the solution for using personal clusters.
>
> I've done some adjustments in a new branch called databricks_new. This brings the guidance for SQL warehouses and personal clusters into a single page, and has been tested by myself and my GSS intern (who has a 'fresh' laptop), so we can be confident it's correct and working.

Modifying the connection code to this works for me:

library(odbc)

con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID"),
  workspace = Sys.getenv("DATABRICKS_HOST")
)

odbcListObjects(con)
dfe-nt commented 3 months ago

> > Guidance in its current form doesn't work. The environment variable you need to define to actually connect (the HTTP path) isn't mentioned here, and passing the cluster ID to odbc::databricks() will result in an error.
> >
> > Similarly, it currently states that sparklyr is needed to use a personal cluster - this is incorrect. The exact same method works for both.
> >
> > I would recommend avoiding any mention of sparklyr on this page as it's a completely different method that warrants its own guidance. In addition, I'd want some more testing of installing sparklyr on 'fresh' laptops, as when I tried it, it took a month of faffing around in Python virtual environment hell on calls with Posit to get it working. This was when the library and its dependencies were newer, so this may have improved, but I would caution against casually throwing out "use sparklyr" as the solution for using personal clusters.
> >
> > I've done some adjustments in a new branch called databricks_new. This brings the guidance for SQL warehouses and personal clusters into a single page, and has been tested by myself and my GSS intern (who has a 'fresh' laptop), so we can be confident it's correct and working.
>
> Modifying the connection code to this works for me:
>
> library(odbc)
>
> con <- DBI::dbConnect(
>   odbc::databricks(),
>   httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID"),
>   workspace = Sys.getenv("DATABRICKS_HOST")
> )
>
> odbcListObjects(con)

I actually prefer this to the alternative method of storing the HTTP path as the ID, because it's more granular and a bit more accurate as to what the variable actually is.

wklimowicz commented 3 months ago
> library(odbc)
>
> con <- DBI::dbConnect(
>   odbc::databricks(),
>   httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID"),
>   workspace = Sys.getenv("DATABRICKS_HOST")
> )
>
> odbcListObjects(con)

I'd just suggest excluding the workspace = Sys.getenv("DATABRICKS_HOST") argument because it's the default.
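
That is, roughly the following, keeping the variable name from the snippet above and letting workspace fall back to its Sys.getenv("DATABRICKS_HOST") default:

library(odbc)

con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID")
)

odbcListObjects(con)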

dfe-nt commented 3 months ago
> > library(odbc)
> >
> > con <- DBI::dbConnect(
> >   odbc::databricks(),
> >   httpPath = Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID"),
> >   workspace = Sys.getenv("DATABRICKS_HOST")
> > )
> >
> > odbcListObjects(con)
>
> I'd just suggest excluding the workspace = Sys.getenv("DATABRICKS_HOST") argument because it's the default.

This is what I tried originally and it didn't work, if we assume that DATABRICKS_SQL_WAREHOUSE_ID is actually the ID. In the updated guidance it does seem that it's actually the SQL warehouse path. I'm not sure that we should be calling it ID in that case.

Edit: My previous expression of preferring that route was based on the misunderstanding that it would work with just the ID and workspace. If you still need the path then I think a) we should call it the path, because it's confusing to name one thing after something else that is also a legitimate thing in its own right, and b) the workspace argument is redundant and therefore not required.

wklimowicz commented 3 months ago

Just saw your edit -- we agree on omitting workspace.

On DATABRICKS_SQL_WAREHOUSE_ID vs another name, I agree we should call it something else. This is my fault because initially I was trying to find standardised names for things, and I found this, which had the ID (the last bit of the httpPath) as DATABRICKS_SQL_WAREHOUSE_ID, which would imply having this:

con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = paste0("/sql/1.0/warehouses/", Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID"))
)

But then I decided it's much cleaner to have the entire httpPath as one variable, so something like DATABRICKS_SQL_WAREHOUSE_PATH?

jen-machin commented 3 months ago

> As an aside, do we understand why exactly someone would use a personal cluster over an SQL warehouse when connecting locally with R? The only use case I could find is that you can interact with Unity Catalog volumes using personal clusters, but can't with odbc. Speed and functionality seem to be the same otherwise, although personal compute takes much longer to start up compared to a SQL warehouse.

I think the SQL warehouse will be used by the vast majority of our analysts who already have existing analytical pipelines set up, since they'll just be using the new ODBC connection to Databricks to replace an existing connection to data stored in SQL Server currently. Am I right in assuming that eventually all data in SQL Server will be migrated to a SQL Warehouse in Databricks? I therefore wonder if the concept of connecting via a personal cluster should be kept on a separate page as it could quickly become confusing for most people.

If we do plan to keep it in the page then I think we need a very clear argument AND examples of the use case for each, e.g. "If your code currently looks like x, use SQL warehouse" "If you need to do y, use a personal cluster"

Also my interpretation of this whole thing might be completely wrong due to my limited understanding of data inside Databricks! If we anticipate that most of our analysts will in fact need a personal cluster connected to RStudio to access the data that they currently use, then we should definitely keep it on the same page, but make it very clear how users will know which one to use and when.

dfe-nt commented 3 months ago

> > As an aside, do we understand why exactly someone would use a personal cluster over an SQL warehouse when connecting locally with R? The only use case I could find is that you can interact with Unity Catalog volumes using personal clusters, but can't with odbc. Speed and functionality seem to be the same otherwise, although personal compute takes much longer to start up compared to a SQL warehouse.
>
> I think the SQL warehouse will be used by the vast majority of our analysts who already have existing analytical pipelines set up, since they'll just be using the new ODBC connection to Databricks to replace an existing connection to data stored in SQL Server currently. Am I right in assuming that eventually all data in SQL Server will be migrated to a SQL Warehouse in Databricks? I therefore wonder if the concept of connecting via a personal cluster should be kept on a separate page as it could quickly become confusing for most people.
>
> If we do plan to keep it in the page then I think we need a very clear argument AND examples of the use case for each, e.g. "If your code currently looks like x, use SQL warehouse" "If you need to do y, use a personal cluster"

No, the data won't be migrated to a SQL warehouse, it'll be migrated to the unity catalog. SQL warehouses are just a form of compute, they don't store anything at all.

I also think making that assumption risks forcing analysts down a specific path that is actually more limited in functionality, which I don't think is a good idea.
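
To illustrate the distinction (the catalog, schema and table names below are made up): the SQL warehouse is just the compute that runs the query, while the data itself is addressed through Unity Catalog's three-level catalog.schema.table namespace.

library(odbc)

# The SQL warehouse only executes the query...
con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_PATH")
)

# ...the data lives in Unity Catalog, addressed as catalog.schema.table
preview <- DBI::dbGetQuery(
  con,
  "SELECT * FROM my_catalog.my_schema.my_table LIMIT 10"
)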

dfe-nt commented 3 months ago

> Just saw your edit -- we agree on omitting workspace.
>
> On DATABRICKS_SQL_WAREHOUSE_ID vs another name, I agree we should call it something else. This is my fault because initially I was trying to find standardised names for things, and I found this, which had the ID (the last bit of the httpPath) as DATABRICKS_SQL_WAREHOUSE_ID, which would imply having this:
>
> con <- DBI::dbConnect(
>   odbc::databricks(),
>   httpPath = paste0("/sql/1.0/warehouses/", Sys.getenv("DATABRICKS_SQL_WAREHOUSE_ID"))
> )
>
> But then I decided it's much cleaner to have the entire httpPath as one variable, so something like DATABRICKS_SQL_WAREHOUSE_PATH?

Yeah, this is my .Renviron file currently:

DATABRICKS_HOST=https://adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_WAREHOUSE_ID=1d1a36e815c454d0
DATABRICKS_SQL_PATH=/sql/1.0/warehouses/1d1a36e815c454d0
DATABRICKS_CLUSTER_ID=0430-122304-ctqukox3
DATABRICKS_CLUSTER_PATH=sql/protocolv1/o/5037484389568426/0819-091935-ca2crtxj
DATABRICKS_TOKEN=
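
Given a .Renviron like that, the earlier point that the exact same ODBC method works for both types of compute comes down to swapping which path is passed in - a sketch, assuming the variables above:

library(odbc)

# Connect via the SQL warehouse...
con_warehouse <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_PATH")
)

# ...or via the personal cluster, using the same function and argument
con_cluster <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_CLUSTER_PATH")
)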

jen-machin commented 3 months ago

> > > As an aside, do we understand why exactly someone would use a personal cluster over an SQL warehouse when connecting locally with R? The only use case I could find is that you can interact with Unity Catalog volumes using personal clusters, but can't with odbc. Speed and functionality seem to be the same otherwise, although personal compute takes much longer to start up compared to a SQL warehouse.
> >
> > I think the SQL warehouse will be used by the vast majority of our analysts who already have existing analytical pipelines set up, since they'll just be using the new ODBC connection to Databricks to replace an existing connection to data stored in SQL Server currently. Am I right in assuming that eventually all data in SQL Server will be migrated to a SQL Warehouse in Databricks? I therefore wonder if the concept of connecting via a personal cluster should be kept on a separate page as it could quickly become confusing for most people.
> >
> > If we do plan to keep it in the page then I think we need a very clear argument AND examples of the use case for each, e.g. "If your code currently looks like x, use SQL warehouse" "If you need to do y, use a personal cluster"
>
> No, the data won't be migrated to a SQL warehouse, it'll be migrated to the unity catalog. SQL warehouses are just a form of compute, they don't store anything at all.

Why would you call something a warehouse if no data is stored in it 😭 Ok, in that case, this makes sense. So, users can access data inside the unity catalog but NOT in volumes inside the unity catalog if they use a SQL warehouse?

dfe-nt commented 3 months ago

> Why would you call something a warehouse if no data is stored in it 😭 Ok, in that case, this makes sense. So, users can access data inside the unity catalog but NOT in volumes inside the unity catalog if they use a SQL warehouse?

God only knows to be honest.

Yes, data in the unity catalog can be accessed with any form of compute. Data from volumes can be accessed using personal clusters (and possibly shared clusters but given they don't support R at the moment I'd just not make any reference to them).

wklimowicz commented 3 months ago

> Yeah, this is my .Renviron file currently:
>
> DATABRICKS_HOST=https://adb-5037484389568426.6.azuredatabricks.net
> DATABRICKS_SQL_WAREHOUSE_ID=1d1a36e815c454d0
> DATABRICKS_SQL_PATH=/sql/1.0/warehouses/1d1a36e815c454d0
> DATABRICKS_CLUSTER_ID=0430-122304-ctqukox3
> DATABRICKS_CLUSTER_PATH=sql/protocolv1/o/5037484389568426/0819-091935-ca2crtxj
> DATABRICKS_TOKEN=

Great, so we basically agree the ODBC part of the guidance would boil down to:

DATABRICKS_HOST=adb-5037484389568426.6.azuredatabricks.net
DATABRICKS_SQL_PATH=/sql/1.0/warehouses/abc123
DATABRICKS_TOKEN=...

and

con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = Sys.getenv("DATABRICKS_SQL_PATH")
)

And then a separate page on the extra steps to use a cluster (DATABRICKS_CLUSTER_ID + Python etc).

(My databricks host doesn't have https:// but it doesn't make a difference).

wklimowicz commented 3 months ago

> > Why would you call something a warehouse if no data is stored in it 😭 Ok, in that case, this makes sense. So, users can access data inside the unity catalog but NOT in volumes inside the unity catalog if they use a SQL warehouse?
>
> God only knows to be honest.
>
> Yes, data in the unity catalog can be accessed with any form of compute. Data from volumes can be accessed using personal clusters (and possibly shared clusters but given they don't support R at the moment I'd just not make any reference to them).

And to make this more confusing, volumes are also part of Unity Catalog, so it's Unity Catalog tables that can be accessed with any compute, and Unity Catalog volumes only with personal clusters.

jen-machin commented 2 months ago

Analysts Guide: Change Databricks/RStudio connection guidance