databrickslabs / brickster

R Toolkit for Databricks
https://databrickslabs.github.io/brickster/
Apache License 2.0
43 stars 7 forks source link

Feature: Adding {DBI} & {dbplyr} backend for SQL #47

Open zacdav-db opened 4 months ago

zacdav-db commented 4 months ago

Currently {brickster} has two ways to query a SQL warehouse:

  1. databricks-sql-connector via {reticulate}
  2. SQL execution API

A {DBI} and {dbplyr} backend could be created for each, this would enable querying SQL warehouses without ODBC driver installation.

I've done a small proof of concept and this works extremely well, few bits to iron out. Unsure if worth splitting to another package 🤔.

zacdav-db commented 3 months ago
ndiquattro commented 1 month ago

A DBI/dbplyr interface to the sql execution api is very interesting to me! I'm currently using ODBC locally and then connecting via sparklyr when working in a notebook (via an internal R package). Most of the time the same code works, but there are some quirks where I need to disbatch to methods based on connection type. It sounds like the SQL api would be a unifying method and easier for the co-workers to setup (Just need the PAT, I'd think?)

I was looking for your work on it, but couldn't find it the branches. Is it some you're still considering? Thanks!

zacdav-db commented 1 month ago

Most of the time the same code works, but there are some quirks where I need to disbatch to methods based on connection type

@ndiquattro I'd love to know more about the methods that are causing some friction so I can create test cases.

Just need the PAT, I'd think? Ideally, not even PAT, OAuth would be first class citizen ideally.

There isn't a branch yet, the test I'd done was in a seperate project. I hope to get around to this shortly after a few more changes to package.

ndiquattro commented 1 month ago

The largest one has been around DBI::dbWriteTable() which had three issues:

  1. When connecting with ODBC, I needed to connect with useNativeQuery = FALSE, in order for writing to work. I end up making a temporary connection within my helper function to do this while keeping it TRUE elsewhere.
  2. ODBC and sparklyr had different conventions for the target table's name. ODBC wanting it wrapped in Id and full name, whereas sparklyr just wanting a character value consiting of just schema.name
  3. With append = TRUE, the ODBC connection will create the table if it doesn't exist, but sparklyr will error.

Trying to fix issue 3 led me to differences with DBI::dbListTables()

  1. ODBC lets you do a search pattern for the table name in dbListTables(), where sparklyr is templating a SHOW TABLES query, that doesn't offer that ability.

Thanks for your work on this package!