evidence-dev / evidence

Business intelligence as code: build fast, interactive data visualizations in pure SQL and markdown
https://evidence.dev
MIT License
3.44k stars 167 forks source link

Add DuckDB as a database connection #435

Closed jwills closed 1 year ago

jwills commented 1 year ago

Fixes #434

changeset-bot[bot] commented 1 year ago

🦋 Changeset detected

Latest commit: a4371ed2237ae44cb6fc7d0e5c6cace905d4a317

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages | Name | Type | | ----------------------------- | ----- | | @evidence-dev/duckdb | Patch | | @evidence-dev/db-orchestrator | Patch | | @evidence-dev/evidence | Patch | | evidence-test-environment | Patch |

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

vercel[bot] commented 1 year ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
evidence-docs ✅ Ready (Inspect) Visit Preview Dec 7, 2022 at 4:28AM (UTC)
mcrascal commented 1 year ago

Thank you @jwills! We'll get this reviewed and shipped asap.

Stoked that this includes the connector form as well!🥇

jwills commented 1 year ago

Thank you @mcrascal -- apologies for the crudeness of the PR, I don't write much JS. Grateful that it was easy enough for me to cobble something together though!

jwills commented 1 year ago

Hey y'all, anything I can do to help move this along?

archiewood commented 1 year ago

Hey @jwills, we're testing this as we speak!

Re if you can help: We've just made some changes to our testing suite (since you made this PR). Would be great if you could pull in the latest master into your fork - we then should be able to trigger our usual tests.

There's a helpful article I used here for a recent contribution: https://stackoverflow.com/questions/7244321/how-do-i-update-or-sync-a-forked-repository-on-github

archiewood commented 1 year ago

Hey @jwills Just playing around with this now.

This is really, really cool. As a side benefit you can also load in csv and parquet files (which we've been keen to enable!)

Couple of details I've noticed:

  1. If you try to connect to a duckdb that doesn't exist, it will create a duckdb of that name. I think it should probably break in that situation and tell the user the file doesn't exist
  2. The error messages are not unpacked - I'm pushing a try-catch to remedy this image
  3. Certain errors cause evidence to crash. For example if I try to cast a column to an illegal type e.g. select CAST(my_column as not_a_real_type) from table then it will crash the node server. Not entirely sure if this is duckdb-async or duckdb or something else!
archiewood commented 1 year ago

Hey @jwills,

Keen to ship this as it's such a massive unlock for analytics users.

Will you have time to take a look at 1 and 3 above, or is there stuff you need from us to unblock you?

jwills commented 1 year ago

hey @archiewood I know it, I'm sorry-- it's just a hellish week for me @ work and with some other stuff, will get to it when I can but it may not be until next week. :/

jwills commented 1 year ago

I guess in terms of unlock, the thing in 3 where an unknown column type causes DuckDB to crash I couldn't repro easily in the unit tests (I just got Catalog errors about the missing type); do you have an easy way to trigger it? Thanks!

archiewood commented 1 year ago

Our developer experience isn't as clear as it could be currently, so bear with me if any of the below is super obvious to you:

  1. Running Evidence from source, to test changes

    • You have cloned a local copy of the monorepo
    • You have made your changes to source files, and are in the root of the monorepo.
    • Run:
      pnpm --filter ./sites/example-project package
      pnpm --filter ./packages/evidence build 
      pnpm install
      cd example-project # This is the testing environment
      pnpm run dev
    • This boots up a server, by default at localhost:3000
  2. Throwing this specific error

    • Navigate to http://localhost:3000/database-specific/bigquery-dates
    • NB this page contains SQL that is not legal DuckDB syntax. The idea here is that this should report an error in the query component, rather than crash the node server.
    • I believe the specific SQL that is causing errors is this query
      ```dates
      select 
       CAST('2022-07-21' AS DATE) as date,
       CAST('2022-07-21 11:21:24' AS DATETIME) as datetime,
       CAST('2022-07-21 11:21:24Z' AS TIMESTAMP) as timestamp,
       CAST('11:21:24' AS TIME) as time
    • this throws:

      undefined:0
      
      [Error: Data type is not supported TIME]
jwills commented 1 year ago

Hey, sorry I closed this somehow-- it's back now. So, here's what I found.

On the first thing, Antony pushed a new version of duckdb-async that includes the OPEN_READONLY flag, so we will now fail to connect unless a DuckDB file exists, instead of just creating a new one.

On the data type crashes, I traced that to the DuckDB Node wrapper here: https://github.com/duckdb/duckdb/blob/master/tools/nodejs/src/statement.cpp#L228

It seems like we should be throwing the error as a JavaScript exception, but as you observed, it ends up crashing the entire interpreter. It was pretty easy for me to write up a standalone script that can reproduce it, so I opened an issue with the upstream DuckDB team and we'll see if we can get it fixed: https://github.com/duckdb/duckdb/issues/5125

Whether or not you want to add in DuckDB now vs. waiting for this problem to be fixed in a follow-on release is completely up to y'all; it wouldn't shock me if there were other issues in DuckDB's NodeJS client given its relative immaturity compared to the Python/R clients, so it would be great to have the Evidence project as a testbed, but I understand that may not be ideal for you all right now.

jwills commented 1 year ago

Wrote a (possible) fix for upstream here: https://github.com/duckdb/duckdb/pull/5130

jwills commented 1 year ago

Okay, upstream fix is merged, so I think we're just in a holding pattern until the next DuckDB release?

jwills commented 1 year ago

Anything else we can fix in the meantime? I noticed all of the tests failed and I'm not sure why

archiewood commented 1 year ago

Looking at the test, it's the new DuckDB test that's failing:

packages/duckdb test: Catalog Error: Cannot open database "../../undefined" in read-only mode: database does not exist

jwills commented 1 year ago

Ah okay, I had to go hunting to find where I needed to set the DUCKDB_FILENAME env var for the github tests, giving it another go now.

archiewood commented 1 year ago

Ah this is an annoying CI thing I think. The test run off the base branch, not the fork, so the updates you make to the tests.yml file will not impact the test run on this PR.

I need to update the tests separately

archiewood commented 1 year ago

Nope that's not it either... Same error.

Clearly something's going wrong here:

const filename = database ? database.filename : process.env["DUCKDB_FILENAME"] || process.env["filename"] || process.env["FILENAME"]
const filepath = filename !== ":memory:" ? "../../" + filename : filename

What we want

  1. Finding there is no database
  2. Assigning the filename to the environment variable, DUCKDB_FILENAME (which ==":memory:")
  3. Therefore filepath is assigned as ":memory:"

But instead:

Which either means it is:

archiewood commented 1 year ago

Perhaps you need to pull in the latest master into your fork again? (specifically to pull in the env var change from the main branch?)

jwills commented 1 year ago

yay, success

archiewood commented 1 year ago

Just to keep everyone up to date, we've encountered a couple of issues with the Node.js duckDB connector we want to resolve before shipping this:

archiewood commented 1 year ago

Continuing to keep this up to date:

When they are, we're going to ship this.

archiewood commented 1 year ago

@jwills Binaries appear to be fixed. I think it might be merge time!

jwills commented 1 year ago

😍

Mause commented 1 year ago

We got there eventually 😅🦆

archiewood commented 1 year ago

oops, looks like this was missing a changeset for components