cldellow / datasette-parquet

Add DuckDB, Parquet, CSV and JSON lines support to Datasette
Apache License 2.0

datasette-parquet


Support DuckDB, Parquet, CSV and JSON Lines files in Datasette. Depends on DuckDB.

There is a demo at https://dux.fly.dev/parquet

Compare a query using Parquet on DuckDB vs the same query on SQLite. The DuckDB query is roughly 3-5x faster. On a machine with more than one core, DuckDB should outperform SQLite by an even wider margin.

Installation

Install this plugin in the same environment as Datasette.

datasette install datasette-parquet

Usage

You can use this plugin to access a DuckDB file, or a directory of CSV/Parquet/JSON files.

DuckDB file

To mount the /data/mydb.duckdb file as a database called mydb, create a metadata.json like:

{
  "plugins": {
    "datasette-parquet": {
      "mydb": {
        "file": "/data/mydb.duckdb"
      }
    }
  }
}
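If you would rather generate this file than write it by hand, here is a minimal Python sketch. The function name duckdb_file_metadata is made up for illustration and is not part of the plugin; it only reproduces the JSON structure shown above.

```python
import json


def duckdb_file_metadata(db_name: str, duckdb_path: str) -> dict:
    """Build the metadata structure that mounts one DuckDB file.

    Hypothetical helper: it mirrors the JSON layout from the README
    and does not call into datasette-parquet itself.
    """
    return {
        "plugins": {
            "datasette-parquet": {
                db_name: {"file": duckdb_path},
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON so it can be redirected into metadata.json.
    print(json.dumps(duckdb_file_metadata("mydb", "/data/mydb.duckdb"), indent=2))
```

Redirect the script's output into metadata.json, then pass that file to Datasette with --metadata.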

Directory of CSV/Parquet/JSON files

Say you have a directory of your favourite CSV, newline-delimited JSON and Parquet files that looks like this:

/data/census.csv
/data/books.tsv
/data/tweets.jsonl
/data/geonames.parquet
/data/sales/january.parquet
/data/sales/february.parquet

You can expose these in a Datasette database called trove by adding something like this to your metadata.json:

{
  "plugins": {
    "datasette-parquet": {
      "trove": {
        "directory": "/data",
        "watch": true
      }
    }
  }
}

Then launch Datasette via datasette --metadata metadata.json

You will have 5 views in the trove database: census, books, tweets, geonames and sales. The sales view will be the union of all the files in that directory -- this works for all of the file types, not just Parquet.
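To make the mapping from files to views concrete, here is a rough Python sketch of the naming rule described above: a top-level file becomes a view named after the file minus its extension, and a subdirectory becomes a single view covering every file inside it. This is an illustration of the rule, not the plugin's actual code, and group_files_by_view is a hypothetical name.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def group_files_by_view(root: str, files: List[str]) -> Dict[str, List[str]]:
    """Group file paths under the view name each would be exposed as.

    Assumed rule, taken from the README text: top-level files map to
    their stem; files in a subdirectory map to the subdirectory name.
    """
    views: Dict[str, List[str]] = {}
    for path in files:
        relative = PurePosixPath(path).relative_to(root)
        if len(relative.parts) == 1:
            name = relative.stem  # /data/census.csv -> census
        else:
            name = relative.parts[0]  # /data/sales/january.parquet -> sales
        views.setdefault(name, []).append(path)
    return views


if __name__ == "__main__":
    listing = [
        "/data/census.csv",
        "/data/books.tsv",
        "/data/tweets.jsonl",
        "/data/geonames.parquet",
        "/data/sales/january.parquet",
        "/data/sales/february.parquet",
    ]
    for view, members in group_files_by_view("/data", listing).items():
        print(view, members)
```

For the example directory this yields five groups, with both sales files listed under sales.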

Because you passed the watch option with a value of true, Datasette will automatically discover when files are added or removed, and create or remove views as needed.
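Conceptually, reacting to added and removed files comes down to comparing the set of views that exist with the set the directory currently implies. The sketch below shows that comparison in Python. How the plugin actually detects changes is not described here, so treat plan_view_changes as a hypothetical illustration only.

```python
from typing import List, Set, Tuple


def plan_view_changes(
    existing: Set[str], discovered: Set[str]
) -> Tuple[List[str], List[str]]:
    """Return (views to create, views to remove), each sorted by name.

    Hypothetical helper: `existing` is the set of views already defined,
    `discovered` is the set implied by the files currently on disk.
    """
    to_create = sorted(discovered - existing)
    to_remove = sorted(existing - discovered)
    return to_create, to_remove


if __name__ == "__main__":
    before = {"census", "books", "sales"}
    after = {"census", "sales", "tweets"}  # books.tsv deleted, tweets.jsonl added
    print(plan_view_changes(before, after))
```

Views present in both sets are left alone, so only genuinely new or missing files cause any work.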

Common options

These options can be used in either mode.

httpfs - set to true to enable the HTTPFS extension
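As a sketch, here is one way to switch the option on in an existing metadata structure from Python. It assumes httpfs sits next to file or directory inside the per-database block; the text above does not spell out the placement, so check it against your setup. enable_httpfs is a made-up helper name.

```python
import copy
import json


def enable_httpfs(metadata: dict, db_name: str) -> dict:
    """Return a copy of `metadata` with httpfs enabled for one database.

    Assumption: the option lives inside the database's own block,
    alongside "file" or "directory".
    """
    updated = copy.deepcopy(metadata)  # leave the caller's dict untouched
    updated["plugins"]["datasette-parquet"][db_name]["httpfs"] = True
    return updated


if __name__ == "__main__":
    base = {
        "plugins": {
            "datasette-parquet": {
                "trove": {"directory": "/data", "watch": True},
            }
        }
    }
    print(json.dumps(enable_httpfs(base, "trove"), indent=2))
```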

Caveats

Warning

You know that old canard, that if it walks like a duck and quacks like a duck, it's probably a duck? This plugin tries to teach DuckDB to walk like SQLite and talk like SQLite. It's difficult, and frankly, I just winged this part. If you come across broken features, let me know and I'll try to fix them up.

Technical notes

This plugin has a mix of accidental complexity and essential complexity. The essential complexity comes from things like "DuckDB supports a different dialect of SQL". The accidental complexity comes from things like "it's called the Law of Demeter, Colin, not the Strongly Held Opinion of Demeter".

This is a loose journal of things I ran into:

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment:

cd datasette-parquet
python3 -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest