SprocketBot / datasets


Feature Request: Add dimensionality to queries #53

Open ItsMeBrianD opened 3 months ago

ItsMeBrianD commented 3 months ago

DuckDB has a concept of Hive Partitioning that we want to mimic for large datasets (e.g. player stats) to prevent massive queries from being run.
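
For reference, this is roughly what the DuckDB feature looks like on the read side (the stats/ paths and column names here are purely illustrative):

-- Hive-partitioned layout: the partition key is encoded in the directory name,
-- e.g. stats/season=1/data.parquet, stats/season=2/data.parquet
-- With hive_partitioning enabled, DuckDB exposes `season` as a queryable column
-- and only reads the files whose directory matches the filter.
SELECT *
FROM read_parquet('stats/*/data.parquet', hive_partitioning = true)
WHERE season = 2;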

This will come in two parts:

Defining Dimensions

I propose that we use frontmatter in the SQL file to define dimensionality, so that each dimension is defined by a query whose results are the partition values

Single Dimension Query

set.sql

---
    dimensions:
        season: SELECT someColumn as dimension FROM someTable
---
SELECT * 
FROM someTable
WHERE someColumn = ${season}

This would result in a folder structure 1 level deep:

set/
    season=1/
        data.parquet
        data.csv
        data.json
    season=2/
        data.parquet
        data.csv
        data.json
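
To make the substitution concrete: the season dimension query runs first, and the main query is rendered once per returned value. For season = 1 (assuming numeric values and simple string substitution) the rendered query would be roughly:

-- Rendered query for the season=1 partition; output is written to set/season=1/
SELECT *
FROM someTable
WHERE someColumn = 1;
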
Multi Dimension Query

set2.sql

---
    dimensions:
        season: SELECT someColumn as dimension FROM someTable
        match: SELECT someColumn as dimension FROM someMatchTable
---
SELECT * 
FROM someTable
WHERE someColumn = ${season} 
  AND someOtherColumn = ${match}

This would result in a folder structure 2 levels deep:

set2/
    season=1/
        match=1/
            data.parquet
            data.csv
            data.json
        match=2/
            data.parquet
            data.csv
            data.json
    season=2/
        match=1/
            data.parquet
            data.csv
            data.json
        match=2/
            data.parquet
            data.csv
            data.json
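
With two dimensions, the rendered queries cover the cross product of the dimension values, one query per (season, match) combination; e.g. for season = 2, match = 1 the rendered query would be roughly:

-- Rendered query for the season=2/match=1 partition; output is written to set2/season=2/match=1/
SELECT *
FROM someTable
WHERE someColumn = 2
  AND someOtherColumn = 1;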

Proper output

SQL Snippets

Currently our dataset pages have a SQL snippet that creates a view, plus download buttons for the files; the SQL snippet will need to be updated to load all of the partitioned files:

CREATE VIEW set AS (
    SELECT *
    FROM read_parquet([
        'https://example.com/data/set/season=1/data.parquet',
        'https://example.com/data/set/season=2/data.parquet',
        'https://example.com/data/set/season=3/data.parquet' -- etc...
    ], hive_partitioning = true)
);
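
Assuming the view above, a consumer can then filter on the partition column, and DuckDB should only fetch the files whose season= directory matches, since hive_partitioning exposes season as a column and prunes on it:

-- Only the season=2 parquet file should be read over HTTP
SELECT *
FROM set
WHERE season = 2;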

Batch Download

We should also consider creating a .tar.gz file for each format covering the entire set, mirroring the file structure. For example, the parquet download button would reference a set.parquet.tar.gz file with the contents:

season=1/
    data.parquet
season=2/
    data.parquet

This still allows users to download the entire dataset for local analysis.
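
As a sketch of the local workflow (assuming the archive is extracted into a set/ directory), the same partition column is available via a glob:

-- Query the extracted archive; the season=N directory names become a `season` column
SELECT season, count(*) AS row_count
FROM read_parquet('set/*/data.parquet', hive_partitioning = true)
GROUP BY season;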

Manifest

It may also be helpful to produce a manifest of URLs for a partitioned set, so that non-DuckDB programs can easily reference all of the files. The structure of this is TBD, and it is not required as part of the first version.

ItsMeBrianD commented 3 months ago

As another thought, it might make more sense to define dimensions in the sibling markdown file instead of trying to handle frontmatter in the SQL itself.

ItsMeBrianD commented 3 months ago

https://gist.github.com/ItsMeBrianD/6a0d9fbf42856a67958ae6e21b60224d

This file works as described above, and just needs to be split into Prefect-style tasks.

One challenge here is building the website at the end: if we want to address #45, then we need a way of defining which queries are "supposed" to exist, along with which queries "actually" exist (e.g. figuring out which pages to build while accounting for deleted / failed queries).

Not sure what the best approach to the above is, unless we want to have some sort of "broken" page that is shown for any queries that failed (?). That would just mean that we can build the navbar before firing off the queries, and then each query flow is responsible for building its own page (succeed or fail).