Open ItsMeBrianD opened 3 months ago
As another thought, it might make more sense to define dimensions in the sibling markdown file instead of trying to handle frontmatter in the SQL itself
https://gist.github.com/ItsMeBrianD/6a0d9fbf42856a67958ae6e21b60224d
This file works as described above and just needs to be split into Prefect-style tasks.
One challenge here is building the website at the end: if we want to address #45, we need a way to define which queries are "supposed" to exist versus which queries "actually" exist (i.e. figuring out which pages to build while accounting for deleted / failed queries).
I'm not sure what the best approach to the above is, unless we want some sort of "broken" page that is served for any query that failed (?). That would mean we can build the navbar before firing off the queries, and each query flow is then responsible for building its own page (succeed or fail).
DuckDB has a concept of Hive Partitioning that we want to mimic for large datasets (e.g. player stats) to prevent massive queries from being run.
This will come in two parts:
Defining Dimensions
I propose that we use frontmatter in the SQL files to define dimensionality, so that each dimension is defined by a query whose results enumerate that dimension's values.
Single Dimension Query
set.sql
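As a rough sketch, the frontmatter for a single-dimension query might look something like the following (the comment-based frontmatter syntax, the `set_code` dimension, and the `sets`/`cards` tables are all illustrative assumptions, not taken from the gist):

```sql
-- ---
-- dimensions:
--   set_code: SELECT DISTINCT set_code FROM sets
-- ---
SELECT *
FROM cards
WHERE set_code = :set_code;
```

The dimension query runs first to enumerate partition values, and the main query is then executed once per value.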
This would result in a folder structure 1 level deep:
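Assuming Hive-style `key=value` directory names, the output for a single `set_code` dimension might look like this (set codes are illustrative):

```
set/
├── set_code=LEA/
│   └── data.parquet
└── set_code=LEB/
    └── data.parquet
```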
Multi Dimension Query
set2.sql
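Extending the same hypothetical frontmatter syntax to two dimensions (again, the dimension names and tables are illustrative assumptions):

```sql
-- ---
-- dimensions:
--   set_code: SELECT DISTINCT set_code FROM sets
--   rarity: SELECT DISTINCT rarity FROM cards
-- ---
SELECT *
FROM cards
WHERE set_code = :set_code
  AND rarity = :rarity;
```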
This would result in a folder structure 2 levels deep:
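With two dimensions, the partitions would nest, one directory level per dimension (values illustrative):

```
set2/
├── set_code=LEA/
│   ├── rarity=common/
│   │   └── data.parquet
│   └── rarity=rare/
│       └── data.parquet
└── set_code=LEB/
    └── rarity=common/
        └── data.parquet
```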
Proper output
SQL Snippets
Currently our dataset pages have a SQL snippet that creates a view, plus download buttons for the files; the SQL snippet will need to be updated to properly load all of the partitioned files:
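One possible shape for the updated snippet, using DuckDB's `read_parquet` with a glob pattern and `hive_partitioning` so the directory names come back as columns (the view name and URL are placeholders, not our real endpoints):

```sql
CREATE VIEW cards_by_set AS
SELECT *
FROM read_parquet(
  'https://example.com/data/set/*/data.parquet',
  hive_partitioning = true
);
```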
Batch Download
We should also consider creating a `.tar.gz` file for each format covering the entire set, reflecting the partitioned file structure; for example, the Parquet download button would reference a `set.parquet.tar.gz` file. This allows users to still download the entire dataset for local analysis.
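For example, a `set.parquet.tar.gz` archive might contain entries mirroring the partition layout (values illustrative):

```
set_code=LEA/data.parquet
set_code=LEB/data.parquet
```

Unpacking it locally would reproduce the same directory tree the hosted files use, so the same `read_parquet` snippet works against local paths.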
Manifest
It may also be helpful to produce a manifest of URLs for a partitioned set, so that non-DuckDB programs can easily reference all of the files; the structure of this is TBD, and it is not required as part of the first version.
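Since the structure is still TBD, purely as a strawman, one possible manifest shape (all names and URLs hypothetical):

```json
{
  "dataset": "set",
  "dimensions": ["set_code"],
  "files": {
    "parquet": [
      "https://example.com/data/set/set_code=LEA/data.parquet",
      "https://example.com/data/set/set_code=LEB/data.parquet"
    ]
  }
}
```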