Closed jikaczmarski closed 2 months ago
Perhaps we could keep data in SQLite .db format, then run queries to build CSV downloads.
@dayne do we point data downloads towards https://github.com/acep-uaf/ak-energy-statistics-2011_2021 ?
Converting this to an Epic because we need to think about how the data is organized as part of this. Scheduling a kick-off meeting for that discussion in early April.
From our conversation today @eldobbins, I started to explore data organizations. We talked about three directories, which I've created in /data/.
The data/ directory has been built out with three subdirectories: raw, working, and final. Within each of the three is a markdown file with a brief description of what should be there.
raw/ is for CSVs used to build the database. This directory could easily get extremely messy, so it's important to guard against chaos here. In the future, this can be a landing spot for a pipeline script or an import from the workbooks located in the other repo, ak-energy-statistics-2011_2021.
working/ contains the SQLite database, as well as the code to build it. If all goes well, this folder should be easy to keep clean. Either it's in the database or it should live somewhere else.
final/ contains scripts and files for public-facing products, such as CSV downloads for researchers. The scripts here will extract from the SQLite database and output CSVs. In the future, this will be triggered by updates to the database and run via an action.
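To make the raw/ to working/ to final/ flow concrete, here is a minimal sketch using only the Python standard library. It is not the code in the repo: the function names `build_database` and `export_tables` are hypothetical, and it assumes each CSV in raw/ has a header row and maps to one table named after the file. Columns are created without types, so values are stored as text.

```python
import csv
import sqlite3
from contextlib import closing
from pathlib import Path


def build_database(raw_dir: Path, db_path: Path) -> list[str]:
    """Load every CSV in raw_dir into a table named after the file stem."""
    tables = []
    with closing(sqlite3.connect(db_path)) as con:
        with con:  # one transaction: commit on success, roll back on error
            for csv_path in sorted(raw_dir.glob("*.csv")):
                table = csv_path.stem
                with csv_path.open(newline="") as f:
                    reader = csv.reader(f)
                    header = next(reader)  # assumption: first row is the header
                    cols = ", ".join(f'"{c}"' for c in header)
                    marks = ", ".join("?" for _ in header)
                    con.execute(f'DROP TABLE IF EXISTS "{table}"')
                    con.execute(f'CREATE TABLE "{table}" ({cols})')
                    con.executemany(
                        f'INSERT INTO "{table}" VALUES ({marks})', reader
                    )
                tables.append(table)
    return tables


def export_tables(db_path: Path, out_dir: Path) -> list[Path]:
    """Write one CSV per table in the database into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with closing(sqlite3.connect(db_path)) as con:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        for name in names:
            cur = con.execute(f'SELECT * FROM "{name}"')
            out_path = out_dir / f"{name}.csv"
            with out_path.open("w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow([d[0] for d in cur.description])  # column names
                writer.writerows(cur)
            written.append(out_path)
    return written
```

Something like `build_database(Path("data/raw"), Path("data/working/stats.db"))` followed by `export_tables(Path("data/working/stats.db"), Path("data/final"))` would cover both steps; the database filename here is a placeholder.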
At the moment, I have price tables and a few capacity tables in the database. The page prices.qmd is running on the database. capacity.qmd could follow suit, but will need a little tweaking for derived tables and the like. @jikaczmarski, we should chat about this soon.
I'm pivoting to think about code to generate CSV files from the database and make download links. You can see a window into the database on the new data page (live, but not linked in the sidebar, so not quite public).
None of this is permanent, and I'm really looking forward to more talk about organization and workflow.
I like this general structure. Could you have subdirectories in raw/ for generation, price, capacity?
It turns out .db and .zip files are both binary formats, so they are not ideal to host in a repo. There was talk of hosting the db on Google Drive, but we may run into permissions issues? The script that builds the database from raw files needs to have write permissions, while the scripts that render the webpage should not have write permissions, correct?
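On the read/write split: wherever the file ends up being hosted, SQLite itself can enforce read-only access per connection. A small sketch, assuming the page-rendering code is Python (it may well be R in the .qmd files, in which case this is only illustrative); `open_read_only` is a hypothetical helper name. This does not address Google Drive permissions, only what a script can do once it has the file.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt fails."""
    # mode=ro is a standard SQLite URI parameter; uri=True enables URI parsing.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With this, a `SELECT` works as usual, while an `INSERT` or `CREATE TABLE` on the returned connection raises `sqlite3.OperationalError`. The build script would keep using a normal `sqlite3.connect(db_path)`.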
It seems like a good idea to have an action watch the raw data directory and rebuild the database when changes are made. And if we're going to have a zip of all tables, we need that to rebuild upon changes to the database.
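One way to get "rebuild only when something changed" without committing to a particular trigger is to fingerprint the inputs and compare against the last build. This is a rough sketch, not a proposal for the actual action: `fingerprint`, `rebuild_zip_if_changed`, and the stamp file are all hypothetical, and it assumes the exported tables are CSV files in a single directory.

```python
import hashlib
import zipfile
from pathlib import Path


def fingerprint(directory: Path, pattern: str = "*.csv") -> str:
    """Hash the names and contents of matching files, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(directory.glob(pattern)):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def rebuild_zip_if_changed(csv_dir: Path, zip_path: Path, stamp_path: Path) -> bool:
    """Rewrite the zip of all CSVs only if they differ from the last build.

    Returns True if the zip was rebuilt, False if it was already current.
    """
    current = fingerprint(csv_dir)
    if (
        zip_path.exists()
        and stamp_path.exists()
        and stamp_path.read_text() == current
    ):
        return False
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(csv_dir.glob("*.csv")):
            zf.write(path, arcname=path.name)
    stamp_path.write_text(current)  # remember what this zip was built from
    return True
```

The same fingerprint check could gate the database rebuild from the raw data directory; an action with a path filter on that directory would be the other obvious option.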
It feels like we're slow walking towards a rudimentary pipeline with pub/sub actions and maybe an ephemeral VM for building out the database and zip. GCP rocks for this sort of stuff, but I need to upskill in order to set it up. I'd like to expand my skills in this direction anyways, so it might be the perfect time to learn? @jikaczmarski sounds interested too!
Potential new directory layout
working
by the GitHub Action in #31
There was a lot of discussion about this topic yesterday. Highlights include:
The data page now has table previews and CSV downloads for the 4 tables that we're currently using to generate the visuals.
@jikaczmarski @eldobbins We're at a stopping point on the data page. We could either close this issue or regroup and decide on changes/features (minus #39, adding a metadata parser and corresponding links).
Two more items to do:
Added consumption data to the data portal.
Modified the download buttons to display pretty names instead of file names.
Added second column to display "Download" as markdown alongside download button.
Data page is in fine shape for now. Closing this issue.
We would like to see a page where one could access the data right away.