catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Add kaggle update to release script #3182

Open bendnorman opened 6 months ago

bendnorman commented 6 months ago

Right now we are manually updating our Kaggle dataset. Ideally, we would use the Kaggle API to automatically update the Kaggle dataset whenever there is a new PUDL release. I took a stab at using the Kaggle API but ran into an issue.

Kaggle uses the datapackage schema to track metadata about datasets. I pulled the existing metadata for the PUDL dataset with this command:

kaggle datasets metadata -p . catalystcooperative/pudl-project

where the current directory contained all of the .parquet, .sqlite.gz and .json files of the nightly outputs.

Then I tried to create a new version with this command:

kaggle datasets version -p . -m "Update PUDL dataset to use nightly build outputs from 2023.12.20"

This uploaded all of the data but then failed with the error: `Dataset version creation error: Incompatible Dataset Type`

There might be a bug that prevents folks from updating manually created datasets via the Kaggle API. I was able to initialize and update a private Kaggle dataset containing the same PUDL output files using the CLI.
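For the record, that CLI flow looks roughly like the sketch below. The directory name and the eventual dataset slug are assumptions, not final choices:

```shell
# Sketch: create a fresh, CLI-managed Kaggle dataset from nightly outputs.
# DATA_DIR and the dataset slug are placeholders, not final choices.
DATA_DIR="./pudl_outputs"   # holds the .parquet, .sqlite.gz, and .json files

if command -v kaggle >/dev/null 2>&1; then
    # Write a template dataset-metadata.json into DATA_DIR, then edit it by
    # hand to set "id" (e.g. catalystcooperative/<new-slug>) and "title".
    kaggle datasets init -p "$DATA_DIR"
    # First upload; keep it private until the notebooks are switched over.
    kaggle datasets create -p "$DATA_DIR" --dir-mode zip
else
    echo "kaggle CLI not found; install it with: pip install kaggle"
fi
```

Subsequent updates then go through `kaggle datasets version -p "$DATA_DIR" -m "..."` as above.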

I propose we point our notebooks at a new Kaggle dataset that can be updated using the CLI.

- [ ] Create a `datapackage.json` for the Kaggle dataset
- [ ] Add logic to `gcp_pudl_etl.sh` that updates the Kaggle dataset on a tagged release
- [ ] Point our Kaggle notebooks at the new dataset and delete the old one
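The tag gate in `gcp_pudl_etl.sh` could look something like this sketch. `BUILD_REF` and `PUDL_OUTPUT` are placeholders for however the script learns the triggering git ref and the output directory, not variables it defines today:

```shell
# Sketch: only push a new Kaggle version for tagged releases (e.g. v2023.12.01).
# BUILD_REF and PUDL_OUTPUT are hypothetical placeholders.
BUILD_REF="${BUILD_REF:-dev}"
PUDL_OUTPUT="${PUDL_OUTPUT:-./pudl_outputs}"

if [[ "$BUILD_REF" =~ ^v20[0-9]{2}\. ]]; then
    echo "Tagged release $BUILD_REF: updating the Kaggle dataset."
    kaggle datasets version -p "$PUDL_OUTPUT" \
        -m "Update PUDL data to $BUILD_REF" --dir-mode zip
else
    echo "Ref $BUILD_REF is not a tagged release; skipping the Kaggle update."
fi
```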
zaneselvans commented 6 months ago

Datapackage annotation

When I first set up the Kaggle dataset, I ran some tests trying to use a `datapackage.json` to annotate the dataset and found the infrastructure to be non-functional. I posted several messages in their support forums.

Note that only `frictionless>=5` can annotate an SQLite DB.
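For reference, a minimal `datapackage.json` along the lines of the first checklist item might look like the sketch below; the resource name and file path are illustrative placeholders, not the actual PUDL outputs:

```shell
# Sketch: hand-write a minimal Frictionless datapackage.json for the dataset.
# The resource entry below is an illustrative placeholder.
cat > datapackage.json <<'EOF'
{
  "name": "pudl-project",
  "title": "The Public Utility Data Liberation Project (PUDL)",
  "licenses": [{"name": "MIT", "title": "MIT License"}],
  "resources": [
    {
      "name": "example-parquet-table",
      "path": "example_table.parquet",
      "format": "parquet"
    }
  ]
}
EOF
```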

Updating the dataset

Kaggle will automatically create a new version of the dataset on whatever schedule we want (daily, weekly, etc. -- I had it set to weekly updates), pulling new data from the URLs specified as the data sources. We only need to intervene when those URLs change, which will hopefully be pretty uncommon. We can decide to point the dataset at /nightly or maybe /stable. Obviously it would be better if it could automatically pick up changes in the URLs too, but this is already pretty good.