bendnorman opened 6 months ago
When I first set up the Kaggle dataset, I ran some tests trying to use a datapackage.json to annotate the dataset and found the infrastructure to be non-functional. I posted several messages in their support forums:
Note that only `frictionless>=5` can annotate an SQLite DB.
Kaggle will automatically create a new version of the dataset on whatever schedule we want (daily, weekly, etc. -- I had it set to weekly updates) and it will pull new data from the URLs that are specified as the data sources. We only need to intervene when those URLs change, which will hopefully be pretty uncommon. We can decide to point the dataset at `/nightly` or maybe `/stable`. Obviously it would be better if we could have it automatically pick up changes in the URLs too! But this is already pretty good.
Right now we are manually updating our Kaggle dataset. Ideally, we would use the Kaggle API to automatically update the Kaggle dataset whenever there is a new version of the data. I took a stab at using the Kaggle API but ran into an issue.
Kaggle uses the datapackage schema to track metadata about datasets. I pulled the existing metadata for the PUDL dataset with this command:
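The exact command isn't captured in the issue text above; assuming the standard Kaggle CLI, pulling a dataset's metadata looks something like this (the `catalystcooperative/pudl-project` slug is an assumption):

```shell
# Download the dataset-metadata.json for an existing Kaggle dataset
# into the current directory (-p sets the destination path).
kaggle datasets metadata -p . catalystcooperative/pudl-project
```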
where the current directory contained all of the `.parquet`, `.sqlite.gz`, and `.json` files of the nightly outputs. Then I tried to create a new version with this command:
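The command itself is missing from the issue text; assuming the standard Kaggle CLI, creating a new version from a local directory looks like this (the version note is a placeholder):

```shell
# Upload the contents of the current directory as a new dataset version;
# -m sets the version notes shown in the dataset's version history.
kaggle datasets version -p . -m "Update to latest nightly outputs"
```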
That command uploaded all of the data but then failed with this error:
```
Dataset version creation error: Incompatible Dataset Type
```
There might be a bug that prevents folks from updating manually created datasets using the Kaggle API. I was able to initialize and update a private Kaggle dataset with the same PUDL output files using the CLI.
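For reference, the CLI flow for initializing and updating a private dataset presumably looked something like the following sketch (the version note and directory layout are assumptions):

```shell
# Scaffold a dataset-metadata.json in the current directory.
kaggle datasets init -p .

# After editing dataset-metadata.json to set the title and id,
# create the dataset (datasets are private by default).
kaggle datasets create -p .

# Later, push a new version of the same files.
kaggle datasets version -p . -m "Refresh from nightly build outputs"
```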
I propose we point our notebooks at a new Kaggle dataset that can be updated using the CLI.