catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Easy way to discover valid `--partitions` when using `pudl_datastore` #915

nickrobinson251 closed 3 years ago

nickrobinson251 commented 3 years ago

Is your feature request related to a problem? Please describe.

pudl_datastore --help lists a --partitions option but does not explain how to use it: it does not list the valid KEY=VALUE arguments (e.g. --partitions year=2018). This is using the current main branch (at https://github.com/catalyst-cooperative/pudl/tree/a2c1b996ea81015e586e392bb95609da76161cec).

Describe the solution you'd like

I think it'd be helpful to be able to list the valid partitions for a given dataset at the command line when using pudl_datastore.

Describe alternatives you've considered

A current workaround is to open Python, run import pudl.constants as pc, and inspect pc.working_partitions.
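As a rough sketch of what that inspection could look like, the snippet below formats a mapping of partitions into readable lines. It assumes the mapping has the shape dataset -> partition key -> values. SAMPLE_PARTITIONS is an illustrative stand-in, not the real contents of pc.working_partitions, and summarize_partitions is a hypothetical helper.

```python
def summarize_partitions(working_partitions):
    """Return one readable line per (dataset, partition key) pair."""
    lines = []
    for dataset, parts in sorted(working_partitions.items()):
        for key, values in sorted(parts.items()):
            shown = ", ".join(str(value) for value in values)
            lines.append(f"{dataset}: {key} = {shown}")
    return lines


# Illustrative stand-in data (assumption): the real mapping may use
# different key names and will contain many more values.
SAMPLE_PARTITIONS = {
    "eia860": {"years": (2018, 2019)},
    "epacems": {"years": (2019,), "states": ("CO", "ID")},
}

if __name__ == "__main__":
    for line in summarize_partitions(SAMPLE_PARTITIONS):
        print(line)
```

With pudl installed, the same helper could presumably be pointed at pc.working_partitions instead of the sample data, provided it has the assumed shape.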

Additional context

The output of pudl_datastore --help (using main) says only that --partitions should be key-value pairs. e.g.

$ pudl_datastore --help                                                                                                                
usage: pudl_datastore [-h] [--dataset DATASET] [--pudl_in PUDL_IN] [--validate] [--sandbox] [--loglevel LOGLEVEL] [--quiet] [--populate-gcs-cache POPULATE_GCS_CACHE]
                      [--partition KEY=VALUE,...]

Download and cache ETL source data from Zenodo.

optional arguments:
  -h, --help            show this help message and exit
...
...
  --partition KEY=VALUE,...
                        Only retrieve resources matching these conditions.

Available Production Datasets:
    - censusdp1tract
...

Comparing this to the output of pudl_data --help using pudl v3.2

$ pudl_data --help
usage: pudl_data [-h] [-q] [-z] [-c] [-d DATASTORE_DIR]
                 [-s {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...]]
                 [-y YEARS [YEARS ...]] [--no_download]
                 [-t {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...]]

A CLI for fetching public utility data from reporting agency servers. 
...
...

optional arguments:
  -h, --help            show this help message and exit
...
...
  -s {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...], --sources {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...]
                        List of data sources which should be downloaded.
                        (default: ('eia860', 'eia861', 'eia923', 'epacems',
                        'epaipm', 'ferc1')).
  -y YEARS [YEARS ...], --years YEARS [YEARS ...]
                        List of years for which data should be downloaded.
                        Different data sources have differet valid years. If
                        data is not available for a specified year and data
                        source, it will be ignored. If no years are specified,
                        all available data will be downloaded for all
                        requested data sources.
...
...
  -t {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...], --states {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...]
                        List of two letter US state abbreviations indicating
                        which states data should be downloaded. Currently only
                        applicable to the EPA's CEMS dataset.

We see that pudl_data listed the valid -s (sources) and -t (states) values, and documented how -y (years) is handled.


p.s. pudl is great -- thanks for all your good work!

rousik commented 3 years ago

Thanks for the feedback. When I added this flag I did not think much about usability, so this conversation is a really useful source of insight.

The main purpose of this was to make it easier to do local development/ETL execution by fetching a smaller subset of data to work with (e.g. only a specific year) without having to run the ETL first (which would do this download on its first pass).

As it is implemented, you can only pass a single value for each valid partition key, so you can't really select more than one state or year. If you specify an unknown partition, the filtering logic will simply exclude every resource file (nothing matches).
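The behaviour described above can be illustrated with a minimal sketch. This is not the actual pudl filtering code. filter_resources and the sample descriptors are hypothetical, and the only assumption is that each resource carries a flat parts mapping.

```python
def filter_resources(resources, filters):
    """Keep resources whose parts match every KEY=VALUE filter exactly."""
    selected = []
    for resource in resources:
        parts = resource.get("parts", {})
        # Compare as strings because command-line values arrive as text.
        if all(
            key in parts and str(parts[key]) == str(value)
            for key, value in filters.items()
        ):
            selected.append(resource)
    return selected


# Hypothetical resource descriptors; names and partition values are made up.
RESOURCES = [
    {"name": "example-2018-co.zip", "parts": {"year": 2018, "state": "co"}},
    {"name": "example-2018-id.zip", "parts": {"year": 2018, "state": "id"}},
    {"name": "example-2019-co.zip", "parts": {"year": 2019, "state": "co"}},
]

if __name__ == "__main__":
    # One value per key: this selects both 2018 files.
    print([r["name"] for r in filter_resources(RESOURCES, {"year": "2018"})])
    # An unknown key matches nothing, so the result is empty.
    print(filter_resources(RESOURCES, {"quarter": "1"}))
```

Because each key maps to exactly one value, a request such as "2018 and 2019" cannot be expressed in this scheme, and a misspelled key silently yields an empty selection rather than an error.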

While we could perhaps infer some of the valid partition keys from constants, the right way to do this would be to fetch the datapackage.json files for all known datasets from Zenodo (either production or sandbox) and grab all keys from resources[].parts. Because this would depend on contacting Zenodo, it's probably not a good idea to add it to the default help screen.
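A minimal sketch of the "grab all keys from resources[].parts" step might look like the following. It works on an already-loaded descriptor and does not contact Zenodo. The sample JSON is invented for illustration, and collect_partitions is a hypothetical helper rather than part of pudl.

```python
import json
from collections import defaultdict

# Invented descriptor for illustration; a real datapackage.json has more fields.
SAMPLE_DATAPACKAGE = """
{
  "name": "example-dataset",
  "resources": [
    {"name": "example-2018-co.zip", "parts": {"year": 2018, "state": "co"}},
    {"name": "example-2019-id.zip", "parts": {"year": 2019, "state": "id"}},
    {"name": "example-notes.txt"}
  ]
}
"""


def collect_partitions(datapackage):
    """Map each partition key to the sorted values seen across resources."""
    found = defaultdict(set)
    for resource in datapackage.get("resources", []):
        # Resources without a "parts" entry contribute nothing.
        for key, value in resource.get("parts", {}).items():
            found[key].add(value)
    return {key: sorted(values, key=str) for key, values in sorted(found.items())}


if __name__ == "__main__":
    print(collect_partitions(json.loads(SAMPLE_DATAPACKAGE)))
```

Keeping the collection step separate from the download step means it can be exercised offline against a saved descriptor.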

What about listing this by running pudl_datastore --list-partitions?

nickrobinson251 commented 3 years ago

> The main purpose of this was to make it easier to do local development/ETL execution

This is actually what I'm using it for :)

For context: I'd like to set up running the ETL on AWS, but I'm making sure I have things running as expected locally first. I'd been working with v3.2, but would probably like to use the newest release when it comes out, so I have switched to using main to make that later switch easier. I just happened to notice this small dip in usability when switching from v3.2 to main.

> What about listing this by running pudl_datastore --list-partitions?

Sure, that kind of interface would work nicely I think; it'd give us a way to see this info on the command line, without making pudl_datastore --help too verbose.

rousik commented 3 years ago

Interesting. I am curious to hear more about your use case, as I have been spending some time trying to deploy the ETL on Google Cloud. This currently consists of a couple of changes:

  1. Using Prefect orchestration to manage task dependencies and allow parallel execution of the pipeline (e.g. deploying it on a Dask cluster)
  2. Building Docker containers with the code and all dependencies
  3. Adding support for remote storage (reading from GCS instead of Zenodo and writing ETL results to GCS instead of local disk)

It seems that there's quite an overlap between our intents. I have several bugs tracking various aspects, but the best one for the prototype is probably https://github.com/catalyst-cooperative/pudl/issues/895

rousik commented 3 years ago

Adding --list-partitions should be pretty straightforward. I will try to get it out of the door soon.
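For illustration only, here is one way such a flag could sit next to a KEY=VALUE option in argparse. This is a sketch and not the change that was eventually merged. The program name datastore_sketch, parse_key_values, and build_parser are all hypothetical.

```python
import argparse


def parse_key_values(text):
    """Turn 'year=2018,state=co' into {'year': '2018', 'state': 'co'}."""
    pairs = {}
    for item in text.split(","):
        key, sep, value = item.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {item!r}")
        pairs[key] = value
    return pairs


def build_parser():
    """Build a toy parser with a partition filter and a listing flag."""
    parser = argparse.ArgumentParser(prog="datastore_sketch")
    parser.add_argument(
        "--partition",
        metavar="KEY=VALUE,...",
        type=parse_key_values,
        default={},
        help="Only retrieve resources matching these conditions.",
    )
    parser.add_argument(
        "--list-partitions",
        action="store_true",
        help="List available partition keys and values, then exit.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--partition", "year=2018"])
    print(args.partition, args.list_partitions)
```

Keeping the listing behind its own flag leaves the default help screen short and avoids any network access unless the user asks for it.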

nickrobinson251 commented 3 years ago

Resolved by https://github.com/catalyst-cooperative/pudl/pull/925 Thanks!