Closed nickrobinson251 closed 3 years ago
Thanks for the feedback. When I added this flag I did not think much about usability, so this conversation is really useful for gaining insight.
The main purpose of this was to make local development/ETL execution easier by fetching a smaller subset of data to work with (e.g. only a specific year) without having to run the ETL first (which would otherwise do this download on its first pass).
As implemented, you can pass only a single value for each valid partition key, so you can't select more than one state or year. If you specify an unknown partition, the filtering logic simply excludes every resource file (nothing matches).
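To make the behaviour described above concrete, here is a minimal sketch of that kind of filtering. It is not the actual pudl code: `parse_partitions` and `matches` are hypothetical names, and the resource `parts` dictionaries are made-up examples.

```python
from typing import Dict, List


def parse_partitions(args: List[str]) -> Dict[str, str]:
    """Turn ["year=2018", "state=co"] into {"year": "2018", "state": "co"}."""
    filters: Dict[str, str] = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected KEY=VALUE, got {arg!r}")
        # One value per key: a repeated key overwrites the earlier value.
        filters[key] = value
    return filters


def matches(resource_parts: Dict[str, object], filters: Dict[str, str]) -> bool:
    """Keep a resource only if every requested key is present with an equal value."""
    return all(
        str(resource_parts.get(key, "")) == value
        for key, value in filters.items()
    )


# Hypothetical resource partitions, for illustration only.
resources = [{"year": 2018, "state": "co"}, {"year": 2019, "state": "co"}]
wanted = parse_partitions(["year=2018"])
selected = [parts for parts in resources if matches(parts, wanted)]
```

With an unknown key such as `bogus=1`, `matches` returns `False` for every resource, which is the "nothing matches" outcome mentioned above.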
While we could perhaps infer some of the valid partition keys from constants, the right way to do this would be to fetch datapackage.json files for all known datasets from Zenodo (either production or sandbox) and grab all keys from resources[].parts. Because this would depend on contacting Zenodo, it's probably not a good idea to add it to the default help screen.
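As a rough illustration of grabbing keys from resources[].parts, the sketch below works on an already-parsed datapackage.json dictionary, so it does not contact Zenodo. The function names and the sample descriptor are assumptions made for the example, not part of pudl.

```python
from collections import defaultdict
from typing import Any, Dict, List


def collect_partitions(datapackage: Dict[str, Any]) -> Dict[str, List[str]]:
    """Gather every key (and its observed values) from resources[].parts."""
    found = defaultdict(set)
    for resource in datapackage.get("resources", []):
        # Resources without a "parts" entry contribute nothing.
        for key, value in (resource.get("parts") or {}).items():
            found[key].add(str(value))
    return {key: sorted(values) for key, values in sorted(found.items())}


def format_partitions(partitions: Dict[str, List[str]]) -> List[str]:
    """Render KEY=VALUE strings that could be pasted after --partitions."""
    return [f"{key}={value}" for key, values in partitions.items() for value in values]


# Hand-written descriptor for illustration; real descriptors have more fields.
example = {
    "resources": [
        {"name": "a.zip", "parts": {"year": 2019}},
        {"name": "b.zip", "parts": {"year": 2018, "state": "co"}},
        {"name": "c.zip"},
    ]
}
lines = format_partitions(collect_partitions(example))
```

Emitting `KEY=VALUE` strings keeps the listing copy-pastable into a `--partitions` argument.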
What about listing this by running pudl_datastore --list-partitions?
> The main purpose of this was to make it easier to do local development/ETL execution
This is actually what I'm using it for :)
For context: I'd like to set up running the ETL on AWS, but I'm making sure I have things running as expected locally first. I'd been working with v3.2, but would probably like to use the newest release when it comes out, so I have switched to using main to make that later switch easier. I just happened to notice this small dip in usability when switching from v3.2 to main.
> What about listing this by running pudl_datastore --list-partitions?
Sure, that kinda interface would work nicely I think; it'd give us a way to see this info on the command line, without making pudl_datastore --help too verbose.
Interesting. I am curious to hear more about your use case, as I have been spending some time trying to deploy the ETL on Google Cloud. This currently consists of a couple of changes that:
It seems that there's quite an overlap between our intents. I have several bugs to track various aspects, but the best one for the prototype is probably https://github.com/catalyst-cooperative/pudl/issues/895
Adding --list-partitions should be pretty straightforward. I will try to get it out the door soon.
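For readers following along, a flag like this can be sketched with argparse as below. This only illustrates the interface discussed in this thread and is not the code that was merged. The flag names come from the conversation; `build_parser`, `run`, and the `known` mapping are hypothetical.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pudl_datastore")
    parser.add_argument(
        "--partitions",
        nargs="*",
        default=[],
        metavar="KEY=VALUE",
        help="Only retrieve resources matching these partitions.",
    )
    parser.add_argument(
        "--list-partitions",
        action="store_true",
        help="List the available partition values and exit.",
    )
    return parser


def run(argv: List[str], known: Dict[str, List[str]]) -> List[str]:
    """Return the lines to print; `known` stands in for a real partition lookup."""
    args = build_parser().parse_args(argv)
    if args.list_partitions:
        return [f"{key}={value}" for key, values in known.items() for value in values]
    return []


# Made-up partition values, for illustration only.
output = run(["--list-partitions"], {"year": ["2018", "2019"]})
```

Keeping the listing behind its own flag leaves the `--help` text short, which was the concern raised earlier in the thread.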
Resolved by https://github.com/catalyst-cooperative/pudl/pull/925 Thanks!
Is your feature request related to a problem? Please describe.
$ pudl_datastore --help lists a --partitions option but not how to use it, as it does not list the valid KEY=VALUE arguments (e.g. --partitions year=2018 and so on). This is using the current main branch (at https://github.com/catalyst-cooperative/pudl/tree/a2c1b996ea81015e586e392bb95609da76161cec).
Describe the solution you'd like
I think it'd be helpful to be able to list the valid partitions for a given dataset at the command line when using pudl_datastore.
pudl_datastore --help (although as valid partitions vary by dataset this could be quite verbose).
although I think it would be best if the printed info (year=...) was valid syntax for the command, so it can be copy-pasted into pudl_datastore --partitions <paste>
pudl_datastore direct users to that
pudl_datastore could print something along the lines of
Again, copy-pastable output (from a pudl_partitions kind of command) would be nice to have :)
Describe alternatives you've considered
A current "work around" is to open python, import pudl.constants as pc and inspect pc.working_partitions.
Additional context
The output of pudl_datastore --help (using main) says only that --partitions should be key-value pairs. e.g.
Comparing this to the output of pudl_data --help using pudl v3.2
We see pudl_data would list the valid -y (years) and -t (states) values.
p.s. pudl is great -- thanks for all your good work!