rosenbrockc opened this issue 8 years ago
I like the idea of running `dimensional info gov.us.fastfacts.crime` and getting a printout similar to what I was trying to show in the quick gist.
I agree. I also thought that it would be useful to know where the columns came from and how they were transformed. For example, if I have been using data scraped from an HTML page and it gets updated, I would like to be able to search on that URL to see if anyone has figured out the new formatting; even if your combined set produces just one cleaned column from the new format, I could grab that and use it with the rest of my dataset.
I also thought we should give a snappy name to the new combined datasets. Since the project is called `dimensional`, how about `dim`(s)? Then we could have a folder called `dims` which has a file for each contributed dataset (a `dim`), though we still need to figure out the file format; I would vote for JSON.
One thing I'd like to have in the delivered package is the actual ETL script. Even if it is mostly ignored, we'll always have a way to inspect where the data came from, what was done to it, and what we get back, just in case there are ever any issues.
As far as `dim` goes, I think we'd like to use that for the CLI: `dim info gov.us.fastfacts.crime`. I also like the idea of having a short/easy/consistent name for what is delivered. I've been calling them datasets, but a `dim`, a `dim` package, or a dimension could work really well.
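For concreteness, here is a minimal sketch of what that CLI shape could look like, assuming Python and `argparse` (the actual CLI library, subcommand names, and dispatch wiring are not settled in this thread):

```python
import argparse


def main(argv=None):
    # Hypothetical shape of the `dim` CLI; argparse and the placeholder
    # dispatch below are assumptions for illustration, not the real cli.py.
    parser = argparse.ArgumentParser(prog="dim", description="dimensional command line")
    sub = parser.add_subparsers(dest="command", required=True)

    info = sub.add_parser("info", help="show metadata for a recipe")
    info.add_argument("name", help="dotted recipe name, e.g. gov.us.fastfacts.crime")

    search = sub.add_parser("search", help="search the available recipes")
    search.add_argument("pattern", help="fnmatch-style pattern to match recipe names")

    args = parser.parse_args(argv)
    print(args.command, vars(args))  # placeholder until real handlers exist


if __name__ == "__main__":
    main()
```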
These questions are probably worth clearing up right from the start. When the script is supposed to search the available combinations of datasets, where does it look? The answer will define how we organize the recipes for combining column-level ETLs in the repo.
I like the way that Homebrew and MELPA keep the brews/recipes as separate files; @davidrichards mentioned this in Slack as well. So `search` would just scan the list of files in a folder (called `dims`?) and return those that match an `fnmatch` of the search string the user gave. This can be implemented in 5 lines easily. Speaking of which, do we want to separate the file management/searching etc. into its own module in the package, or just keep it in the `cli.py`?
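A rough sketch of that search, assuming the recipes live as individual files in a `dims` folder (the folder name is still an open question above) and that names are matched with `fnmatch`:

```python
import fnmatch
import os


def search(pattern, folder="dims"):
    """Return recipe names in `folder` whose base filenames match `pattern`."""
    # Strip whatever extension the recipe files end up using; only the name matters here.
    names = (os.path.splitext(f)[0] for f in os.listdir(folder))
    return sorted(n for n in names if fnmatch.fnmatch(n, pattern))
```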
For `info`, we would grab details from the specific recipe file, but we need to decide what metadata to include in that file. Using a version of @jpotts18's `Dataset`, we could extract most of the relevant information. Should we just serialize the fields that contribute to obvious metadata into a JSON string and store it in the recipe file? Then we could grab it off the first line after a comment char and return a pretty-printed version of it to the terminal.