fvankrieken closed 1 year ago
It might be worthwhile to make some very simple changes to the templates right now as well, for the sake of clarity.
Right now we have 3 possible sources for data in the templates
However, I personally would break things down slightly differently.
First we have the actual source. These are, as far as I can tell
Data sources with "script" as the source are still pulling from some sort of filepath (http, s3, or a local file), but that path is a bit hidden within the script as this process currently runs. To break things out more by category, I would like "source" to reflect the four categories above. That could mean either making each a distinct option (rather than "url" for all 3), or keeping all of them in the same category ("url", "path", or equivalent) with subfields denoting the type of path. "script" could then be broken out into another part of the template. With that change, we'd likely move from two steps (ingest and archive, with transformations happening for some datasets in the ingest step) to three: ingest, transform, archive (which is maybe just a different way of saying ETL...).
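To make the proposal a bit more concrete, here is a rough Python sketch of the "single category with a type subfield" option. Everything in it is hypothetical: the field names (`path`, `type`, `transform`), the function names, and the example paths are made up for illustration and are not the current data-library template schema. It only covers the three path types named above (http, s3, local file).

```python
from urllib.parse import urlparse


def classify_path(path: str) -> str:
    """Return a hypothetical path type: 'http', 's3', or 'local'."""
    scheme = urlparse(path).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme == "s3":
        return "s3"
    # Anything without a recognized scheme is treated as a local file.
    return "local"


def build_source(path: str) -> dict:
    """Build a hypothetical 'source' entry with a type subfield."""
    return {"path": path, "type": classify_path(path)}


def plan_steps(template: dict) -> list:
    """List the steps a template would run under the 3-step proposal.

    'transform' is only included when the template declares one,
    so datasets without a script still run ingest -> archive.
    """
    steps = ["ingest"]
    if template.get("transform"):
        steps.append("transform")
    steps.append("archive")
    return steps


# Hypothetical template: the script lives outside of "source".
example_template = {
    "source": build_source("s3://example-bucket/raw/example.csv"),
    "transform": {"script": "example_script.py"},
}
```

With this shape, a dataset that currently uses "script" as its source would still declare where its data comes from, and the script would show up as an explicit transform step.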
@damonmcc @AmandaDoyle @mbh329
Two edge cases for scripts, expanding on above
We want to look into generating data flow diagrams that potentially represent both the literal flow of data and the processing steps, to better map out our existing processes.
https://www.lucidchart.com/pages/data-flow-diagram
I had originally thought about doing this in something like LucidChart, but a programmatic approach would be nice so that we don't have to essentially duplicate a lot of the info already in the config templates in data-library. This package has some nice functionality:
https://diagrams.mingrammer.com/
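As a rough illustration of the "programmatic" idea, here is a small sketch that turns a mapping of datasets to processing steps into Graphviz DOT text using only the Python standard library. It does not use the diagrams package linked above; the function name `to_dot`, the dataset name, and the step names are all hypothetical, and in practice the mapping would be derived from the config templates rather than written by hand.

```python
def to_dot(datasets: dict) -> str:
    """Render {dataset_name: [step, ...]} as a Graphviz DOT digraph.

    Each dataset becomes a left-to-right chain:
    dataset -> dataset:step1 -> dataset:step2 -> ...
    """
    lines = ["digraph data_flow {", "  rankdir=LR;"]
    for dataset, steps in datasets.items():
        # Prefix step nodes with the dataset name so chains don't merge.
        nodes = [dataset] + [f"{dataset}:{step}" for step in steps]
        for src, dst in zip(nodes, nodes[1:]):
            lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical input; real values would come from the templates.
dot_text = to_dot({"example_dataset": ["ingest", "transform", "archive"]})
```

The resulting text can be saved to a `.dot` file and rendered with Graphviz. Generating the diagram from the same data as the templates is what avoids maintaining the information twice.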
Ideally, I'd like to be able to generate diagrams at multiple levels