Data flow diagrams - Githubissues

fvankrieken commented 1 year ago

We want to look into generating data flow diagrams that could represent both the literal flow of data and the processing steps, to better map out our existing processes.

https://www.lucidchart.com/pages/data-flow-diagram

I had originally thought about doing this in something like LucidChart, but something programmatic would be nice, so that we don't have to duplicate a lot of the info already in the config templates in data-library. This package has some nice functionality:

https://diagrams.mingrammer.com/

Ideally, I'd like to be able to generate diagrams at multiple levels:

full data flow diagram mapping inputs (source data) to data products
option to show detailed metadata of source data
representation of processing steps - scripts in data library, preprocessing for data products, etc
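
As a rough illustration of generating such a diagram from config, here is a minimal sketch. It emits Graphviz DOT text using only the standard library instead of the diagrams package linked above, so it runs without extra dependencies. The `TEMPLATES` entries, their field names, and the `to_dot` function are all hypothetical and do not reflect the real data-library template schema.

```python
from typing import Dict, List

# Hypothetical, simplified stand-ins for data-library config templates.
# The real templates have a different and richer structure.
TEMPLATES: List[Dict] = [
    {"name": "dataset_a", "source": "socrata", "products": ["product_x"]},
    {
        "name": "dataset_b",
        "source": "url",
        "script": "clean_b",
        "products": ["product_x", "product_y"],
    },
]


def to_dot(templates: List[Dict], show_steps: bool = False) -> str:
    """Build Graphviz DOT text mapping source datasets to data products.

    With show_steps=True, a dataset's script (if any) is drawn as an
    intermediate processing node between the dataset and its products.
    """
    lines = ["digraph data_flow {", "  rankdir=LR;"]
    declared_products = set()
    for t in templates:
        name = t["name"]
        # "\\n" writes a literal backslash-n, which DOT renders as a line break.
        label = name + "\\n(" + t["source"] + ")"
        lines.append('  "%s" [shape=cylinder, label="%s"];' % (name, label))
        upstream = name
        if show_steps and t.get("script"):
            step = "script:" + t["script"]
            lines.append('  "%s" [shape=box];' % step)
            lines.append('  "%s" -> "%s";' % (name, step))
            upstream = step
        for product in t["products"]:
            if product not in declared_products:
                lines.append('  "%s" [shape=folder];' % product)
                declared_products.add(product)
            lines.append('  "%s" -> "%s";' % (upstream, product))
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot(TEMPLATES, show_steps=True))
```

The `show_steps` flag stands in for the "multiple levels" idea: the same config yields either the high-level source-to-product view or a more detailed view with processing steps. The printed text can be rendered with Graphviz's `dot` command if it is installed.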

fvankrieken commented 1 year ago

It might be worthwhile to make some very simple changes to the templates right now as well, for the sake of clarity.

Right now we have 3 possible sources for data in the templates:

url
script
socrata

However, I personally would break things down slightly differently.

First we have the actual source. These are, as far as I can tell:

web endpoint
s3 endpoint
local file
socrata

Data sources with "script" as the source still pull from some sort of file path, be it http, s3, or a local file, but that path is a bit hidden within the script as this process currently runs. So, just to break things out a bit more by category, I would like "source" to reflect the four categories above. That could mean either making each one a distinct option (rather than "url" covering all three path-based sources) or keeping them in a single category ("url", "path", or equivalent) with subfields denoting the type of path. "script" could then be broken out into another part of the template. With that change, we'd likely move from two steps (ingest and archive, with transformations happening for some datasets in the ingest step) to three: ingest, transform, archive (which maybe is just a different way of saying ETL...).
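
To make the proposal a bit more concrete, here is a minimal sketch of what a template with an explicit source type and a separate transform field might look like. All names here (`Source`, `Template`, `classify_path`, `steps`) are hypothetical and are not part of data-library; the sketch only illustrates the second option above, a single path-like source with a subfield for its type.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse

# The four source categories proposed above.
SOURCE_TYPES = ("web", "s3", "local", "socrata")


@dataclass
class Source:
    type: str  # one of SOURCE_TYPES
    path: str  # URL, s3 URI, local file path, or a Socrata identifier


@dataclass
class Template:
    name: str
    source: Source
    # Hypothetical: name of a transform script, kept apart from the source.
    transform: Optional[str] = None


def classify_path(path: str) -> str:
    """Infer the source type of a path-like source from its scheme.

    Socrata cannot be inferred this way, so it would be set explicitly.
    """
    scheme = urlparse(path).scheme.lower()
    if scheme in ("http", "https"):
        return "web"
    if scheme == "s3":
        return "s3"
    return "local"


def steps(template: Template) -> List[str]:
    """List the pipeline steps that apply to a template."""
    result = ["ingest"]
    if template.transform:
        result.append("transform")
    result.append("archive")
    return result


if __name__ == "__main__":
    path = "s3://some-bucket/raw/file.csv"  # placeholder path
    t = Template("example", Source(classify_path(path), path), transform="clean_example")
    print(t.source.type, steps(t))
```

With the source made explicit like this, a dataset that currently uses "script" would declare where its data actually comes from, and the script itself would only describe the transformation.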

@damonmcc @AmandaDoyle @mbh329

fvankrieken commented 1 year ago

Two edge cases for scripts, expanding on the above:

files that require custom http requests
grabbing from private s3

NYCPlanning / db-data-library

Data flow diagrams #382