NYCPlanning / db-data-library

📚 Data Library
https://nycplanning.github.io/db-data-library/library/index.html
MIT License

Data flow diagrams #382

Closed fvankrieken closed 1 year ago

fvankrieken commented 1 year ago

We want to look into generating data flow diagrams, potentially representing both the literal flow of data and the processing steps, to better map out our existing processes.

https://www.lucidchart.com/pages/data-flow-diagram

I had originally thought about doing this simply in something like LucidChart, but something programmatic would be nice, so that we don't have to essentially duplicate a lot of the info already in the config templates in data-library. This package has some nice functionality:

https://diagrams.mingrammer.com/
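
To make the "programmatic" idea a bit more concrete, here is a minimal sketch of deriving a diagram from template-like config instead of drawing it by hand. It does not use the package linked above; it only emits Graphviz DOT text with the standard library. The template dict, its field names (`source`, `type`, `steps`) and both helper functions are hypothetical stand-ins, not the actual data-library template schema.

```python
from typing import Dict, List, Tuple

# Hypothetical, heavily simplified stand-in for a data-library template.
# The real templates are structured differently; these field names are assumptions.
EXAMPLE_TEMPLATE = {
    "name": "example_dataset",
    "source": {"type": "url"},
    "steps": ["ingest", "archive"],
}


def template_to_edges(template: Dict) -> List[Tuple[str, str]]:
    """Chain the source node and each processing step, in order."""
    nodes = [f"source ({template['source']['type']})"] + list(template["steps"])
    # Pair each node with the one that follows it.
    return list(zip(nodes, nodes[1:]))


def edges_to_dot(name: str, edges: List[Tuple[str, str]]) -> str:
    """Render the edges as a left-to-right Graphviz digraph."""
    lines = [f'digraph "{name}" {{', "    rankdir=LR;"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


dot = edges_to_dot(EXAMPLE_TEMPLATE["name"], template_to_edges(EXAMPLE_TEMPLATE))
print(dot)
```

The resulting text could be rendered with any Graphviz-compatible tool. The point is only that the diagram is generated from the same info that already lives in the templates, so nothing has to be duplicated.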

Ideally, I'd like to be able to generate diagrams at multiple levels

fvankrieken commented 1 year ago

It might be worthwhile to make some very simple changes to the templates right now as well, for the sake of clarity.

Right now we have 3 possible sources for data in the templates

However, I personally would break things down slightly differently.

First we have the actual source. These are, as far as I can tell

Data sources with "script" as the source are still pulling from some sort of filepath, be it http, s3, or a local file, but that is a bit hidden within the script as the process currently runs. So, just to break things out a bit more by category, I would like "source" to reflect the four categories above. That could mean either making each a distinct option rather than using "url" for all 3, or keeping all of them in the same category ("url", "path", or equivalent) with subfields denoting the type of path. Then maybe break out "script" into another part of the template. We'd then likely move from two steps (just ingest and archive, with transformations happening for some datasets in the ingest step) to 3: ingest, transform, archive (which maybe is just a different way of saying ETL...)
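
A rough sketch of what the "single category with subfields" option could look like, written as Python dataclasses rather than the actual template format. Every name here (`Source`, `path_type`, `Template`, `script`, `steps`) is hypothetical. The path types are just the three mentioned above (http, s3, local), not the full set of source categories. Treating "transform" as present only when a script is defined is one possible reading, not a decision.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: only the path types named in the comment above.
PATH_TYPES = {"http", "s3", "local"}


@dataclass
class Source:
    path: str
    path_type: str  # hypothetical subfield denoting the type of path

    def __post_init__(self) -> None:
        if self.path_type not in PATH_TYPES:
            raise ValueError(f"unknown path_type: {self.path_type!r}")


@dataclass
class Template:
    name: str
    source: Source
    # Hypothetical: the script is broken out of "source" into its own field.
    script: Optional[str] = None

    def steps(self) -> List[str]:
        """List the processing steps this template would run, in order."""
        steps = ["ingest"]
        if self.script is not None:
            # Transformations no longer hide inside the ingest step.
            steps.append("transform")
        steps.append("archive")
        return steps


plain = Template("plain_dataset", Source("s3://bucket/key.csv", "s3"))
scripted = Template(
    "scripted_dataset",
    Source("https://example.com/data.csv", "http"),
    script="scripted_dataset.py",
)
print(plain.steps())
print(scripted.steps())
```

With something like this, a scripted dataset still states where its data comes from, and the transform step shows up explicitly in the list of steps.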

@damonmcc @AmandaDoyle @mbh329

fvankrieken commented 1 year ago

Two edge cases for scripts, expanding on above