NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Data Library - Move "scripts" logic to processing steps #633

Closed fvankrieken closed 3 weeks ago

fvankrieken commented 3 months ago

To be picked up once #632 is complete. This has two main parts so far:

  1. abstract out any code in these scriptors that is generic and can be applied to multiple datasets
  2. add "custom" script processing steps for relevant datasets. Maybe one to start, if one "script" source dataset has already been chosen and prototyped in #499

sf-dcp commented 3 months ago

My notes after reviewing logic of scripts in the script directory:

After one of these steps, read the data into a pandas DataFrame and save it to a CSV locally (using Scriptor).
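A minimal sketch of that last "save to csv locally" step, using plain pandas only. The `save_locally` helper, its arguments, and the file layout are hypothetical and are not the Scriptor API:

```python
from pathlib import Path

import pandas as pd


def save_locally(df: pd.DataFrame, output_dir: str, name: str) -> Path:
    """Write `df` to <output_dir>/<name>.csv and return the path.

    Hypothetical helper for illustration; not part of Scriptor.
    """
    path = Path(output_dir) / f"{name}.csv"
    # Create the output folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the pandas row index out of the file.
    df.to_csv(path, index=False)
    return path
```

Whatever the upstream step is (download, scrape, API call), it only needs to end with a DataFrame for this part to stay the same.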

It does seem like we need to expand our current ingestion code, as most of these scripts have custom ingestion logic.

cc: @fvankrieken

fvankrieken commented 1 month ago

So of these, we can put things in a couple of buckets, the first two being upstream of the "processing steps" part of our process

Leaving ones that either

  1. might need an actual "script source"
    • scrape web data, use beautiful soup
    • query api dynamically
  2. have logic that should live in preprocessing steps.
    • filter columns or rows
    • append data to previous version
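For the second group, the preprocessing steps could look roughly like the sketch below. It assumes each step takes and returns a pandas DataFrame. The function names and signatures are hypothetical, and dropping exact duplicate rows when appending is an assumption, not something the existing scripts are known to do:

```python
import pandas as pd


def filter_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Keep only the named columns, in the order given."""
    return df[columns].copy()


def filter_rows(df: pd.DataFrame, column: str, allowed: list) -> pd.DataFrame:
    """Keep rows whose value in `column` is one of `allowed`."""
    return df[df[column].isin(allowed)].reset_index(drop=True)


def append_to_previous(previous: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Stack the new extract under the previous version.

    Assumption: rows that are exact duplicates across versions are dropped.
    """
    combined = pd.concat([previous, new], ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)
```

Because each function has the same DataFrame-in, DataFrame-out shape, they can be chained in whatever order a dataset needs.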

fisa_dailybudget has changed since you wrote this up; it now largely follows the same pattern as dob_cofos.

I'm going to add to this any datasets that specify sql in their gdal options in templates. But it seems like, to start, we're working with a pretty short list here.

fvankrieken commented 1 month ago

There are also some cases where we have more complex preprocessing that should still happen post-extraction but might be specific to a dataset (and need custom sql). dpr_capitalprojects is a good example. For now, that can just be another available function; we'll see how simple it is to break it up into abstracted bits.
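One way to picture "another available function" is a small name-to-function registry, sketched below. Everything here is hypothetical (the `PROCESSING_FUNCTIONS` dict, the `register` decorator, the step format, and both example steps). `example_custom_step` is only a stand-in and does not reflect the actual dpr_capitalprojects logic. Rows are plain dicts to keep the sketch dependency-free:

```python
from typing import Any, Callable

# Hypothetical registry: step name -> function implementing that step.
PROCESSING_FUNCTIONS: dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Decorator that makes a function available under `name`."""
    def decorator(func):
        PROCESSING_FUNCTIONS[name] = func
        return func
    return decorator


@register("drop_nulls")
def drop_nulls(rows: list[dict], column: str) -> list[dict]:
    """Generic step: drop rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is not None]


@register("example_custom_step")
def example_custom_step(rows: list[dict]) -> list[dict]:
    """Stand-in for dataset-specific logic: upper-case the `name` field."""
    return [{**r, "name": r["name"].upper()} for r in rows]


def run_steps(rows: list[dict], steps: list[dict]) -> list[dict]:
    """Apply each configured step in order, passing along its arguments."""
    for step in steps:
        if step["name"] not in PROCESSING_FUNCTIONS:
            raise ValueError(f"Unknown processing step: {step['name']}")
        rows = PROCESSING_FUNCTIONS[step["name"]](rows, **step.get("args", {}))
    return rows
```

With this shape, a dataset-specific function sits next to the generic ones, and breaking it into abstracted bits later only means replacing one registered name with several.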

fvankrieken commented 1 month ago

A checklist to keep track of which scripts have their functionality taken care of (at least in theory). First, the less relevant ones.

Then the ones that are more relevant to the code I'm touching

fvankrieken commented 1 month ago

Things not yet captured