Data Library - Move "scripts" logic to processing steps

fvankrieken commented 3 months ago

To be done once #632 is complete. This has two main parts so far:

- abstract out any code in these scriptors that is generic and can be applied to multiple datasets
- add "custom" script processing steps for relevant datasets. Maybe 1 to start, if 1 "script" source dataset has already been chosen and prototyped in #499

sf-dcp commented 3 months ago

My notes after reviewing logic of scripts in the script directory:

- read in csv with custom encoding or delimiter
- read in excel with specific sheet name
- specify needed columns or filter rows
- extract zip file from API and get a file with a specific name
  - Ex: dcp_pad, dcp_facilities_with_unmapped
- append new data to old version of the data
  - Ex: dob_cofos
- extract json data from API, process lat and lon from existing fields
  - Ex: bpl_libraries
- extract data from our own bucket like edm-private
  - Ex: dob_now_applications, dob_now_permits
- scrape data from a website using requests & Beautiful Soup
  - Ex: doe_pepmeetingurls, moeo_socialservicesitelocations
- read in csv with no headers and create your own
  - Ex: fisa_capital_commitments
- read files from a directory with a specific pattern in a file name and concatenate them
  - Ex: fisa_dailybudget
- query API dynamically (with different link pattern) to get multiple datasets into one
  - Ex: hra_centers

After one of these steps, each script reads the data into a pandas DataFrame and saves it to csv locally (using Scriptor).
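To make a couple of the read-in patterns above concrete (custom encoding or delimiter, and a csv with no headers), here is a minimal sketch using only the standard library. The function name `read_delimited`, the sample bytes, and the column names are all made up for illustration; the actual scripts use pandas and Scriptor, so treat this as a rough equivalent rather than the existing code.

```python
from __future__ import annotations

import csv
import io


def read_delimited(
    raw: bytes,
    encoding: str = "utf-8",
    delimiter: str = ",",
    column_names: list[str] | None = None,
) -> list[dict[str, str]]:
    """Decode raw bytes and parse them as delimited text.

    If column_names is given, the file is assumed to have no header row
    and the supplied names are used instead.
    """
    text = raw.decode(encoding)
    reader = csv.DictReader(
        io.StringIO(text), fieldnames=column_names, delimiter=delimiter
    )
    return list(reader)


# Hypothetical sample: pipe-delimited, latin-1 encoded, no header row.
sample = "001|Café Project|1500\n002|Library Roof|2300\n".encode("latin-1")
rows = read_delimited(
    sample,
    encoding="latin-1",
    delimiter="|",
    column_names=["project_id", "description", "amount"],
)
```

Every value comes back as a string here, so a leading zero in `project_id` survives the read.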

It does seem like we need to expand our current ingestion code, as most of these scripts have custom ingestion logic.

cc: @fvankrieken

fvankrieken commented 1 month ago

Of these, we can put things into a couple of buckets. The first two are upstream of the "processing steps" part of our process:

- extract logic that seems largely accounted for
  - pull from edm-private
- custom reading in of data (now handled in transformation to parquet)
  - csv with custom encoding/delimiter
  - sheet name of excel file
  - zipped file
  - extract json data from api, process lat/long. This one is maybe a bit of a question, but I still think that by adding (1) a jq querystring subfield and (2) info about the geom column, this can be handled in the to_parquet part
  - specify headers manually
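A rough sketch of the json case, to show why a record selector plus geometry-column info might be enough. This is not the to_parquet code: a dotted key path stands in for the jq querystring, and the function names, payload shape, and field names (`lng`, `lat`) are assumptions for illustration only.

```python
import json


def select_records(payload, path):
    """Walk a dotted key path to the list of records.

    This is a simplified stand-in for a jq query such as `.locations.items`.
    """
    node = payload
    for key in path.split("."):
        node = node[key]
    return node


def add_point_geometry(records, lon_field, lat_field, geom_column="geom"):
    """Add a WKT point column built from existing longitude/latitude fields."""
    out = []
    for rec in records:
        row = dict(rec)  # copy so the input records are not mutated
        lon, lat = float(rec[lon_field]), float(rec[lat_field])
        row[geom_column] = f"POINT ({lon} {lat})"
        out.append(row)
    return out


# Hypothetical API response; a real source will have a different shape.
raw = json.dumps(
    {
        "locations": {
            "items": [
                {"name": "Central Library", "lng": "-73.9682", "lat": "40.6724"}
            ]
        }
    }
)
records = select_records(json.loads(raw), "locations.items")
with_geom = add_point_geometry(records, lon_field="lng", lat_field="lat")
```

With both pieces living in the dataset's config, the dataset would not need its own script for this step.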

That leaves ones that either

- might need an actual "script source"
  - scrape web data, use beautiful soup
  - query api dynamically
- have logic that should live in preprocessing steps
  - filter columns or rows
  - append data to previous version
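For the preprocessing bucket, the steps could be small generic functions along these lines. This is only a sketch on plain lists of dicts (the real steps would presumably operate on DataFrames), and the function names, column names, and sample rows are hypothetical.

```python
def filter_rows(rows, column, allowed):
    """Keep only rows whose value in `column` is in `allowed`."""
    return [r for r in rows if r[column] in allowed]


def select_columns(rows, columns):
    """Keep only the listed columns, in the given order."""
    return [{c: r[c] for c in columns} for r in rows]


def append_to_previous(previous, new):
    """Concatenate the previous version of a dataset with newly pulled rows."""
    return previous + new


def deduplicate(rows, key_columns):
    """Keep the last row seen for each key, ordered by first appearance of the key."""
    latest = {}
    for r in rows:
        latest[tuple(r[c] for c in key_columns)] = r
    return list(latest.values())


# Hypothetical data: the new pull overlaps the previous version on job "A2".
previous = [
    {"job": "A1", "status": "open", "boro": "BK"},
    {"job": "A2", "status": "open", "boro": "MN"},
]
new = [
    {"job": "A2", "status": "closed", "boro": "MN"},
    {"job": "A3", "status": "open", "boro": "QN"},
]

combined = deduplicate(append_to_previous(previous, new), ["job"])
filtered = filter_rows(combined, "boro", {"BK", "MN"})
result = select_columns(filtered, ["job", "status"])
```

Deduplicating after the append means the newer row wins when a key shows up in both versions, which is one plausible rule; a dataset may need a different one.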

fisa_dailybudget has changed since you wrote this up; it now largely follows the same pattern as dob_cofos.

I'm going to add to this any datasets that specify sql in their gdal options in templates. But it seems like, to start, we're working with a pretty short list here.

fvankrieken commented 1 month ago

There are also some cases where we have more complex preprocessing that should still happen post-extraction, but might be specific to the data (and need custom sql). dpr_capitalprojects is a good example. For now, that can just be another available function; we'll see how simple it is to break it up into abstracted bits.
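One way "just another available function" could look is a registry of named steps, where a dataset-specific function sits next to the generic ones and a dataset's config lists which steps to run. Everything below is a hypothetical sketch: the registry, the step names, the config shape, and the body of the custom step are made up, not the existing implementation.

```python
STEP_REGISTRY = {}


def register(name):
    """Decorator that makes a function available as a named processing step."""
    def decorator(func):
        STEP_REGISTRY[name] = func
        return func
    return decorator


@register("strip_whitespace")
def strip_whitespace(rows):
    """Generic step: strip surrounding whitespace from all string values."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        for r in rows
    ]


@register("drop_columns")
def drop_columns(rows, columns):
    """Generic step: remove the listed columns."""
    return [{k: v for k, v in r.items() if k not in columns} for r in rows]


@register("custom_capital_projects")
def custom_capital_projects(rows):
    """Placeholder for dataset-specific logic; here it only title-cases a name."""
    return [dict(r, name=r["name"].title()) for r in rows]


def run_steps(rows, steps):
    """Apply each configured step in order, passing its args as keywords."""
    for step in steps:
        func = STEP_REGISTRY[step["name"]]
        rows = func(rows, **step.get("args", {}))
    return rows


# Hypothetical per-dataset config and input.
steps = [
    {"name": "strip_whitespace"},
    {"name": "drop_columns", "args": {"columns": ["internal_id"]}},
    {"name": "custom_capital_projects"},
]
rows = [{"name": "  prospect park paths ", "internal_id": 7}]
result = run_steps(rows, steps)
```

If a custom step later turns out to decompose cleanly, it can be replaced in the config by several generic steps without changing the runner.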

fvankrieken commented 1 month ago

A checklist to keep track of which scripts have their functionality taken care of (at least in theory). First, the less relevant ones.

[x] bpl_libraries (should be taken care of by json readin)
[x] dcas_ipis
[ ] dcp_censusdata_blocks (slightly more custom excel)
[ ] dcp_censusdata (slightly more custom excel)
[x] dcp_facilities_with_unmapped
[x] dcp_pad
[x] doe_lcgms
[x] doe_pepmeetingurls (custom script source for now)
[ ] excel
[x] hpd_historical_units_by_building
[ ] hra_centers (scrape/soup -> probably custom script source for now)
[ ] moeo_socialservicesitelocations (joins 4 datasets. should be split into 5 recipe datasets)
[ ] nycdoc_corrections (soup. but just one webpage, nothing dynamic)
[ ] nycoc_checkbook (special script. Connector!)
[x] nypl_libraries (should be handled by json parser hopefully)
[x] nysed_nonpublicenrollment
[x] usdot_airports
[ ] usfws_nyc_wetlands (eventually could break into multiple that get merged, for now just keep as custom script source)

Then the ones that are more relevant to the code I'm touching

[x] dcp_sfpsd (filter rows)
[x] dob_cofos (actual preprocessing, append to prev, deduplicate)
[x] dob_now_applications (edm-private)
[x] dob_now_permits (edm-private)
[x] dpr_capitalprojects (simple DF source but multiple processing steps)
[x] fisa_capitalcommitments (set columns, drop column, strip whitespace)
[x] fisa_dailybudget (set columns, drop column, append to prev, custom deduplicate)
[ ] uscourts_courts (funky one. Hits two endpoints, appends, deduplicates. Could maybe break into two, but not sure)

fvankrieken commented 1 month ago

Things not yet captured

- dtypes (thinking specifically of fisa_dailybudget - many numeric and string fields, including a string field that is numeric with leading zeroes)
- more complex logic for dob_cofos (split string within column)
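A small sketch of what these two gaps could look like as steps. The idea for dtypes is that only explicitly listed columns get cast, so a numeric-looking string with leading zeroes is left alone. The function names, column names, and sample row are hypothetical and not taken from fisa_dailybudget or dob_cofos.

```python
def cast_columns(rows, dtypes):
    """Cast only the listed columns; anything not listed stays a string."""
    return [{k: dtypes.get(k, str)(v) for k, v in r.items()} for r in rows]


def split_column(rows, column, separator, new_columns):
    """Split one string column into several, dropping the original column."""
    out = []
    for r in rows:
        # Limit the number of splits so extra separators stay in the last part.
        parts = r[column].split(separator, len(new_columns) - 1)
        row = {k: v for k, v in r.items() if k != column}
        row.update(zip(new_columns, (p.strip() for p in parts)))
        out.append(row)
    return out


# Hypothetical raw rows, all strings as they would come out of a csv reader.
raw_rows = [
    {"agency_code": "057", "amount": "1200.50", "address": "123 Main St, Brooklyn"}
]

typed = cast_columns(raw_rows, {"amount": float})
result = split_column(typed, "address", ",", ["street", "borough"])
```

In pandas terms, the first step roughly corresponds to passing an explicit `dtype` mapping at read time instead of relying on type inference.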

NYCPlanning / data-engineering

Data Library - Move "scripts" logic to processing steps #633