Closed fvankrieken closed 3 weeks ago
My notes after reviewing the logic of the scripts in the script directory:
dcp_pad, dcp_facilities_with_unmapped
dob_cofos
bpl_libraries
dob_now_applications, dob_now_permits
doe_pepmeetingurls, moeo_socialservicesitelocations
fisa_capital_commitments
fisa_dailybudget
hra_centers
After one of these steps, read the data into a pandas DataFrame and save it to CSV locally (using Scriptor).
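A minimal sketch of that last step, assuming the custom extraction has already produced something pandas can read. It uses plain pandas rather than Scriptor (whose interface isn't shown here), and the function name, sample rows, and output path are all hypothetical:

```python
import io
import os
import tempfile

import pandas as pd


def extract_to_csv(source, output_path):
    """Read a CSV-like source into a DataFrame and save it locally as CSV."""
    # dtype=str keeps every column as text, so nothing is coerced on read.
    df = pd.read_csv(source, dtype=str)
    df.to_csv(output_path, index=False)
    return df


# Stand-in for whatever a custom script fetched (made-up rows).
raw = io.StringIO("name,borough\nCentral Library,Brooklyn\nKings Highway,Brooklyn\n")

output_path = os.path.join(tempfile.mkdtemp(), "example_dataset.csv")
df = extract_to_csv(raw, output_path)
```

The point is only that everything downstream of the custom step looks the same regardless of dataset, which is what makes the custom part worth isolating.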
It does seem like we need to expand our current ingestion code, as most of these scripts have custom ingestion logic.
cc: @fvankrieken
So of these, we can put things in a couple of buckets, the first two being upstream of the "processing steps" part of our process
to_parquet part
Leaving ones that either
fisa_dailybudget has changed since you wrote this up; it now largely follows the same pattern as dob_cofos
Going to add to this any datasets that specify sql in their gdal options in templates. But it seems like, to start, we're working with a pretty short list here
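For finding those datasets, something like the following could work. This is only a sketch: it assumes the templates have already been parsed into dicts, and the `gdal_options` key and the two option spellings it checks (`-sql ...` and `SQL=...`) are guesses at the template layout rather than the actual schema:

```python
def templates_with_sql(templates):
    """Return the names of templates whose GDAL options mention sql."""
    names = []
    for name, template in templates.items():
        # "gdal_options" is a hypothetical key; adjust to the real template schema.
        options = template.get("gdal_options", [])
        if any(str(opt).lower().lstrip("-").startswith("sql") for opt in options):
            names.append(name)
    return sorted(names)


# Made-up templates, just to exercise the function.
templates = {
    "dataset_a": {"gdal_options": ["-sql", "SELECT * FROM layer WHERE x > 0"]},
    "dataset_b": {"gdal_options": ["AUTODETECT_TYPE=NO"]},
    "dataset_c": {"gdal_options": ["SQL=SELECT * FROM layer"]},
    "dataset_d": {},
}

matches = templates_with_sql(templates)
```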
There are also some cases where more complex preprocessing should still happen post-extraction, but might be specific to the data (and need custom sql). dpr_capitalprojects is a good example. For now, that can just be another available function; we'll see how simple it is to break it up into abstracted bits
A checklist to keep track of which scripts have their functionality taken care of (at least in theory). First, the less relevant ones.
Then the ones that are more relevant to the code I'm touching
Things not yet captured
Once #632 is complete. This has two main parts, so far