Operationalizing GFT - Githubissues

As GFT goes into production, we need to figure out what keeping it running operationally is going to look like. The two main parts are updating source data and QA, with some other trailing tasks as well.

Source Data

Datasets are categorized below by source type and by how we currently update them.

Pluto

dcp_mappluto_wi - self-explanatory

Bytes quarterly updates

These are pretty straightforward. Lion has one issue: its parquet can't currently be created without error, so I've run it manually when needed.

dcp_boroboundaries_wi
[ ] dcp_lion - needs some investigation into why GDAL errors when creating the parquet

"ceqr app" data

Each of these is something a bit different under the hood and needs some investigation. For now, we can "build" ceqr data.

[ ] figure out short and long term plan for ceqr app datasets

This includes

dep_cats_permits
nysdec_state_facility_permits
nysdec_title_v_facility_permits

ArcGIS Feature Service

The version of these can be determined programmatically, so they should maybe be pulled on a weekly basis.

[ ] set up recurring job

Datasets here are

dcp_cscl_commonplace
dcp_cscl_complex
nysdec_freshwater_wetlands_checkzones
nysdec_freshwater_wetlands
nysdec_tidal_wetlands
nysdec_priority_lakes
nysdec_priority_estuaries
nysdec_priority_streams
nysdec_natural_heritage_communities
nysparks_historicplaces_esri
nysshpo_historic_buildings_points
nysshpo_historic_buildings_polygons
nysshpo_archaeological_buffer_areas
dcp_waterfront_access_map_wpaa
dcp_waterfront_access_map_pow
nysparks_parks_polygons
usnps_parks
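
Since the version of these feature services can be determined programmatically, a recurring job could compare that version against what is already archived before pulling. The following is a minimal sketch, not existing dcpy code. It assumes the layer metadata JSON returned by the service exposes `editingInfo.lastEditDate` as epoch milliseconds (hosted feature layers commonly do, but this should be checked per dataset), and that a `YYYYMMDD` string is an acceptable version format. The function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def version_from_layer_metadata(metadata: dict) -> str:
    """Derive a YYYYMMDD version string from a layer's metadata JSON.

    Assumption: the metadata contains editingInfo.lastEditDate
    as epoch milliseconds.
    """
    last_edit_ms = metadata.get("editingInfo", {}).get("lastEditDate")
    if last_edit_ms is None:
        raise ValueError("metadata has no editingInfo.lastEditDate")
    last_edit = datetime.fromtimestamp(last_edit_ms / 1000, tz=timezone.utc)
    return last_edit.strftime("%Y%m%d")


def needs_pull(metadata: dict, archived_version: Optional[str]) -> bool:
    """Return True if the service is newer than the archived version.

    YYYYMMDD strings sort chronologically, so plain string
    comparison is enough here.
    """
    if archived_version is None:
        return True
    return version_from_layer_metadata(metadata) > archived_version
```

A weekly job would fetch each layer's metadata, call `needs_pull` with the latest archived version, and only archive when it returns `True`.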

Bytes - unknown frequency of update

Both of these are found here. They also have the task at the bottom of this issue: they should be renamed, because I gave them these horrible unreadable acronyms for some reason.

[ ] dcp_wrp_rec
[ ] dcp_wrp_snwa

Socrata

Add these to the weekly Socrata pull if they're not there already.

[ ] dpr_forever_wild
[ ] lpc_scenic_landmarks
[ ] lpc_historic_district_areas
[ ] lpc_landmarks
[ ] dpr_parksproperties
[ ] dpr_schoolyard_to_playgrounds
[ ] dcp_edesignation_csv

Script source

[ ] usfws_nyc_wetlands - need to investigate update frequency. This comes from a script because the dataset is published either by state (approaching actual big data) or by watershed. NYC is contained in 4 watersheds, so the script pulls all 4, concatenates them, and archives the result
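
The concatenation step for usfws_nyc_wetlands can be sketched roughly as below. This is not the actual script: it assumes each watershed extract has already been loaded as a GeoJSON FeatureCollection dict, which may not match the real download format, and the function name is hypothetical.

```python
def concat_feature_collections(collections: list[dict]) -> dict:
    """Merge several GeoJSON FeatureCollections into one.

    Assumption: every input is a dict shaped like a GeoJSON
    FeatureCollection, one per watershed extract.
    """
    features = []
    for collection in collections:
        if collection.get("type") != "FeatureCollection":
            raise ValueError("expected a GeoJSON FeatureCollection")
        # Keep features in input order; no de-duplication is attempted.
        features.extend(collection.get("features", []))
    return {"type": "FeatureCollection", "features": features}
```

Features that straddle a watershed boundary could appear in more than one extract, so the real script may need a de-duplication step that this sketch leaves out.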

Manual updates

For each of these, we need to figure out both the update frequency and whether we can pull it ourselves instead.

[ ] dcp_air_quality_vent_towers
[ ] dcm_arterial_highways
[ ] panynj_jfk_65db
[ ] panynj_lga_65db
[ ] dcp_beaches
[ ] dob_natural_resource_check_flags
[ ] dcp_pops

QA

This section is a stub for now, but we need to figure out what this looks like moving forward

Versioning

[ ] add logic to dcpy plan to determine version of product from one of the sources (in this case, pluto)
[ ] ensure the Data Sources link in the app points to a place where the version of GFT data is visible (Bytes, once we start putting it there)
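
For the first versioning task, the idea of taking the product version from one of the sources can be sketched as follows. This is purely illustrative and is not dcpy's API: the function name, the shape of `source_versions`, and the version strings in the usage example are all assumptions.

```python
def product_version(
    source_versions: dict[str, str],
    version_source: str = "dcp_mappluto_wi",
) -> str:
    """Pick the product version from one designated source dataset.

    source_versions is a hypothetical mapping of dataset name to the
    version that was resolved for it when planning a build.
    """
    try:
        return source_versions[version_source]
    except KeyError:
        raise KeyError(
            f"version source '{version_source}' is not among the resolved sources"
        ) from None
```

For example, `product_version({"dcp_mappluto_wi": "24v1", "dcp_lion": "24a"})` returns `"24v1"`, so the GFT build would carry the same version as the pluto release it was built from.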

Cleanup

[ ] rename dcp_wrp_rec and dcp_wrp_snwa to ditch horrible acronyms. Not sure why I did this. Long dataset names are way better than unreadable dataset names

NYCPlanning / data-engineering

Operationalizing GFT #923