NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Operationalizing GFT #923

Open fvankrieken opened 5 months ago

fvankrieken commented 5 months ago

As GFT is put into production, we need to figure out what keeping it functioning operationally is going to look like. The two main parts are updating source data and QA, with some other trailing tasks as well.

Source Data

Categorizing by source type and how we update them currently

Pluto

Bytes quarterly updates

These are pretty straightforward. Lion has one issue: its parquet currently doesn't get created without an error, so I've been running it manually when it's needed.

"ceqr app" data

Each of these is a bit different under the hood, so this needs some investigation. For now, we can "build" ceqr data.

This includes

ArcGIS Feature Service

The version of these can be determined programmatically, so they should probably be pulled on a weekly basis.
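A minimal sketch of what "determining the version programmatically" could look like, not necessarily how our ingest code does it. It assumes the layer's JSON metadata (the response from requesting the layer URL with `f=json`) includes an `editingInfo` object with `dataLastEditDate` or `lastEditDate` in epoch milliseconds, which hosted feature layers typically expose but not every service does. The function names `version_from_layer_metadata` and `needs_pull` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def version_from_layer_metadata(metadata: dict) -> Optional[str]:
    """Derive a YYYYMMDD version from ArcGIS layer metadata, or None if unavailable.

    Assumes `metadata` is the parsed JSON of a feature layer resource.
    """
    editing_info = metadata.get("editingInfo") or {}
    # Prefer dataLastEditDate (data edits only); fall back to lastEditDate.
    millis = editing_info.get("dataLastEditDate") or editing_info.get("lastEditDate")
    if millis is None:
        return None
    edited = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return edited.strftime("%Y%m%d")


def needs_pull(metadata: dict, archived_version: Optional[str]) -> bool:
    """Hypothetical weekly check: pull when the source version differs from ours."""
    source_version = version_from_layer_metadata(metadata)
    if source_version is None:
        # No edit date exposed, so we cannot tell; pull to be safe.
        return True
    return source_version != archived_version
```

A weekly job could fetch the metadata for each feature service dataset, call `needs_pull` with the latest archived version, and only ingest when it returns `True`.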

Datasets here are

Bytes - unknown frequency of update

Both of these are found here. They are also covered by the task at the bottom of this issue: they should be renamed, because I gave them these horrible unreadable acronyms for some reason.

Socrata

Add these to the weekly Socrata pull if they're not there already.

Script source

Manual updates

For each of these, we need to figure out both the update frequency and whether we can pull it ourselves instead.

QA

This section is a stub for now, but we need to figure out what QA looks like moving forward.

Versioning

Cleanup

fvankrieken commented 4 months ago

dcp_wrp_rec and dcp_wrp_snwa both seem to be available from a feature server (see the download links in Bytes). We should just pull from that if we aren't already, and add them to the regular ArcGIS feature server pulls.