NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Data Library Overhaul #484

Open fvankrieken opened 7 months ago

fvankrieken commented 7 months ago

Motivations

Data library is a great starting point for the "extract" portion of dcpy, but there are multiple ways it's not meeting our needs. Our main area of focus is data quality, on both the passive/automated side of things and the reactive side. I.e., asking questions like

Goals

Essentially, we want to go from the current process, roughly this: [image]

To something more like this: [image]

With a couple key notes

damonmcc commented 5 months ago

notes from Roadmapping on 2/28

idea for ideal "extract" (aka data-library) process

  1. "source" section: archival of raw data
  2. "ingestions" of raw data into standard "transformation" format
  3. running of processing steps
    • custom per dataset (current library scripts)
    • tabular data cleaning
      • read columns as string, lpad w/ zeroes
      • NAs as empty strings
    • geospatial
      • geocoding via geosupport
      • make_valid
      • set crs
  4. export from "transformation" format to any desired formats
  5. push to S3
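
A rough sketch of what the "tabular data cleaning" part of step 3 could look like, using only the Python standard library. This is not existing dcpy or library code: `clean_rows`, `read_csv_as_strings`, `NA_TOKENS`, and the `pad_widths` argument are hypothetical names made up for illustration, and the list of NA tokens is an assumption since the notes don't specify one.

```python
import csv
import io

# Assumed set of tokens treated as missing; the notes above do not define one.
NA_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def read_csv_as_strings(text):
    """Read CSV text into a list of dicts.

    csv.DictReader does no type inference, so every value stays a string
    and leading zeroes already present in the file are preserved.
    """
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows, pad_widths=None):
    """Apply the two tabular cleaning rules from the notes.

    rows: iterable of dicts mapping column name -> raw string (or None)
    pad_widths: hypothetical mapping of column name -> width to left-pad
                with zeroes (e.g. {"zipcode": 5})
    """
    pad_widths = pad_widths or {}
    cleaned = []
    for row in rows:
        out = {}
        for col, value in row.items():
            text = "" if value is None else str(value).strip()
            if text.lower() in NA_TOKENS:
                # NAs become empty strings
                text = ""
            elif col in pad_widths:
                # lpad with zeroes up to the configured width
                text = text.rjust(pad_widths[col], "0")
            out[col] = text
        cleaned.append(out)
    return cleaned


raw = "bbl,zipcode,name\n1000010001,7302,NA\n1000010002,N/A,Park\n"
cleaned = clean_rows(read_csv_as_strings(raw), pad_widths={"zipcode": 5})
for row in cleaned:
    print(row)
```

Padding widths are passed in per column here because the right width depends on the dataset. The geospatial steps (geocoding, make_valid, set crs) are left out of this sketch since they depend on geosupport and geospatial libraries.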

next steps