NYCPlanning / data-engineering-qaqc

streamlit app for data engineering
https://edm-data-engineering.nycplanningdigital.com

COLP QAQC: Tickets #164

Open Oysters1874 opened 2 years ago

Oysters1874 commented 2 years ago

Proposed Works

Graphs -- App Side

Display number of records by agency/usetype.
Tasks:

Version-to-version comparison -- App Side (and Maybe COLP Side)

We can display version-to-version changes in the number of records per use type. Since the table (usetype_changes) already exists, this should only need work on the app side.

We can follow the design of the CPDB page for this section.
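As a rough illustration of the comparison this section would display, here is a minimal sketch in plain Python. The record layout (a dict with a `usetype` key) and the function names are assumptions for illustration only; the real `usetype_changes` table is produced by the pipeline and its columns may differ.

```python
from collections import Counter


def usetype_counts(records):
    """Count records per use type; each record is a dict with a 'usetype' key (assumed layout)."""
    return Counter(r["usetype"] for r in records)


def usetype_changes(previous, current):
    """Return (usetype, previous count, current count, difference) rows, sorted by use type."""
    prev_counts = usetype_counts(previous)
    curr_counts = usetype_counts(current)
    rows = []
    # Take the union so use types that appear or disappear between versions are kept
    for usetype in sorted(set(prev_counts) | set(curr_counts)):
        prev_n = prev_counts.get(usetype, 0)
        curr_n = curr_counts.get(usetype, 0)
        rows.append((usetype, prev_n, curr_n, curr_n - prev_n))
    return rows


# Made-up sample records for two hypothetical versions
previous = [{"usetype": "OFFICE"}, {"usetype": "OFFICE"}, {"usetype": "PARK"}]
current = [
    {"usetype": "OFFICE"},
    {"usetype": "PARK"},
    {"usetype": "PARK"},
    {"usetype": "LIBRARY"},
]
changes = usetype_changes(previous, current)
```

On the app side the rows could be passed to a table or bar chart, with the difference column highlighting use types whose counts moved the most.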

Outlier Report -- App Side

Using the two existing QAQC tables, ipis_modified_hnums and ipis_modified_names, we can display the records with modified house numbers and modified parcel names, along with their relevant fields.

Geospatial Check -- Both COLP and App Side

Check whether all properties are within NYC borough boundaries.
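One way to picture this check is a point-in-polygon test for each property. The sketch below uses a plain ray-casting test on a toy rectangle; the boundary coordinates, the record keys (`uid`, `lon`, `lat`) and the function names are all hypothetical. A real implementation would more likely run a spatial join against the official borough boundary polygons.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    The polygon is a list of (lon, lat) vertices; the last vertex connects back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def records_outside(records, boundaries):
    """Return the records whose point falls inside none of the boundary polygons."""
    return [
        r
        for r in records
        if not any(
            point_in_polygon(r["lon"], r["lat"], poly) for poly in boundaries.values()
        )
    ]


# Toy rectangle standing in for a borough boundary (not real geometry)
boundaries = {
    "toy_borough": [(-74.0, 40.6), (-73.9, 40.6), (-73.9, 40.8), (-74.0, 40.8)]
}
records = [
    {"uid": "a", "lon": -73.95, "lat": 40.7},  # inside the toy rectangle
    {"uid": "b", "lon": -75.0, "lat": 41.5},  # outside
]
outside = records_outside(records, boundaries)
```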

Manual Corrections Check - App Side

We can display graphs and dataframes of Manual Corrections Applied and Not Applied by field, just like the PLUTO page does.
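The per-field summary behind such a graph could look roughly like the sketch below. It assumes two lists of correction records, each a dict with a `field` key; those names and the sample values are made up for illustration and are not the actual COLP or PLUTO table layout.

```python
from collections import defaultdict


def corrections_by_field(applied, not_applied):
    """Count applied and not-applied manual corrections per field."""
    summary = defaultdict(lambda: {"applied": 0, "not_applied": 0})
    for correction in applied:
        summary[correction["field"]]["applied"] += 1
    for correction in not_applied:
        summary[correction["field"]]["not_applied"] += 1
    # Sort by field name so the output order is stable for display
    return dict(sorted(summary.items()))


# Made-up sample corrections
applied = [{"field": "parcel_name"}, {"field": "parcel_name"}, {"field": "hnum"}]
not_applied = [{"field": "hnum"}, {"field": "usetype"}]
summary = corrections_by_field(applied, not_applied)
```

Each entry maps a field name to its two counts, which is enough to drive a grouped bar chart plus a table of the underlying records.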

Current QAQC tables:

- Identifying invalid data in IPIS:

  1. ipis_unmapped: unmappable input records
  2. ipis_modified_hnums: records with modified house numbers
  3. ipis_modified_names: records with modified parcel names
  4. ipis_colp_geoerrors: addresses that return errors from 1B
  5. ipis_sname_errors: addresses that return streetname errors from 1B
  6. ipis_hnum_errors: addresses that return address errors from 1B
  7. ipis_bbl_errors: records where address isn't valid for BBL
  8. ipis_cd_errors: mismatch between IPIS community district and PLUTO

- Version-to-version comparison for COLP review:

  1. usetype_changes: version-to-version changes in the number of records per use type

abrieff commented 2 years ago

Do you have a sense of what of this work should be done on the pipeline vs. app side?

Oysters1874 commented 2 years ago

> Do you have a sense of what of this work should be done on the pipeline vs. app side?

Yeah, I can mark that as well. So far, I think all of the existing QAQC tables are already uploaded to DO, so for invalid records, mismatches, and the version-to-version comparison we can display them directly on the app side.

abrieff commented 2 years ago

👍

AmandaDoyle commented 2 years ago

The reports that you have outlined are very useful:

The following may not be so useful:

This is more general, but for COLP and other data products we like to check that all of the geospatial values are in sync. For example, does the first digit of the BIN, BBL, and CD match the boro code? Perhaps there is a way to incorporate this type of check into COLP QAQC, and to design it so that it is easy to replicate across data products.
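A minimal sketch of such a check, written so the list of fields is a parameter and the same function could be reused for other data products. The column names (`uid`, `borocode`, `bin`, `bbl`, `cd`) are assumptions and would need to match each product's schema.

```python
def boro_mismatches(record, id_fields=("bin", "bbl", "cd")):
    """Return the fields whose first digit does not match the record's boro code."""
    boro = str(record["borocode"])
    mismatched = []
    for field in id_fields:
        value = record.get(field)
        if value is None or str(value) == "":
            continue  # missing values are skipped here, not flagged
        if str(value)[0] != boro:
            mismatched.append(field)
    return mismatched


def check_boro_consistency(records, id_fields=("bin", "bbl", "cd")):
    """Map each inconsistent record's uid to its list of mismatched fields."""
    report = {}
    for record in records:
        mismatched = boro_mismatches(record, id_fields)
        if mismatched:
            report[record["uid"]] = mismatched
    return report


# Made-up sample records
records = [
    {"uid": "a", "borocode": 1, "bin": 1001234, "bbl": 1000010001, "cd": 101},
    {"uid": "b", "borocode": 3, "bin": 3000001, "bbl": 4000010001, "cd": 301},
    {"uid": "c", "borocode": 2, "bin": None, "bbl": 2000010001, "cd": 501},
]
report = check_boro_consistency(records)
```

Passing a different `id_fields` tuple is what would make the check portable: a product without a BIN column could call it with only `("bbl", "cd")`.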

I'm happy to meet to talk anything through if helpful. Looking forward to seeing this.