NYCPlanning / data-engineering-qaqc

streamlit app for data engineering
https://edm-data-engineering.nycplanningdigital.com
1 stars 0 forks source link

DevDB QAQC: Spike Organization Ticket #114

Closed abrieff closed 1 year ago

abrieff commented 2 years ago

Proposed Work

New QAQC Table

DevDB Side

Because the existing QAQC is immutable, we would like to build a new QAQC report on top of it, which would accomplish two main goals: filtering out records with no flags, and attaching additional variables (Job Type, Job Status, etc.) for filtering on the app side.

App Side

Dataframe Display, with options to filter by key fields(Job Type, Job Status) and by flag type, to view the job numbers of records with each particular flag.

QAQC Version Aggregation Report

DevDB Side

New report that includes one row per version, and totals the number of records flagged per QAQC check.

App Side

Graphs displaying the # of flagged records for each version and QAQC check

QAQC Flag Change over Time

DevDB Side

Could use some direction on the exact format of this - the idea of checking how the flags on records change over time, and how that maps onto the manual review process (ie: Excluding records that have been checked and are known to be fine or fixed)

Field Distribution Reports

DevDB Side

For Job Status and Job Type, create a new report that includes the # of new records of each possible type created since the previous version, to be able to map over versions anomalies in the assigning of statuses or types

QAQC Side

Graph for each field showing the distribution of new records by field value, with multiple lines for previous versions

Complete Year/Complete Quarter Report

DevDB Side

There seem to be two separate checks here, one on a per-record basis, and one on the aggregate table.

  1. For each record (perhaps to be included in the above new QAQC table) are the complete date, complete quarter and complete year all in alignment?)
  2. What is the distribution of records by complete quarter? Are there any records with the most recent quarter present?

    App Side

    Graph showing the distribution of completed records by quarter

ClassA_Net Check

DevDB Side

Add an additional flag to the new QAQC table for checking that classa_init - classa_prop = classa_net

Manual Corrections

App Side

Graphs and Dataframe Displays of Manual Corrections Applied and Not Applied by field

abrieff commented 2 years ago

In our discussion, somehow I'd missed final_qaqc

QAQC Table Columns

a. job_number, b. b_likely_occ_desc - checking occ_initial, occ_proposed, and job_desc, c. b_large_alt_reduction - Alteration where classa_net unit change < -5 d. b_nonres_with_units - nonresidential (according to occ_initial or proposed) with non-zero classa_init or _prop, e. units_co_prop_mismatch - checks for all new buildings/alterations places where classa_prop is not equal to co_latest_units, f. partially_complete - job status - 4. Partially Completed, g. units_init_null - demolitions and alterations in residential with null classa_init, h. units_prop_null - new buildings and alterations in residential with null proposed units, j. units_res_accessory - looks to be highlighting units around ADUs, either proposed or initially, k. outlier_demo_20plus - Demolitions where there are larger than 19 classa_init units in the building, l. outlier_nb_500plus - New Building where the number of classa_prop proposed units > 499, m. outlier_top_alt_increase - Whether its one of the 20 largest alterations by classa_net increase in units, n. dup_bbl_address_units, o. dup_bbl_address, p. inactive_with_update - Date Last Updated > Last Captured Updated Date and job_inactive is set to new value, q. no_work_job - checks against a number of job_descriptions and job types to determine if job has work to be done, r. geo_water - whether its in water s. geo_taxlot- if the bbl is missing t. geo_null_latlong - latlong is null u. geo_null_boundary - any of the boundaries (fire district, etc.) are missing v. invalid_date_filed - when last filed date is not a date or is before 1990, w. invalid_date_lastupdt - when last updated date '', x. invalid_date_statusd - when statusd date '', y. invalid_date_statusp - when statusp date '', z. invalid_date_statusr - when statusr date '', aa. invalid_date_statusx - when statusx date '', ab. incomp_tract_home - tracthomes flag is Y and job status in 1, 2, 3, ac. dem_nb_overlap - checking duplicates between new buildings and demolitions

abrieff commented 2 years ago

Linking to another past brainstorm about QAQC https://github.com/NYCPlanning/db-developments/issues/428

abrieff commented 2 years ago

linking to past work on getting hed input https://github.com/NYCPlanning/db-developments/issues/33 - some of the descriptions in the document may be worth re-evaluating for additional QAQC checks to add, but probably represents an avenue for v2 rather than initial page

abrieff commented 2 years ago

From HED: Field Groupings: Units, Class B, Outliers, Duplicates, Invalid Dates, Other, Geography, dem_nb_overlap

abrieff commented 2 years ago

Checking flags (as problematic) across versions -

AmandaDoyle commented 2 years ago
abrieff commented 2 years ago

moved to top of ticket

td928 commented 2 years ago

would it be okay for me to add my comments to your post @abrieff ?

Let me try to write it down here first as well:

Manual Corrections

App Side

None of the Corr_HNY_Matches records with add to certain job numbers should not be part of the HNY_no_match. If it does show up in the no match list, then the corrections did not got applied correctly.

abrieff commented 2 years ago

sure

SashaWeinstein commented 2 years ago

What does "display job number list for each of the qaqc checks" mean? Does that mean select a check and see all job numbers that are "yes" for that check?

abrieff commented 2 years ago

Yes (any alternative ideas welcome though)

SashaWeinstein commented 2 years ago

Looking at

New Report in DevDB with one row per version, that aggregates each of the boolean flags to track change over time

Do we want this to work like pluto where each new version is appended to a table pulled down from the main folder in DO? This approach takes some time to implement (faster now that I've done it for pluto) but it's probably the most robust. It also allows us to compare the results of different branches for a given version.

abrieff commented 2 years ago

Yes, that was my thoughts for it

SashaWeinstein commented 2 years ago

Wrote up more of my thoughts on this subject https://github.com/NYCPlanning/db-developments/issues/518

abrieff commented 2 years ago

Perfect thank you

damonmcc commented 1 year ago

closing because all linked tasks are closed