NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team
20 stars 0 forks source link

PLUTO QA enhancements, expand `dcpy` repeat build functionality #835

Closed fvankrieken closed 4 months ago

fvankrieken commented 4 months ago

Does NOT actually close #657,

Closes #538

Part of #831

Ended being a bunch of little stuff in here, so will go commit by commit

QAQC Enhancements (#538) See issue for screenshots

  1. Move logic to a component. Mainly wanted as part of commit 2 but this makes diffs in commit 2 easier to see
  2. Add selectbox that lets you choose whether the other "diff" being shown is the previous, or previous of the same build type (major vs minor). The current build is a good example. We built 24v2. The prev is 24v1.1, so our "current" diff is "24v2 - 24v1.1". The former default, "previous previous" diff would be "24v1.1 - 24v1". But if we want to compare this to the last "major" diff instead, we can now do that, where the comparison diff will be "24v1 - 23v3.1" instead. There are cases where these would be the same - if we didn't have a minor last build and current was "24v2 - 24v1", either option would be "24v1 - 23v3.1". Same if we have 24v2.2 - both will be "24v2.1 - 24v2" This works for minors as well. 24v1.1 (with diff "24v1.1 - 24v1") can either look at diff "24v1 - 23v3.1" or "23v3.1 - 23v3" This commit includes a lot of renaming - instead of just "v1" "v2" and "v3" thought it's best to describe. So "version"/"v", "previous", "comparison", and "comparison_previous" seemed intuitive to me. "previous" always describing the directly preceding build, and "comparison" describing the build whose diff we want to compare to.
  3. Adding new fields to outlier reports so that GIS can filter them out of the tables. So adding an additional column to the QA outputs as well as a checkbox in the app to filter rows. This checkbox is grayed out (with a useful tooltip) if the build doesn't have the new columns
  4. add a few comments and axes labels as requested

Non-correction or QA changes

  1. don't vacuum in pluto build. This took a little while, and was added to attempt to reclaim space in constrained github runner temp postgres instances. Our db seems to have aggressive autovacuuming set up, so this seems unneeded
  2. we've moved away from using .envs in this manner - I personally would rather run set_env manually before a build. Maybe worth broader discussion
  3. Fix a bug introduced in #801 (or rather not entirely fixed - the fix was needed in 3 lines, I only added it on 2)
  4. Rework "repeat" build functionality. I wanted to be able to repeat a draft, rather than only published versions, since once GIS begins QAing a dataset, it seems like we should be locking versions of all input datasets as to not introduce potentially new errors. This was tough with the way I initially implemented this. I decoupled normal recipe-planning and this repeat logic since they require fairly different inputs, and we can't have dynamic inputs to github actions
fvankrieken commented 4 months ago

Passing regular build here

Passing repeat build here

fvankrieken commented 4 months ago

LGTM! I like the plan_command concept

🤞🏾 for nightly QA after this is merged

Nightly QA is passing already! It's the weird random job failures on pushes

fvankrieken commented 4 months ago

Just rebased main, will make sure tests pass before merging