NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team
14 stars 0 forks source link

Pluto 23v3.1 QA issues #405

Closed fvankrieken closed 5 months ago

fvankrieken commented 6 months ago

@jackrosacker flagged on teams earlier this week

@AmandaDoyle fyi

Flagging an issue we've found in our MapPLUTO QA. Looks like some tax lot changes have been incorporated into the most recent 23v3.1 minor build. Briefly, it looks like the 23v3.1 build has 6 lots more than the 23v3 build, and further that there is a mismatch of ~60 BBLs when joining on that column. From an initial look, the mismatches look mostly to do with lot splits/merges and condo creation.

Version Source Record Count
23v3.1 source 857,042
23v3 source 857,036

23v3 being pulled from published version 23v3, 23v3.1 from draft fvk-pluto-versioning. These record counts are using the output mappluto_unclipped_gdb.gdb.zip

A couple specific bbls - from @jackrosacker querying data loaded to the gis sde, the following bbls were present in 23v3 but not in 23v3.1

1001240008 1007380008 1013460012 1014290003 1014290044 1015700017 1021520083 2025620049 2027240032 2029460047 2030920003 2031430282 2044270013 2044270014 2044270015 2044270016 2059060342 3001190055 3002787502 3004350047 3009340069 3010400013 3011120012 3018400001 3023660032 3025200001 3036870020 3036970033 3048140026 3049940013 3056210005 3056820045 3063250030 3064110022 3067670090 4004300008 4004300010 4004300011 4004300012 4009100001 4010780051 4050667502 4050667505 4076217501 5010080049 5014957501 5024017501 5024507504 5030030025 5030247501 5031737504 5038320021 5065800056 5065800058 5070557502 5079390056 5080400021

A specific case, there's a polygon that seems to be consistent yet change bbl from 3002787502 to 3002787501

fvankrieken commented 6 months ago

I've verified that source data versions are consistent for all that should be - only zoning data sources are updated from 23v3 to 23v3.1

fvankrieken commented 6 months ago

Data available to be investigated right now

fvankrieken commented 6 months ago

I've done a bit of digging which I'll try to summarize in a bit, but as a bit of a first step am going to rebuild pluto with identical inputs to 23v3. We'll first need to compare outputs to 23v3, identify anything that is truly nondeterministic. Then we can start comparing intermediate tables to 23v3.1 and try to figure out where issues are arising from, whether they are from semi-nondeterministic code (akin to condo proxcode issue) or "intended" behavior - some sort of filtering at some step that new zoning data rightfully changes for a certain lot, etc

fvankrieken commented 6 months ago

Resolved 3002787502 morphing into 3002787501. These are billing/primebbls which share a base_billing_bbl (in dtm_condo). dtm only has one polygon for the basebbl and nothing smaller that could correspond to 7501 and 7502. @AmandaDoyle thinks this is likely a dtm issue, that hopefully it will "catch up" with reality at some point soon.

Nothing should be done to code, since multiple base bbls per billing is not valid. However, it might be worth adding code to at least make this deterministic (grab smallest primebbl associated with basebbl).

Logic around this is in dtmmergepolygons.sql

fvankrieken commented 6 months ago

24 of these are condo base bbls that shouldn't have been in 23v3 to begin with. As of this moment, unsure how they ended up in it

condo_base_bbl condo_billing_bbl
1007380008 1007387505
1013460012 1013467504
1015700017 1015707501
1021520083 1021527501
2025620049 2025627501
2027240032 2027247503
2059060342 2059067502
3001190055 3001197501
3004350047 3004357506
3009340069 3009347513
3010400013 3010407502
3011120012 3011127504
3018400001 3018407502
3023660032 3023667502
3025200001 3025207501
3048140026 3048147506
3049940013 3049947501
3056210005 3056217510
3056820045 3056827502
3063250030 3063257502
3064110022 3064117504
3067670090 3067677502
4009100001 4009107502
5038320021 5038327501
fvankrieken commented 6 months ago

Simpler issue than I thought, at least in terms of why the source data seems different. See timestamps

image

If it looks like a lot change, it's probably a lot change.

We recently discussed that when pushing draft outputs, we should be cleaning (deleting) any existing files in s3 prior to pushing. This was an insidious case of a silent failure in the October build on @sf-dcp 's branch, where the gdb export failed. set e either isn't working as we intended within the function call (we need to maybe explicitly capture the gdal error and then exit), or gdal itself failed silently. But outputs were made MINUS the gdbs, which were then pushed to s3. Old outputs overwritten except for the gdbs, which remained