NYCPlanning / db-housing

1 stars 3 forks source link

Flag as outlier #27

Closed mqli322 closed 6 years ago

mqli322 commented 6 years ago

Where units = 9999

AmandaDoyle commented 6 years ago

@mqli322 @AmandaEyer qc_outliers.sql is the logic for how outliers are currently flagged - it's flagging the top 10 records with the greatest units_net count, but it seems that 10 is too small and not capturing these other jobs. @AmandaEyer Should I up the count to 20? There are 18 records where units_net = 99

AmandaDoyle commented 6 years ago

or change the logic so that it flags all jobs where the units_net is greater than a certain value

mqli322 commented 6 years ago

both jobs had 9999 in u_init and u_prop. do we care about flagging outliers where the underlying data is clearly wrong, even though there's no impact on u_net?

AmandaEyer commented 6 years ago

I think that if we know that a record is wrong, we should flag it as an outlier or correct it, whether or not there's an impact?

Yes, I think top 20 is good for outlier checking. Those require manual checks then, correct?

AmandaDoyle commented 6 years ago

Okay so right now in qc_outliers we're only pulling the top 20 and bottom 20 units_net

So should we also include the top 20 and bottom 20 from u_init and u_prop?

AmandaEyer commented 6 years ago

No. I checked and I don't think that add value. Perhaps just mark as outlier any record where u_init or u_prop is 9999?

AmandaDoyle commented 6 years ago

I rather not because that is so specific. How about we just leave qc_outliers.sql as is Where an outlier is -- 1) top 20 and bottom 20 for units_net -- 2) units_net_complete is 50+ units greater than units proposed (NBs only)

AmandaEyer commented 6 years ago

Okay.

So, what do we do with the 9999 records. Just "correct" them in the corrections sheet, maybe?

for 1) that is the logic for checking IF they are valid outliers, not actually marking them as outliers, correct? Because out of the 20 that Jackie and Hannah identified as potential outliers, only 4 records turned out to actually be outliers. They still did a manual check after the first outlier flag.

AmandaDoyle commented 6 years ago

Re 9999 I can include the top 20 and bottom 20 from u_init and u_prop Or I can do nothing and you can find these outliers on your own and add them to the dcp_attributes.csv

For outliers, yes these are outputted to qc_outliers.csv for your review

AmandaEyer commented 6 years ago

Okay, please include only the TOP 20 from u_init and u_prop

AmandaDoyle commented 6 years ago

okay

AmandaEyer commented 6 years ago

Sorry I changed my opinion! wrote too quickly.

AmandaDoyle commented 6 years ago

... okay so what would you like?

AmandaEyer commented 6 years ago

the TOP 20 from u_init and u_prop

AmandaDoyle commented 6 years ago

okay

AmandaEyer commented 6 years ago

thanks!

AmandaDoyle commented 6 years ago

Added logic in qc_outliers to take top 20 and bottom 20 from units_init and units_prop

AmandaDoyle commented 6 years ago

Ah didn't get that. Okay only flagging top 20 from u_init and u_prop, but not the bottom 20.