Outlier data in Property Values

lyzidiamond commented 9 years ago

There are a couple properties that have really insanely high values -- like > $7,000,000. If we pull those values out we might get a better distribution. First I think we should re-geocode the original data to see if we're missing some middling values, but I wanna know what you guys think.

@eeeschwartz @livienyin

eeeschwartz commented 9 years ago

ok yeah we can pull my house out no bigs

eeeschwartz commented 9 years ago

Truthfully I don't have a great sense of how to approach that problem. I'd be interested to talk to someone with more experience

lyzidiamond commented 9 years ago

The original suggestion to pull out the outliers came from @buckleytom, who maybe can help?

livienyin commented 9 years ago

How did Charlotte handle outliers? Seems dishonest to take 'em out entirely but I understand the issue with skewing the distribution... let's take additional suggestions from those who are experienced with this type of issue.

lyzidiamond commented 9 years ago

Removing outliers is a common statistical practice. The only thing that's dishonest about it is not disclosing that you removed them, so we'd make sure to include that information on the page.
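For reference, a minimal sketch of one common way to do this and disclose it, assuming Tukey's fences (1.5 × IQR) as the cutoff rule. The rule, the `split_outliers` helper, and the sample values are illustrative assumptions, not something this project has settled on and not the Lexington data.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (kept, removed) using Tukey's fences (k * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if v < low or v > high]
    return kept, removed

# Made-up property values for illustration only.
values = [85_000, 120_000, 140_000, 155_000, 180_000, 210_000, 260_000, 7_500_000]
kept, removed = split_outliers(values)

# The disclosure line is the part that keeps the practice honest.
print(f"Excluded {len(removed)} of {len(values)} properties as outliers: {removed}")
```

Whatever rule ends up being used, the count and the rule itself are what would go on the page.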

livienyin commented 9 years ago

Sounds good, I'm not very familiar with statistical practices. Might be significant if we ever have voting precincts with multiple outliers? Probably not the case atm.

lyzidiamond commented 9 years ago

Doing some cursory research, there is also the option of adding another transformation on top of the analysis to display the variance. But that gets into the weeds on a problem we are already having with this project. Ultimately, using the same algorithm to distribute every dataset could be considered dishonest in itself :wink:

tbuckl commented 9 years ago

Hey guys, sorry, I just saw this. How about user-testing it? That is, show people the results with and without outliers, and ask them what they understand of it? In fact, you could do this among yourselves. I only suggested pulling the outliers because usually you use sweet graphixxxx to explain things to people and if all your graphics are doing is highlighting that Erik is so much richer than everyone else then maybe you can just have a picture of Erik in the corner jumping into a pile of coins.

daguar commented 9 years ago

The approach I'd take to this is to do some sensitivity analysis: compare the outputs of (a) outliers included against (b) outliers removed.

The criterion I'd use to assess it is asking yourself — "what's the top-level insight someone should be taking away from the data?"

Then you can look at (a) and (b) and ask which best visually conveys that insight.

Concretely with relative property value levels, I imagine the biggest issue is that the high-value outliers are screwing up the color gradient buckets? (i.e., neighborhoods with variation in the middle look too similar because the scale is going way up?)

If that's the case, it's totally defensible to say "we exclude these outliers because (a) the highest-value neighborhoods still appear as the highest-value ones, but (b) it's easier to see the differences between other neighborhoods [which was lost in the original]."
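A rough sketch of that comparison, assuming the color scale uses equal-width buckets (the thread doesn't say which classification the map actually uses). The `equal_interval_bucket` and `bucket_counts` helpers, the $1,000,000 cutoff, and the values are all hypothetical.

```python
from collections import Counter

def equal_interval_bucket(value, lo, hi, n_buckets=5):
    """Return the 0-based index of the equal-width bucket containing value."""
    if hi == lo:
        return 0
    idx = int((value - lo) * n_buckets / (hi - lo))
    # Clamp so values outside [lo, hi] land in the first or last bucket.
    return min(max(idx, 0), n_buckets - 1)

def bucket_counts(values, scale_values, n_buckets=5):
    """Count values per bucket, with the scale built from scale_values."""
    lo, hi = min(scale_values), max(scale_values)
    counts = Counter(equal_interval_bucket(v, lo, hi, n_buckets) for v in values)
    return [counts.get(i, 0) for i in range(n_buckets)]

# Made-up values for illustration only.
values = [85_000, 120_000, 140_000, 155_000, 180_000, 210_000, 260_000, 7_500_000]
trimmed = [v for v in values if v < 1_000_000]  # hypothetical cutoff

with_outliers = bucket_counts(values, values)      # (a) scale includes outliers
without_outliers = bucket_counts(values, trimmed)  # (b) scale ignores outliers

print("(a)", with_outliers)
print("(b)", without_outliers)
```

In this toy data, (a) puts every non-outlier into the lowest bucket, while (b) spreads them across the scale and the outlier still lands in the top bucket because of the clamp.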

Then again it's totally possible that I completely misunderstand your problem and am drinking coffee and saying stuff that is totally irrelevant. In which case, here's a corgi.

[image: corgi]

tbuckl commented 9 years ago

i agree with dave about the corgi.

also, lyzi, here's a graph with a log-scaled x axis.
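For anyone following along without the graph, here is a small sketch of what a log scale does to the range, assuming a base-10 log and made-up values (not the Lexington data).

```python
import math

# Made-up property values for illustration only.
values = [85_000, 120_000, 140_000, 155_000, 180_000, 210_000, 260_000, 7_500_000]

# On a log10 scale, equal distances mean equal ratios rather than equal dollar amounts.
log_values = [math.log10(v) for v in values]

raw_ratio = max(values) / min(values)            # how many times larger the max is
log_span = max(log_values) - min(log_values)     # the same gap, in orders of magnitude

print(f"max is {raw_ratio:.1f}x the min, which is {log_span:.2f} orders of magnitude")
```

The outlier stays in the data, but it no longer stretches the axis so far that everything else is squashed together.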

codeforamerica / lexington-qold, issue #56