lyzidiamond closed this 6 years ago
ok yeah we can pull my house out no bigs
Truthfully I don't have a great sense of how to approach that problem. I'd be interested to talk to someone with more experience.
The original suggestion to pull out the outliers came from @buckleytom, who maybe can help?
How did Charlotte handle outliers? Seems dishonest to take 'em out entirely but I understand the issue with skewing the distribution... let's take additional suggestions from those who are experienced with this type of issue.
Removing outliers is a common statistical practice. The only thing that's dishonest about it is not disclosing that you removed them, so we'd make sure to include that information on the page.
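For anyone who wants to try this, here is a minimal sketch of removing outliers while keeping a record of what was dropped, so the removal can be disclosed on the page. It uses the 1.5 × IQR fence, which is one common convention and not something this thread has agreed on. The values are made up for illustration, and `split_outliers` is a hypothetical helper name.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (kept, removed) using the k * IQR fence rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if v < low or v > high]
    return kept, removed

# Hypothetical property values in dollars, NOT the project's real data.
values = [90_000, 120_000, 150_000, 210_000, 320_000, 7_500_000]

kept, removed = split_outliers(values)

# Disclosure text that could go on the page alongside the graphic.
note = f"{len(removed)} of {len(values)} properties excluded as outliers."
print(note)
```

Returning the removed values, rather than silently filtering them, is what makes the disclosure easy to write.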
Sounds good, I'm not very familiar with statistical practices. Might be significant if we ever have voting precincts with multiple outliers? Probably not the case atm.
Doing some cursory research, there is also the option of adding another transformation on top of the analysis to display the variance. But that gets into the weeds on a problem we are already having with this project. Ultimately, using the same algorithm to distribute every dataset could be considered dishonest in itself :wink:
Hey guys, sorry, I just saw this. How about user-testing it? That is, show people the graphic with and without outliers, and ask them what they understand of it? In fact, you could do this among yourselves. I just suggested it because usually you use sweet graphixxxx to explain things to people and if all your graphics are doing is highlighting that Erik is so much richer than everyone else then maybe you can just have a picture of Erik in the corner jumping into a pile of coins.
The approach I'd take to this is to do some sensitivity analysis: compare the outputs of (a) outliers included against (b) outliers removed.
The criterion I'd use to assess it is asking yourself — "what's the top-level insight someone should be taking away from the data?"
Then you can look at (a) and (b) and ask which best visually conveys that insight.
Concretely with relative property value levels, I imagine the biggest issue is that the high-value outliers are screwing up the color gradient buckets? (i.e., neighborhoods with variation in the middle look too similar because the scale is going way up?)
If that's the case, it's totally defensible to say "we exclude these outliers because (a) the highest-value neighborhoods still appear as the highest-value ones, but (b) it's easier to see the differences between other neighborhoods [which was lost in the original]."
Then again it's totally possible that I completely misunderstand your problem and am drinking coffee and saying stuff that is totally irrelevant. In which case, here's a corgi.
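The (a) versus (b) comparison described above can be sketched roughly like this. It assumes the colors come from equal-width buckets between the minimum and maximum value, which may not be how the project actually bins things. The values are invented, `bucket_all` is a hypothetical helper, and the cutoff is borrowed from the "> $7,000,000" figure mentioned in this thread.

```python
def bucket_all(values, n_buckets=5):
    """Assign each value a 0-based equal-width bucket index between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_buckets
    # The maximum value lands exactly on the top edge, so clamp it into the last bucket.
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]

# Hypothetical property values in dollars, NOT the project's real data.
values = [90_000, 120_000, 150_000, 210_000, 320_000, 7_500_000]
CUTOFF = 7_000_000  # assumed threshold for "outlier"

with_outliers = bucket_all(values)                               # option (a)
without_outliers = bucket_all([v for v in values if v <= CUTOFF])  # option (b)

print("(a) buckets used:", sorted(set(with_outliers)))
print("(b) buckets used:", sorted(set(without_outliers)))
```

With these made-up numbers, option (a) puts every non-outlier value in the same bucket, while option (b) spreads them across four buckets, which is the "neighborhoods look too similar" effect in miniature.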
i agree with dave about the corgi.
also, lyzi, here's a graph with a log-scaled x axis.
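A rough sketch of why a log scale helps, assuming a simple histogram: bucket the same values once on a linear scale and once on a log10 scale, then compare the counts per bucket. The values are invented and `bucket_counts` is a hypothetical helper, not anything from the project.

```python
import math

def bucket_counts(values, n_buckets=5, transform=lambda v: v):
    """Count values per equal-width bucket after applying transform to each value."""
    scaled = [transform(v) for v in values]
    lo, hi = min(scaled), max(scaled)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for s in scaled:
        # Clamp so the maximum value falls into the last bucket.
        idx = min(int((s - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return counts

# Hypothetical property values in dollars, NOT the project's real data.
values = [50_000, 120_000, 300_000, 800_000, 2_000_000, 7_500_000]

linear_counts = bucket_counts(values)
log_counts = bucket_counts(values, transform=math.log10)

print("linear:", linear_counts)
print("log10: ", log_counts)
```

On the linear scale most values pile into the first bucket; on the log scale they spread out without any value being dropped, which is the tradeoff against removing outliers outright.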
There are a couple properties that have really insanely high values -- like > $7,000,000. If we pull those values out we might get a better distribution. First I think we should re-geocode the original data to see if we're missing some middling values, but I wanna know what you guys think.
@eeeschwartz @livienyin