Watts-College / paf-510-template

https://watts-college.github.io/paf-510-template/

Approaches to outliers? #53

Closed — jslandes closed this issue 1 month ago

jslandes commented 2 months ago

Reading through the lecture notes on variance and covariance, I noticed there are a few pages on how outliers (especially extreme ones) affect covariance. I assume the same issue extends to exaggerating or masking relationships. I scanned the rest of the lectures for Module 2 but didn't see anything about how to mitigate the impact of outliers in real projects.

Questions I have:

Would love some recommended reading/resources from trusted sources. A simple web search just gets bogged down with AI links, so I have no idea where to find trustworthy introductory info on this.

matillm2 commented 2 months ago

I'm interested in the answers to this as well. For the examples in the lecture, such as the outlier examples, is there somewhere the "answers" are given? There are several pages of "how about now?" and I'd like to see how all those scenarios work out.

nickmcmullen commented 2 months ago

@jslandes @matillm2 Great question. In more advanced applications and studies, it may be appropriate to take a systematic approach to mitigating the impact of outliers depending on what you're trying to do. If doing so, it's important to define exactly the approach you took and the logic behind it so the audience reading the analysis has a thorough understanding of the methodology.

We also want to make sure we aren't over-engineering an outcome by removing outliers. Outliers in and of themselves tell a story about behavior we see in the data, so another practical approach would be to simply run your models without any sort of outlier mitigation, acknowledge specific cases that are outliers and why you think they are outliers, and describe how they may be impacting the analysis and conclusions.

There is no one perfect solution out there for this. In many ways, that's what makes regression modeling such a fun, imperfect science!

jslandes commented 2 months ago

@nickmcmullen Thanks!

@matillm2 If you're interested in any reading, I found some key phrases for boolean searches: "outlier detection" and "outlier treatment." The World Bank had some good introductory slides, but they have a univariate focus instead of bivariate like we've been working on. I still thought it was a good introduction to thinking about outliers: https://thedocs.worldbank.org/en/doc/20f02031de132cc3d76b91b5ed8737d0-0050012017/related/lecture-12-1.pdf

lecy commented 2 months ago

Outliers are lumped in with the “specification bias” lecture because, as Prof. McMullen points out, mitigation is rarely about just dropping inconvenient data points. It’s more common to run the models with and without the outliers and present both together, quantifying the impact of the outliers on the estimates rather than simply assuming they don’t belong there.

Interestingly, the size of an outlier often matters less than where it sits relative to the “fulcrum” of the regression line (the point where the mean of X and the mean of Y meet). Depending on its location, an outlier can distort just the standard errors, or both the slope and the standard errors. It can also inflate impact (make the slope larger) or mask impact (make the slope look smaller than it really is). See slides 16-17.

https://github.com/DS4PS/cpp-523-sum-2021/raw/main/lectures/p-09-specification.pdf
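To make the fulcrum idea concrete, here is a small hand-rolled sketch (the data are invented for illustration, not taken from the slides): the OLS slope for a simple regression, computed first on clean data, then with an outlier sitting directly above the fulcrum (zero leverage, so the slope is untouched even though residual variance balloons), and finally with a high-leverage outlier in the bottom-right quadrant that drags the slope toward zero.

```python
# Toy illustration (invented data, not from the slides): how an outlier's
# location relative to the "fulcrum" (mean of X, mean of Y) determines
# what it distorts.

def slope(xs, ys):
    """OLS slope b1 for the simple regression y = b0 + b1*x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

# Clean data with a true slope of exactly 2
x = list(range(10))
y = [2 * xi for xi in x]
b_clean = slope(x, y)                    # essentially 2.0

# Outlier directly above the fulcrum (at the mean of X): it adds residual
# variance (hurting the standard errors) but has zero leverage, so the
# slope is essentially untouched.
b_vertical = slope(x + [4.5], y + [50])  # still essentially 2.0

# High-leverage outlier far to the bottom-right: it drags the slope
# toward zero, masking the true relationship.
b_leverage = slope(x + [20], y + [0])    # attenuated to roughly 0.13
```

The same data point (a residual of the same size) is harmless in one location and devastating in another, which is exactly why location relative to the fulcrum, not raw size, is the thing to reason about.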

Economists like using a modified standard error called a “robust standard error” that inversely weights the impact of each data point on the standard error relative to its distance from the mean (outliers are weighted less and thus contribute less to the standard error, minimizing their impact). Robust standard errors do nothing to correct the slopes, though. So depending on the nature of the bias, you may just end up with more confidence in the wrong answer: the bias in the slope hasn’t been fixed, but the standard errors are smaller, meaning you are more likely to achieve statistical significance.

I 100% agree with Prof. McMullen that there is no perfect solution, which is what makes regression fun in the sense that it takes practice to do it really well. In general, I would avoid overly prescriptive approaches to specification: “if you encounter problem A, then always use technique Z.”

Instead, approach it reductively: is there any other way I can explain coefficient b1 besides the actual impact that X is having on Y? If you have a few outliers, can you work through the logic on slides 16-17 to reason through whether they could be inflating your impact and inducing false confidence in the program, or whether they are more likely producing models that undersell the true impact?

I am generally skeptical of regression guides that suggest the fix to every regression problem is a more complicated model. It’s like saying the way you fix a bad driver is by giving them a faster car.

I am WAY more persuaded by an analyst who says: we’ve identified some outliers that predominantly cluster in the bottom-right quadrant, so we expect them to have the following effects on our model... In fact, when we re-run the model with the outliers omitted, we see that the slope is larger and the standard errors have decreased.

Versus what I see far more often: “Due to concerns about outliers, the table contains results from a model that uses M-estimation and robust standard errors after winsorizing the data.” The latter shows less understanding of the specific problem being addressed (whether we should be more worried about slope bias or about standard errors, given the type of outlier) and whether the chosen solutions target that problem at all.