Outlier detection - Githubissues

pauleanderson commented 13 years ago

In the PCA window there are options for various plots (e.g., scatter plot, 1 SE, 1 SD, and 2 SD). Isaie and I are reviewing data for 332 samples and searching for possible outliers. We are viewing a scatter plot with sample IDs shown, and also viewing the Mean ± 2 SD plots. We are looking for data points that lie outside the 2 SD limits.

It would be nice to have another plot option: Outlier Plot. This would show a PCA plot with the Mean ± 2 SD boundary and, additionally, label any data points that lie at or beyond the 2 SD limit with their sample IDs. There may be other methods for determining outliers that we could consider using here. I think there is something called a Gibbs test for outliers?? Don't remember exactly.
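A minimal sketch of the flagging step behind such an Outlier Plot, in Python for illustration only. It assumes the 2 SD boundary is an ellipse in standardized score units, which is reasonable for PCA scores because they are uncorrelated by construction; how the toolbox actually draws its 2 SD boundary is not stated in this thread. The function name `flag_outliers_2sd` and its parameters are hypothetical.

```python
from statistics import mean, stdev


def flag_outliers_2sd(scores, labels, n_sd=2.0):
    """Return labels of samples lying at or beyond n_sd standardized units.

    scores: list of (pc1, pc2, ...) tuples of PCA scores.
    labels: sample IDs in the same order as scores.
    Assumption: the boundary is an ellipse with semi-axes n_sd * SD per axis.
    """
    columns = list(zip(*scores))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]  # sample SD; must be non-zero
    flagged = []
    for label, point in zip(labels, scores):
        # Squared standardized distance from the mean of the scores.
        d2 = sum(((v - c) / s) ** 2 for v, c, s in zip(point, centers, spreads))
        if d2 ** 0.5 >= n_sd:
            flagged.append(label)
    return flagged


scores = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
          (1, 1), (-1, -1), (1, -1), (-1, 1), (6, 6)]
labels = ["S%d" % i for i in range(1, 11)]
print(flag_outliers_2sd(scores, labels))
```

The flagged IDs would then be the only points drawn with labels. Note that an extreme point inflates the SD it is measured against, so a fixed 2 SD rule can hide outliers as well as find them.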

RadixSeven commented 13 years ago

I believe you mean Grubbs' test. Note, however, that it assumes the underlying distribution is Gaussian.

RadixSeven commented 13 years ago

Checking Wikipedia, Grubbs' test has several more complications:

1) It is a univariate test. Our domain is multivariate.

2) It only detects one outlier. However, there are ways of getting it to find more than one.

3) It also has problems with small sample sizes.
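For reference, the standard two-sided form of the statistic, written out as a sketch. The symbols are the usual textbook ones, not names from the toolbox: N is the number of samples, s the sample standard deviation, and t the upper critical value of the t distribution with N - 2 degrees of freedom at significance level alpha / (2N).

```latex
G = \frac{\max_{i} \lvert Y_i - \bar{Y} \rvert}{s},
\qquad
\text{reject } H_0 \text{ (no outliers) if}\quad
G > \frac{N-1}{\sqrt{N}}
    \sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N - 2 + t_{\alpha/(2N),\,N-2}^{2}}}
```

All three points above are visible here: there is a single variable Y, only the single most extreme value is tested, and the critical value depends directly on N.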

DaManDOH commented 13 years ago

Alternative: Generalized extreme Studentized deviate test: http://itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm Turned up after a quick Google search. Can't claim to know anything about it yet. Digesting...

Additionally, regardless of the test, couldn't we just run it once on each dimension of our data and then pick a failure threshold -- dynamically or fixed -- to cut the data point loose? And if we're talking about detecting outliers in already PCA'd data, aren't we reducing dimensionality before we plot? Seems if none of these tests can detect outliers in 2D and 3D data, they aren't of much use.
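The per-dimension idea can be sketched as follows, in Python for illustration only. A plain |z| >= 2 check stands in for whichever univariate test is eventually chosen, and the names `per_axis_failures`, `flag_by_axis_count`, and `min_failures` are hypothetical.

```python
from statistics import mean, stdev


def per_axis_failures(scores, n_sd=2.0):
    """For each sample, count the axes on which |z-score| >= n_sd."""
    columns = list(zip(*scores))
    stats = [(mean(col), stdev(col)) for col in columns]  # SDs must be non-zero
    return [
        sum(abs(v - m) / s >= n_sd for v, (m, s) in zip(row, stats))
        for row in scores
    ]


def flag_by_axis_count(scores, min_failures=1, n_sd=2.0):
    """Indices of samples that fail the univariate check on >= min_failures axes."""
    counts = per_axis_failures(scores, n_sd)
    return [i for i, k in enumerate(counts) if k >= min_failures]


# Ten samples in a 3D score space; the last one is far out on the first two axes.
scores3d = [(0, 0, 0.1), (1, 0, -0.1), (-1, 0, 0.2), (0, 1, -0.2), (0, -1, 0.0),
            (1, 1, 0.1), (-1, -1, -0.1), (1, -1, 0.2), (-1, 1, -0.2), (6, 6, 0.0)]
print(per_axis_failures(scores3d))
print(flag_by_axis_count(scores3d, min_failures=1))
```

One caveat with this design: a sample that sits just inside the threshold on every axis passes each univariate check even though its combined distance from the center is large, so the choice of `min_failures` matters.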

RadixSeven commented 13 years ago

Outlier elimination is a dangerous business. Sometimes it is a good idea, but it is something that should be done with great care. Outlier removal is a bias-variance tradeoff move. By eliminating outliers, you reduce the variance of the models generated. However, you also bias the models toward whatever distribution underlies your outlier detection criterion. If that distribution is really there, the elimination is a good move. But if you do not know that such a distribution underlies things, you have a problem.

Outlier elimination is best when there are reasons to believe that some points did not come from the population under study. For example, if you learn that one of your study participants had 4 hearts and was traveling the universe in a telephone booth, it would pay to seek out outliers. Or suppose you know that your measurement techniques are unreliable and sometimes generate large deviations from what they are supposed to measure. In either case, applying that knowledge to your model increases its ability to find the pattern while discarding the noise. But with the same act you also make it more likely that the model will find a pattern where there is only noise, which is a great danger in the high-dimensional spaces we work in.

If you are going to consider outliers, the safest way to deal with them is manually. First, you use some automatic outlier detection framework. Then, for each outlying point, you examine its provenance to try to understand what could have caused it to be so different. If you have no reason to throw it out, you keep it.

DaManDOH commented 13 years ago

lol Sorry... meant to just comment and hit the wrong button. ;{)>

So... is that an endorsement or rejection of the generalized ESD test as a univariate, "thresholded" outlier detector?

Whether or not we want to allow the user to ignore detected outliers, we still need an algorithm to produce an initial list/plot/weighted graph, do we not? In fact, we could parameterize the generalized ESD test to look for outliers up to a max percentage of the given datapoints. Additionally, wouldn't "dealing with [outliers] manually" defeat the purpose of an outlier plot for anything except the most rudimentary of documentation purposes?
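A rough sketch of that parameterization, in Python with NumPy and SciPy for illustration only. It follows the generalized ESD procedure as described on the NIST handbook page linked earlier; the function name `generalized_esd` and the `max_fraction` parameter are hypothetical, and the data are assumed to be univariate and approximately Gaussian.

```python
import numpy as np
from scipy import stats


def generalized_esd(values, max_fraction=0.1, alpha=0.05):
    """Return indices of detected outliers, most extreme first.

    max_fraction caps the number of candidate outliers as a share of the data.
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    # Need at least 1 degree of freedom for the t quantile: i <= n - 2.
    max_outliers = max(0, min(int(max_fraction * n), n - 2))
    remaining = np.arange(n)
    removed = []
    n_outliers = 0
    for i in range(1, max_outliers + 1):
        current = x[remaining]
        deviations = np.abs(current - current.mean())
        worst = int(np.argmax(deviations))
        r_i = deviations[worst] / current.std(ddof=1)  # test statistic R_i
        # Critical value lambda_i from the t distribution.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1))
        removed.append(int(remaining[worst]))
        remaining = np.delete(remaining, worst)
        if r_i > lam:
            n_outliers = i  # largest i with R_i > lambda_i wins
    return removed[:n_outliers]


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 25.0]
print(generalized_esd(data, max_fraction=0.3))
```

The returned indices could feed an initial list or plot, with the user left to decide what to do with each flagged point.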

As Paul mentioned at the head of this thread, we're looking to plot this stuff in 3D PCA space. So before we even get to the outlier detection we've already reduced dimensionality after trying to preserve as much variance as we can. Thus, we've already introduced some bias, correct?

I fully admit, I could be misunderstanding Paul's initial feature request...

RadixSeven commented 13 years ago

Sorry, the misunderstanding was more mine. I had gotten distracted from the feature request portion and was reacting to the first part of what Paul wrote -- detecting outliers in the specific context of the current study with Isaie.

Outlier detection is a good idea as a feature, and we should probably provide a default method and the option to try a few different methods. Outlier detection is like a chainsaw. As tool-makers, we should build it and make it powerful, because when you need it, nothing else can do the job quite as well. As tool users, we should recognize that even when we do our best to use it properly, there is still great potential to get hurt. I'm sorry I got distracted above from my role as tool-maker.

In implementing it, we do need to ensure there is a way to drill down from any detected outliers back to their original data.

Re: "dealing with outliers manually": the entire reason one makes an outlier plot is to deal with them manually. Automatic outlier culling does not require a plot; you just say "remove the outliers" and they're gone, no interaction required.

Re: detecting in only the most significant PCA dimensions adding bias: yes, it does add bias. We may want to give the user the option of detecting in both the original space and the reduced space and then plotting the detected outliers in the reduced space. However, I don't know how well the outlier algorithms perform in high-dimensional spaces, so this option may not really be feasible.

DaManDOH commented 13 years ago

Groovy. Let me look into implementing a generalized ESD test algorithm.

BiRG / Metabolomics-Analysis-Toolbox

Outlier detection #14