florianhartig / DHARMa

Diagnostics for HierArchical Regression Models
http://florianhartig.github.io/DHARMa/

Outlier test and identification #171

Open klattbk opened 4 years ago

klattbk commented 4 years ago

Hi Florian,

I have two questions concerning the outlier test in DHARMa. First: can it be used to identify outliers (e.g. data points with strong influence) similarly to Cook's Distance, so that a non-significant test states that there is no significant influence of outliers? Second: how do I find out which point in the data set is the outlier? Can I get something like a line number?

Thanks and best, Björn

florianhartig commented 4 years ago

Hi Björn,

> can it be used to identify outliers (e.g. data points with strong influence) similarly to Cook's Distance, so that a non-significant test states that there is no significant influence of outliers?

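No. First of all, note that "outlier" in the context of DHARMa only means that the observation lies outside the simulation envelope, i.e. below or above all values simulated for it. It tells us nothing about how far outside the point is, and it is not a measure of influence on the fit.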

> How do I get to know which point in the data set is creating the outlier, can I get something like a line number or similar?

Outliers have residual values of exactly 0 or 1, so you can do

library(DHARMa)
sim <- simulateResiduals(fittedModel)
# indices of observations that fall outside the simulation envelope
which(residuals(sim) == 1 | residuals(sim) == 0)
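To make "outside the simulation envelope" concrete, here is a minimal base-R sketch that mimics the idea by hand and maps the flagged indices back to rows of the data. It is not DHARMa's implementation: the data frame `dat`, the planted value in row 17, and the use of `stats::simulate()` are all assumptions made for illustration.

```r
# Hand-rolled illustration of "outside the simulation envelope" (not DHARMa's code).
set.seed(1)

# Hypothetical example data: Poisson counts with one planted extreme value.
dat <- data.frame(x = runif(200))
dat$y <- rpois(200, lambda = exp(0.5 + dat$x))
dat$y[17] <- 60L

fit <- glm(y ~ x, family = poisson, data = dat)

# 250 simulated response vectors from the fitted model (one column each).
sims <- as.matrix(simulate(fit, nsim = 250))

# An observation is outside the envelope if it is below or above
# every value simulated for that observation.
below <- dat$y < apply(sims, 1, min)
above <- dat$y > apply(sims, 1, max)

# Row numbers of the flagged observations, and the corresponding data.
outlier_rows <- which(below | above)
dat[outlier_rows, ]
```

The indices returned by `which()` are row positions in the data used to fit the model, so they can be used directly to subset the data frame, provided no rows were dropped during fitting (for example because of missing values).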

That being said, it's perfectly normal to get a certain number of outliers if n, the number of simulations, is small. If you have 1000 data points and run 250 simulations for a Poisson model, you would expect about 4 outliers on average. If you want to remove data points, you should at least increase n to more than 5 times the number of data points. That reduces the number of "random" outliers, so that you only flag points that are really outside the distribution.
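The chance rate of such "random" outliers can be worked out with a little arithmetic. The sketch below assumes a correctly specified model with a continuous response, in which case an observation is the largest (or smallest) of itself plus n simulations with probability 1 / (n + 1) per tail. Under that assumption the figure of about 4 matches the one-tail count, and counting both tails gives about 8. For discrete responses such as a Poisson, ties under a strict comparison can only lower this rate. The function name `expected_outliers` is hypothetical.

```r
# Expected number of simulation outliers that arise purely by chance.
# Assumption: continuous response and correctly specified model, so each
# tail has probability 1 / (n_sim + 1).
expected_outliers <- function(n_obs, n_sim, sides = 2) {
  n_obs * sides / (n_sim + 1)
}

expected_outliers(1000, 250, sides = 1)  # one tail: about 3.98
expected_outliers(1000, 250)             # both tails: about 7.97
expected_outliers(1000, 5000)            # n_sim = 5 * n_obs: about 0.4

# Quick Monte Carlo check of the one-tail probability for a normal response.
set.seed(42)
n_sim <- 250
hits <- replicate(20000, {
  draws <- rnorm(n_sim + 1)        # first value plays the role of the observation
  draws[1] > max(draws[-1])        # TRUE if it exceeds all simulated values
})
mean(hits)                         # should be close to 1 / 251
```

With the number of simulations at five times the data size, fewer than one chance outlier is expected in total, which is why points flagged at that setting are more likely to be genuinely outside the fitted distribution.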

Note also that, for a reasonably continuous distribution, a point could be very far in the tail of the distribution without being classified as an "outlier" by DHARMa (according to the standard outlier definition). If you want to change this, you could also define

sim <- simulateResiduals(fittedModel)
which(residuals(sim) > 0.99 | residuals(sim) < 0.01)

or something like that.
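The difference between the strict definition and a quantile cutoff can be illustrated without DHARMa. The sketch below computes a simplified scaled residual by hand, as the share of simulated values below the observation. It ignores the randomisation that is needed for discrete responses, and the data, the planted value in position 5, and the variable names are all hypothetical.

```r
set.seed(2)
n_obs <- 300
n_sim <- 2000

# Hypothetical setup: observations and simulations from the same normal model,
# with one observation placed far in the upper tail.
obs <- rnorm(n_obs)
obs[5] <- 2.8
sims <- matrix(rnorm(n_obs * n_sim), nrow = n_obs)  # one row per observation

# Simplified scaled residual: share of simulated values below the observation.
# (obs is recycled down the columns, so row i is compared with obs[i].)
scaled <- rowMeans(sims < obs)

# Strict definition: outside the envelope of all simulations.
strict <- which(scaled == 0 | scaled == 1)

# Relaxed definition: in the outer 1% on either side.
relaxed <- which(scaled < 0.01 | scaled > 0.99)

scaled[5]
strict
relaxed
```

A point like observation 5 sits well into the tail, yet with this many simulations some simulated values will usually exceed it, so it tends to be caught only by the relaxed cutoff. Note that with a 1% cutoff on each side, roughly 2% of points from a perfectly fitting model are flagged by construction.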

In general, I think a measure of influence on the model fit would be more sensible for identifying problematic points. Maybe you could check whether there are generalisations of Cook's distance or similar for the model you are fitting, see e.g. Pinho, L. G. B., Nobre, J. S., & Singer, J. M. (2015). Cook’s distance for generalized linear mixed models. Computational Statistics & Data Analysis, 82, 126–136. doi:10.1016/j.csda.2014.08.008
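For a plain GLM without random effects, base R already provides Cook's distance through `cooks.distance()`. The following is a minimal sketch under that assumption. The data, the planted point in row 100, and the 4 / n cutoff (a common rule of thumb, not a formal test) are illustrative choices, and the approach does not carry over to mixed models as is.

```r
set.seed(3)

# Hypothetical data: Poisson counts plus one high-leverage, badly fitting point.
dat <- data.frame(x = runif(100))
dat$y <- rpois(100, lambda = exp(0.5 + dat$x))
dat$x[100] <- 3
dat$y[100] <- 0L

fit <- glm(y ~ x, family = poisson, data = dat)

# Cook's distance for each observation of the fitted GLM.
cd <- cooks.distance(fit)

# 4 / n is a common rule-of-thumb cutoff, not a formal test.
influential <- which(cd > 4 / nrow(dat))
dat[influential, ]
```

Unlike the envelope check, this measures how much the fitted coefficients change when a point is left out, which is closer to what "strong influence" means in the original question.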

florianhartig commented 4 years ago

I have added a function to return outliers to the development version of DHARMa. In the help, I give a few more hints about the things that I discuss here.