florianhartig / DHARMa

Diagnostics for HierArchical Regression Models
http://florianhartig.github.io/DHARMa/

Detection of outliers before implementing binomial test for continuous response variable #398

Open akhileshtayade opened 8 months ago

akhileshtayade commented 8 months ago

Hello Florian!

I have been trying to understand how outliers are detected when the response variable is continuous. I am having a tough time understanding the following line of code from testOutliers, which, I think, is written specifically for detecting outliers in a continuous response variable:

https://github.com/florianhartig/DHARMa/blob/ea2f76e4786bd77f0a05841d6d1b86e3f71df96e/DHARMa/R/tests.R#L195

(For now, it would be great if we could consider only the lower margin, since that would help me immensely in understanding the rationale behind the current implementation.)

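From the discussion at #182, I infer that if we simulate from a model nSim times, the probability that an observed value $y_{i}$ is the smallest among the nSim + 1 values is $\frac{1}{nSim + 1}$ (assuming the model is correct, so that the observation and the simulations come from the same distribution).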
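The $\frac{1}{nSim + 1}$ figure can be checked numerically. The following is a minimal Python sketch, not DHARMa code: the function name `prob_observed_is_minimum` is hypothetical, and it assumes the observation and the simulations are IID draws from one continuous distribution (here uniform), which is what a correctly specified model would imply.

```python
import random


def prob_observed_is_minimum(n_sim, n_trials=20000, seed=1):
    """Estimate P(observation is the smallest of n_sim + 1 IID draws)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Assumption: the observation comes from the same distribution
        # as the simulated values (i.e. the model is correct).
        observed = rng.random()
        simulated = [rng.random() for _ in range(n_sim)]
        if observed < min(simulated):
            hits += 1
    return hits / n_trials


n_sim = 9
estimate = prob_observed_is_minimum(n_sim)
expected = 1 / (n_sim + 1)
print(f"estimated: {estimate:.4f}, expected: {expected:.4f}")
```

By symmetry, each of the nSim + 1 values is equally likely to be the smallest, so the estimate should land close to `expected` up to Monte Carlo error.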

However, I cannot understand why the value of a DHARMa residual is compared to the probability of the residual being the minimum of an IID sample of nSim + 1 values. Since an outlier is an observation with a DHARMa residual equal to 0 (or 1), can't we directly use outliers = sum(simulationOutput$scaledResiduals == 0) for the lower margin? Are there any special cases where the proposed method would fail?

Maybe I am overthinking this. Could it be that the implementation is written this way because it was an appropriate way to detect outliers in DHARMa versions before 0.3.1, where DHARMA.ecdf is used to calculate residuals with the traditional method, and because it works for a standard eCDF as well?

The current implementation works without any problem: for any observation with a DHARMa residual of 0, simulationOutput$scaledResiduals < (1/(simulationOutput$nSim+1)) evaluates to TRUE, so the observation is counted as an outlier. Still, I was curious whether there are other reasons behind the implementation that I did not mention above.
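To illustrate why the two criteria agree for a continuous response, here is a minimal Python sketch. It is not DHARMa's implementation: `scaled_residual` is a hypothetical helper, and it assumes the scaled residual is simply the fraction of the nSim simulated values that lie strictly below the observation. Under that assumption the residual can only take the values 0, 1/nSim, 2/nSim, ..., 1, and the smallest nonzero value 1/nSim is already larger than the threshold 1/(nSim + 1).

```python
import random


def scaled_residual(observed, simulated):
    """Fraction of simulated values strictly below the observation.

    Assumed residual definition for a continuous response (no ties);
    this is an illustration, not DHARMa's own code.
    """
    return sum(s < observed for s in simulated) / len(simulated)


rng = random.Random(42)
n_sim = 250
n_obs = 2000

residuals = []
for _ in range(n_obs):
    simulated = [rng.gauss(0.0, 1.0) for _ in range(n_sim)]
    observed = rng.gauss(0.0, 1.0)
    residuals.append(scaled_residual(observed, simulated))

# Lower-margin outliers counted with the two criteria from this thread.
outliers_exact = sum(r == 0 for r in residuals)
outliers_threshold = sum(r < 1 / (n_sim + 1) for r in residuals)
print(outliers_exact, outliers_threshold)
```

With residuals restricted to multiples of 1/nSim, both counts are identical. The two criteria could only differ if a residual were able to fall strictly between 0 and 1/(nSim + 1), which this assumed definition does not allow.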

Thank you!

florianhartig commented 8 months ago

Hello Akhilesh,

Thanks for the question! I have to admit that I'm also a bit puzzled as to why this was programmed the way it is. Your conjecture that it was introduced because of the old residual definition, where outliers were distributed evenly, seems plausible to me, though.

I will leave this ticket open so that I can check this more thoroughly later.

Best, Florian

akhileshtayade commented 8 months ago

Thank you for your response, Florian!

Sincerely, Akhilesh