Closed rempsyc closed 2 years ago
@bwiernik I'm planning on submitting 443 before this is resolved, but just pinging you in case you have some opinion on this that I could integrate before the merge.
Which do you want me to look at?
All four questions above. But we can start small, with no 1.
(You can skip the reprex, it was only a demonstration of the point)
Edit: I misunderstood how IQR works. I fixed the comparison of it below.
So that's:
list(
zscore = stats::qnorm(p = 1 - 0.001),
iqr = 1.7,
ci = 0.999,
cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
pareto = 0.7,
mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
ics = 0.001,
optics = 2 * ncol(x),
iforest = 0.001,
lof = 0.001
)
What do folks think?
3 and 4. That seems like a bug. I think it might be returning logical of if the points are outside the threshold and those are coerced to numbers?
Thanks. For reference, here are the other thresholds in case you'd like me to change any other one:
list(
zscore = stats::qnorm(p = 1 - 0.025),
zscore_robust = stats::qnorm(p = 1 - 0.025),
iqr = 1.5,
ci = 0.95,
eti = 0.95,
hdi = 0.95,
bci = 0.95,
cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
pareto = 0.7,
mahalanobis = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
mahalanobis_robust = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
mcd = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
ics = 0.025,
optics = 2 * ncol(x),
iforest = 0.025,
lof = 0.025
)
Ok I have resolved no 3. It is because the threshold is not used on the distance score as in other methods, but in an earlier step to calculate the distance score (which ends up 0 or 1). Therefore, it is working as intended and not a problem.
That only leaves no 4.
About no 4: I think it is not a bug. I will make a reprex shortly to show that it is calculated like this on purpose. However, it does beg the question of whether it is desirable behaviour and what could be changed.
I edited my comment above at the same time you were responding, so I've made some suggested threshold changes there
I'm not really understanding what computation is happening in 4. Can you walk me through it?
Here is the way IQR distances are calculated. Is this a bug or intentional, and should we change it?
x <- mtcars
threshold = 1.5
method = "tukey"
d <- data.frame(Row = seq_len(nrow(as.data.frame(x))))
# This is the baddie function!
for (col in seq_len(ncol(as.data.frame(x)))) {
v <- x[, col]
if (method == "tukey") {
iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE)
} else {
iqr <- stats::IQR(v, na.rm = TRUE)
}
lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold)
upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold)
d[names(as.data.frame(x))[col]] <- ifelse(v > upper, 1,
ifelse(v < lower, 1, 0)
)
}
head(d)
#> Row mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 1 0 0 0 0 0 0 0 0 0 0 0
#> 2 2 0 0 0 0 0 0 0 0 0 0 0
#> 3 3 0 0 0 0 0 0 0 0 0 0 0
#> 4 4 0 0 0 0 0 0 0 0 0 0 0
#> 5 5 0 0 0 0 0 0 0 0 0 0 0
#> 6 6 0 0 0 0 0 0 0 0 0 0 0
So if we decompose this to understand where 0 and 1 are coming from:
col
#> [1] 11
(v <- x[, col])
#> [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
(iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE))
#> 75%
#> 2
(lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold))
#> 25%
#> -1
(upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold))
#> 75%
#> 7
# This is the culprit: an ifelse with 1 or 0 as outcome!
ifelse(v > upper, 1, ifelse(v < lower, 1, 0))
#> [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Created on 2022-08-24 by the reprex package (v2.0.1)
Same ifelse logic for ci methods.
Oh, I was misunderstanding the computation/threshold for the IQR. The existing threshold of 1.5 is equivalent to 2.7 SDs. 1.7 would be equivalent to 3 SDs. I edited the post above.
Would it make sense to report the actual percentile as the IQR distance score, regardless of whether method = tukey
or not, since only the cut-off values would change in my understanding? And would dividing the rank by the total number of observations be the right formula? E.g.,
(v <- mtcars$carb)
#> [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
threshold = 0.2
Distance_IQR <- rank(v, na.last = "keep")/length(v)
And then we can still identify outliers with the upper and lower boundaries as calculated before:
(iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE))
#> 75%
#> 2
(lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold))
#> 25%
#> 1.6
(upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold))
#> 75%
#> 4.4
Outlier_IQR <- (ifelse(v > upper, 1, ifelse(v < lower, 1, 0)))
out <- data.frame(Distance_IQR, Outlier_IQR)
out
#> Distance_IQR Outlier_IQR
#> 1 0.796875 0
#> 2 0.796875 0
#> 3 0.125000 1
#> 4 0.125000 1
#> 5 0.390625 0
#> 6 0.125000 1
#> 7 0.796875 0
#> 8 0.390625 0
#> 9 0.390625 0
#> 10 0.796875 0
#> 11 0.796875 0
#> 12 0.593750 0
#> 13 0.593750 0
#> 14 0.593750 0
#> 15 0.796875 0
#> 16 0.796875 0
#> 17 0.796875 0
#> 18 0.125000 1
#> 19 0.390625 0
#> 20 0.125000 1
#> 21 0.125000 1
#> 22 0.390625 0
#> 23 0.390625 0
#> 24 0.796875 0
#> 25 0.390625 0
#> 26 0.125000 1
#> 27 0.390625 0
#> 28 0.390625 0
#> 29 0.796875 0
#> 30 0.968750 1
#> 31 1.000000 1
#> 32 0.390625 0
Then for the composite score of all columns, we could take the max of all columns (like zscore, etc.), rather than the average.
Created on 2022-08-25 by the reprex package (v2.0.1)
@rempsyc Could you please add a short text to the NEWS file about the changes? I'm not sure I can appropriately summarise your whole work.
I would prefer to use non-thresholded stats for everything and only use the thresholds when drawing threshold lines on plots or when flagging cases
or when flagging cases
Yup that's the purpose of check_outliers, to identify possible outliers
@bwiernik Would it make statistical sense to report the actual percentile as the IQR distance score? What about the CI distance score?
Yes for both of those
Ok so if everyone is alright with that, I'm going to submit a PR with all our agreed changes, including changing the distance scores to percentiles for IQR and all CI methods.
This is low priority, but just putting out there to free up some space in PR #443.
There are essentially four questions relating to thresholds in
check_outliers
and outlier scores:zscore
methods (threshold = 1.96
) are conservative. I wonder whether we should follow the more conventional 3 deviations instead (conventional according to standards in psychology, that is). I don’t know about the appropriateness of the thresholds of the other methods, however. Perhaps it is intentional to have all methods have rather conservative thresholds to strongly encourage users to mindfully choose their own thresholds? Is that it? Clarification here would be appreciated. And if that's the case, I wonder whether that threshold justification would be worth a note in documentation (maybe not; but for novice users that could be helpful).iqr
threshold is set to 1.96 in the general thresholds definition, but to 1.5 per default in.check_outliers_iqr
. Maybe this is not a big deal but: would it be OK to harmonize the thresholds to the primary definition (e.g., in documentation, but also in.check_outliers_thresholds
) so they are all consistent? I feel like that would be less confusing.iqr
values don’t seem to get bigger than 1 anyway (same for the confidence interval methods)? They are either 0 or 1. Is that normal? I would appreciate clarification on the use of thresholds > 1 if their values indeed cannot be greater than 1. To be clear, I am not referring to the average of all columns for those methods, but when it is applied for each column (reprex below).Created on 2022-08-16 by the reprex package (v2.0.1)