As you will notice from the email transcription above, we cannot resubmit as a LaTeX file and will indeed have to move to a word-processor file, marking changes with a blue font rather than with tracked changes. I will send the link to the Google Doc by email, but we can decide to communicate here if desired.
@bwiernik, our footnote 5 on the Cook method's default threshold reads:
Our default threshold for the Cook method is defined by stats::qf(0.5, ncol(x), nrow(x) - ncol(x)), which again is an approximation of the critical value for p < .001, consistent with the thresholds of our other methods.
Reviewer 2 writes,
Footnote 5: please explain why you use here the value 0.5, this is not clear to me.
I believe you were the one to suggest this threshold. What would you suggest adding to the footnote to answer the reviewer's concern?
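For context, here is what that default computes, on a toy model (illustrative only; I'm taking x to be the model matrix here, and qf(0.5, k, n - k) is simply the median of the F distribution with those degrees of freedom, a conventional reference point for Cook's distance):
model <- lm(mpg ~ disp + hp, data = mtcars)       # toy model, just for illustration
x <- model.matrix(model)                          # "x" as I read it in the footnote
threshold <- qf(0.5, ncol(x), nrow(x) - ncol(x))  # median of F(k, n - k)
which(cooks.distance(model) > threshold)          # observations this rule would flag (possibly none here)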
@strengejacke and @IndrajeetPatil, is there anything from the checklist you would like to tackle/get assigned?
@DominiqueMakowski, Reviewer 2 writes:
I do not find the random questionnaire example convincing, please look out for a better example in section 2.2.
We currently have:
However, in many scenarios, variables of a data set are not independent, and an abnormal observation will impact multiple dimensions. For instance, a participant giving random answers to a questionnaire. In this case, computing the z score for each of the questions might not lead to satisfactory results. Instead, one might want to look at these variables together.
One common approach for this is to compute multivariate distance metrics such as the Mahalanobis distance.
Looking back in the commit history, you were the one to add this example, so I am assigning this point to you.
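In case it helps with the rewrite, here is a minimal sketch of the multivariate idea with two made-up questionnaire items (q1 and q2 are hypothetical; the point is that an answer pattern can be unremarkable on each item separately yet clearly unusual jointly):
set.seed(123)
q1 <- rnorm(100)
q2 <- q1 + rnorm(100, sd = 0.5)            # two correlated items
d <- data.frame(q1, q2)
d[100, ] <- c(2, -2)                       # an unusual combination, not extreme on either item alone
round(abs(scale(d))[100, ], 2)             # per-item z scores for row 100 are only around 2
md <- mahalanobis(d, center = colMeans(d), cov = cov(d))
which(md > qchisq(0.999, df = ncol(d)))    # the Mahalanobis distance flags row 100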
@mattansb, Reviewer 2 writes,
Lines 97-99: What do you mean by t-tests being multivariate? If I consider a one-sample t-test, what is not univariate there? Also, I find the word multivariable weird, should it not be multivariate?
We have:
However, univariate methods can give false positives since t tests and correlations, ultimately, are also models/multivariable statistics. They are in this sense more limited, but we show them nonetheless for educational purposes.
This was based on an early comment from you:
<!-- MSB: t-tests and correlations are model/multivariable statistics, so univariate outlier methods might give false-positives... -->
So I am assigning this point to you.
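(For the response, it might help to show the equivalence concretely: with a toy example, an equal-variance t test and a linear model with a two-level predictor give the same test for the group effect, which is why model-based outlier thinking applies to t tests too.)
set.seed(42)
d <- data.frame(
  group = rep(c("a", "b"), each = 20),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 0.5))
)
t.test(y ~ group, data = d, var.equal = TRUE)  # classic two-sample t test
summary(lm(y ~ group, data = d))               # same t statistic (up to sign) and p value for the group term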
First, I would invite the authors to extend a little bit their introduction in order to underline the problematic ways researchers currently deal with outliers. For example, the authors could briefly introduce a "made-up" or real example of a dataset for which different types of outliers are identified according to different methods and/or the different possibilities in which they could be treated.
This is a great idea. Does anyone have a (raw/uncleaned) cross-sectional dataset they're willing to share? We can build this up and use this also in the examples in the check_outliers() docs.
Does anyone have a (raw/uncleaned) cross-sectional dataset they're willing to share? We can build this up and use this also in the examples in the check_outliers() docs.
I have a couple of open raw datasets on OSF, but most are experimental rather than cross-sectional, so I'm not sure if they would be suitable for what you had in mind, but we could take one of those if there are no better suggestions...
Okay, I cooked up this example that shows the lack of agreement between univariable, multivariable, and model-based methods. I'm sure that if I played with it longer, I could make them overlap even less. But maybe this is enough?
Not sure how to build a legend here that isn't confusing 🤷‍♂️
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(performance)
#> Warning: package 'performance' was built under R version 4.3.2
update_geom_defaults("point", aes(size = 3))
theme_set(theme_bw())
# Data --------------------------------------------------------------------
data <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
        21, 22, 23, 24, 25, 26, 27, 28, 29, 60),
  y = c(-2, 0, 2, 6, 5, 7, 30, 8, 9, 10,
        11, 13, 14, 13, 15, 16, 17, 17, 19, 18,
        21, 23, 21, 24, 24, 26, 27, 30, 27, 61)
)
# Outlier detection -------------------------------------------------------
# Univariate methods
data$univ_outlier <- check_outliers(data, method = c("zscore"))
# Multivariate methods
data$multiv_outlier <- check_outliers(data[, 1:2], method = c("mahalanobis"))
# Model-specific methods
model <- lm(y ~ x, data = data)
data$model_outlier <- check_outliers(model, method = "cook")
# Plot ---------------------------------------
data
#> x y univ_outlier multiv_outlier model_outlier
#> 1 1 -2 FALSE FALSE FALSE
#> 2 2 0 FALSE FALSE FALSE
#> 3 3 2 FALSE FALSE FALSE
#> 4 4 6 FALSE FALSE FALSE
#> 5 5 5 FALSE FALSE FALSE
#> 6 6 7 FALSE FALSE FALSE
#> 7 7 30 FALSE TRUE TRUE
#> 8 8 8 FALSE FALSE FALSE
#> 9 9 9 FALSE FALSE FALSE
#> 10 10 10 FALSE FALSE FALSE
#> 11 11 11 FALSE FALSE FALSE
#> 12 12 13 FALSE FALSE FALSE
#> 13 13 14 FALSE FALSE FALSE
#> 14 14 13 FALSE FALSE FALSE
#> 15 15 15 FALSE FALSE FALSE
#> 16 16 16 FALSE FALSE FALSE
#> 17 17 17 FALSE FALSE FALSE
#> 18 18 17 FALSE FALSE FALSE
#> 19 19 19 FALSE FALSE FALSE
#> 20 20 18 FALSE FALSE FALSE
#> 21 21 21 FALSE FALSE FALSE
#> 22 22 23 FALSE FALSE FALSE
#> 23 23 21 FALSE FALSE FALSE
#> 24 24 24 FALSE FALSE FALSE
#> 25 25 24 FALSE FALSE FALSE
#> 26 26 26 FALSE FALSE FALSE
#> 27 27 27 FALSE FALSE FALSE
#> 28 28 30 FALSE FALSE FALSE
#> 29 29 27 FALSE FALSE FALSE
#> 30 60 61 TRUE TRUE FALSE
data <- data |>
  mutate(
    any_outlier = interaction(model_outlier, multiv_outlier, univ_outlier)
  )
b <- coef(model)
ol_name <- "Outlier Type"
ol_labels <- c("(Not)", "Multivariable or Model", "Multivariable or Univariable")
ggplot(data, aes(x, y)) +
  geom_abline(intercept = b[1], slope = b[2],
              linewidth = 1, color = "royalblue") +
  geom_point(aes(color = any_outlier, shape = any_outlier)) +
  scale_shape(ol_name, labels = ol_labels) +
  scale_color_discrete(ol_name, labels = ol_labels)
Created on 2023-12-19 with reprex v2.0.2
Do we already have a response letter document?
I actually see now that my example is very similar to Figure 4 in the paper. @rempsyc perhaps we can just use that example (or some variation on it)? I can't actually find the code...
Do we already have a response letter document?
We do now! I just sent it by email :)
I actually see now that my example is very similar to Figure 4 in the paper. @rempsyc perhaps we can just use that example (or some variation on it)? I can't actually find the code...
The code is actually just above Figure 4, on the previous page (in the paper and Google Doc), and it is just four lines of code. Because our example was about height and weight, I used a base R dataset that has precisely those variables and just added artificial outliers. That said, although your code is longer, your figure is prettier because of the legend and the geom shapes.
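For reference, the gist was roughly this (reconstructing from memory, so the exact values and method likely differ; I'm assuming the dataset was women, the base R dataset with exactly height and weight):
data <- women                                                                # base R dataset: height and weight
data <- rbind(data, data.frame(height = c(100, 40), weight = c(110, 250)))  # add artificial outliers
check_outliers(data, method = "mahalanobis")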
One issue I have with this reviewer's comment is that, as you point out, we already do this comparison in the relevant section (Cook’s Distance vs. MCD), after explaining the methods. I feel like going into an extensive method comparison at the very beginning before having introduced the methods would be a bit out of order.
I guess he just wants an example of a clearly wrong but common approach to outlier detection. I think it would be mostly to support our assertion that researchers treat outliers with incorrect strategies:
Yet, despite the existence of established recommendations and guidelines, many researchers still do not treat outliers in a consistent manner, or do so using inappropriate strategies
So we could give an example of a researcher who uses the commonly used ±3 SD rule, show how it identified an outlier when it shouldn't have, and how it missed an actual outlier.
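Something like this minimal sketch could work for the "missed an actual outlier" part (made-up numbers; with only ten observations, the extreme value inflates the SD so much that the ±3 SD rule cannot flag it, while a MAD-based robust z score does):
x <- c(2, 3, 4, 4, 5, 5, 6, 6, 7, 500)
z <- (x - mean(x)) / sd(x)            # classic z scores
z_robust <- (x - median(x)) / mad(x)  # robust z scores (median/MAD)
round(z, 2)                           # the 500 only reaches about |z| = 2.8, so +/- 3 SD misses it
round(z_robust, 2)                    # the robust score is in the hundreds, clearly flagged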
But how much overlap should there be with the height and weight example? Should we swap them places? Should we only use code without a figure? If we do swap them and include the figure, perhaps in the Cook’s Distance vs. MCD section we could simply refer back to the example from the intro? I started a short paragraph draft in the paper to get us thinking.
Yet, despite the existence of established recommendations and guidelines, many researchers still do not treat outliers in a consistent manner, or do so using inappropriate strategies
This doesn't mean that any method is wrong per se. I might be biased, but (as I made clear in my first pass on the draft) all these methods should be treated as merely suggestive since, objectively, there generally isn't a ground truth (which is also why I personally prefer non-automated, knowledge-based outlier inspection/rejection).
Thus, different methods can be judged by their usefulness to do ... something.
But of course the data is the data: in real heavy-tailed distributions, especially in small samples, all of these methods can end up falsely flagging values that are actually representative (which IMO is the point of outlier detection).
Here is a random sample from a true DGP (data-generating process) of $y \sim Cauchy(x, 1)$ in which all methods flag the same observation.
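(Something along these lines reproduces the idea; this is a sketch rather than the exact code behind my figure, and whether the three methods agree on the flagged observation depends on the random draw.)
set.seed(1)
x <- rnorm(30)
y <- rcauchy(30, location = x, scale = 1)   # true DGP: y ~ Cauchy(x, 1)
d <- data.frame(x, y)
d$univ <- check_outliers(d[, c("x", "y")], method = "zscore")
d$multiv <- check_outliers(d[, c("x", "y")], method = "mahalanobis")
d$model <- check_outliers(lm(y ~ x, data = d), method = "cook")
d[as.logical(d$univ) | as.logical(d$multiv) | as.logical(d$model), ]  # rows flagged by at least one method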
So maybe we can have a paragraph about this general idea (the points above): that applying outlier detection methods automatically, without thinking about their usefulness and what they're designed for, is the bad practice. We can then add my figure or your figure to illustrate the point. I think this will also correspond well with the first paragraph of the "Handling Outliers" section.
WDYT?
Woooaw, @mattansb your new Figure 1 in the paper is amazing!!!! Should be in a textbook! But this outlet is good too ;)
The caption is long but very good, I think... It is quite detailed for something coming in the third paragraph of the paper (with all the thresholds, etc.), but at the same time I think it sets up the rest of the paper, and this is exactly what Reviewer 1 asked for.
I think I first wrote the paper with the Leys/Lakens papers in mind, which have strong titles like "Do not use standard deviation around the mean, use absolute deviation around the median" and which include statements such as (in the abstract) "this method is problematic."
Now, we might decide to soften the tone of the paper to clarify that no method is wrong per se, and instead invite researchers to be more mindful of the selected method.
So maybe we can have a paragraph about this general idea (the points above): that applying outlier detection methods automatically, without thinking about their usefulness and what they're designed for, is the bad practice. We can then add my figure or your figure to illustrate the point. I think this will also correspond well with the first paragraph of the "Handling Outliers" section.
I thought we already kind of did this, but after rereading the paper, it seems we don't! I think it is important that this paper captures all (or most) of your thoughts/feelings about outliers, since it might become a reference, so let's do it. If you want, we could make this its own section (you suggested placing it before the Handling Outliers section), and you could even include your Cauchy code example (if you find it useful). You will see that, for now, I've added a temporary section to the paper called "Are Some Methods “Wrong”?"; feel free to improve it :)
@DominiqueMakowski do you think you'll be able to tackle Reviewer 1's comment about Bayesian stats soon? I'm hoping to resubmit the paper by the end of January. Let me know your timeline and if you think this could be possible.
I'm quite swamped right now, but I can look into finding a better example than the questionnaire one. For the Bayesian question, I'm not sure what the reviewer is talking about; I need to read this Ciccione (2023) first. I'll add that to my to-do list.
Reviewer 2 comments,
In Figure 1 I find it weird to see an aggregate score, please explain this better.
Here's my attempt to explain the aggregate score as seen in the figure (now Figure 2):
Note. The distance represents an aggregate score for variables mpg, cyl, disp, and hp. In this case, the aggregate score represents a given participant’s (1-34) highest robust z score among the tested variables. The resulting unique value (representing one of mpg, cyl, disp, or hp for that participant) is then rescaled to a range of 0 to 1 by dividing by the value of the participant with the highest score.
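In case it helps the explanation, here is the computation in code form, as a minimal sketch (using mtcars since those are its variable names; this is a reconstruction, not necessarily the exact code behind the figure):
d <- mtcars[, c("mpg", "cyl", "disp", "hp")]
rz <- sapply(d, function(v) abs(v - median(v)) / mad(v))  # robust z score for each variable
score <- apply(rz, 1, max)                                # highest robust z score per observation
round(score / max(score), 2)                              # rescaled to a 0-1 range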
Maybe it is the "aggregate" term that is confusing. We could rename it to something like "Highest deviation per participant", because it's not really aggregating but rather showing the most extreme value.
Ok, congrats all, we've managed to address almost all issues raised by the reviewers 🥳 The only things left are the two points assigned to Dom 😛 We'll be able to resubmit as soon as Dom gets to them.
Well done! Can you confirm where the latest version is so that I can take a stab at it?
Just sent you the email with Google doc link again ;)
I wrote something for the second issue, but the first one might require adding a more general paragraph on regularization, if I'm understanding correctly (cf. my comment in the answers Google Doc).
Congrats team, we've addressed all points 😙 (thanks Dom for this last sprint!). @strengejacke, would you like to review the response to reviewers? With your blessing (and perhaps of the paper as well), I can then submit on our behalf 🤓