Open ellessenne opened 4 years ago
Agree with all of above.
Flagging outliers can be very risky and users need to know what they are getting into.
This relates to the review at #openjournals/joss-reviews/issues/1740.
We chose 2 SD as a conservative threshold for flagging outliers, but agree with both of you that this should be user-supplied. We have now added an extra argument to the function.
The goal of this function is for users to have the capacity to quickly identify potential outliers. It is then up to them, based on their expert knowledge, to verify and correct the data (if applicable). As such, I don't view this as being "risky". We are not being prescriptive about exclusion of observations, we are simply recommending additional verification of data integrity.
Great that you added that option, thanks. Could you include the considerations above in the documentation, so that users know that expert knowledge is still required to fully deal with (potential) outliers?
2 SD from the mean seems a small cutoff for defining outliers to me (roughly 5% of the data will be classified as outliers, assuming the data are approximately normally distributed). Is there any theoretical justification for that?
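As a quick check of that figure, the share of a normal distribution lying more than k standard deviations from the mean can be computed directly. This is only a sketch under the normality assumption; `tail_share()` is a hypothetical helper name, not something from the package.

```r
# Hypothetical helper: two-sided tail share of a normal distribution,
# i.e. the proportion of values more than k SDs away from the mean.
tail_share <- function(k) {
  2 * stats::pnorm(-k)
}

tail_share(2)  # about 0.0455, i.e. roughly 5% of observations
tail_share(3)  # about 0.0027
```

For skewed or heavy-tailed data the flagged share can differ noticeably from these values.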
Regardless, I would suggest: 1- Allowing the user to customise the threshold, with a sensible default. This could be implemented as an extra argument to `flag_outliers()`; 2- Including a robust alternative (e.g. using the median and inter-quartile range instead of the mean and standard deviation), which could also be implemented as an extra argument (e.g. `robust = TRUE/FALSE`).
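To make the two suggestions concrete, here is a minimal sketch of what such an interface might look like. It is not the package's actual `flag_outliers()` implementation: the function name `flag_outliers_sketch()`, the argument names `threshold` and `robust`, and the choice to scale the robust rule by the inter-quartile range are all assumptions made for illustration.

```r
# Hypothetical sketch only; the real flag_outliers() may differ in
# signature and behaviour.
flag_outliers_sketch <- function(x, threshold = 2, robust = FALSE) {
  stopifnot(is.numeric(x), is.numeric(threshold),
            length(threshold) == 1, threshold > 0)
  if (robust) {
    # Robust variant: median and inter-quartile range (assumed rule)
    centre <- stats::median(x, na.rm = TRUE)
    spread <- stats::IQR(x, na.rm = TRUE)
  } else {
    # Classical variant: mean and standard deviation
    centre <- mean(x, na.rm = TRUE)
    spread <- stats::sd(x, na.rm = TRUE)
  }
  # TRUE marks a value for manual verification, not for exclusion;
  # missing inputs stay NA.
  abs(x - centre) > threshold * spread
}

x <- c(10, 11, 9, 10, 12, 11, 10, 50)
flag_outliers_sketch(x)                                # default: 2 SD
flag_outliers_sketch(x, threshold = 3, robust = TRUE)  # median/IQR rule
```

On this toy vector both calls flag only the value 50. With `threshold = 3` and `robust = FALSE` nothing is flagged, because the extreme value inflates the standard deviation itself, which is one argument for offering the robust option.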