brittanyblouin / ANCRTAdjust

An R package to adjust routine HIV testing data from antenatal care to reduce bias in estimating HIV prevalence trends

MIT License

Flagging outliers #10

Open ellessenne opened 4 years ago

ellessenne commented 4 years ago

A cutoff of 2 SD from the mean seems small for defining outliers to me (~5% of the data will be classified as outliers, assuming the data are approximately normal). Is there any theoretical justification for that?
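
As a quick check of the ~5% figure: for normally distributed data, the share of values lying more than 2 SD from the mean is the two-sided tail probability, which base R's `pnorm()` gives directly. Normality is an assumption made here for illustration; nothing in this thread establishes it for the ANC data.

```r
# Two-sided tail probability beyond 2 SD under a standard normal distribution
p_flagged <- 2 * pnorm(-2)
p_flagged  # about 0.0455
```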

Regardless, I would suggest:

1. Allowing the user to customise the threshold, with a sensible default. This could be implemented as an extra argument to flag_outliers().
2. Including a robust alternative (e.g. using median and inter-quartile range instead of mean and standard deviation), which could also be implemented as an extra argument (e.g. robust = TRUE/FALSE).
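
To make the two suggestions concrete, here is a minimal sketch of what such an interface might look like. It is not the ANCRTAdjust implementation: the function name `flag_outliers_sketch`, the argument names `threshold` and `robust`, and the choice to rescale the IQR by 1.349 (so that it approximates the SD for normal data) are all assumptions made for illustration.

```r
# Hypothetical sketch only; NOT the flag_outliers() function from ANCRTAdjust.
# x:         numeric vector of values to screen
# threshold: how many "spread" units from the centre counts as a potential outlier
# robust:    if TRUE, use the median and a rescaled IQR instead of mean and SD
flag_outliers_sketch <- function(x, threshold = 2, robust = FALSE) {
  stopifnot(is.numeric(x), is.numeric(threshold),
            length(threshold) == 1, threshold > 0)
  if (robust) {
    centre <- stats::median(x, na.rm = TRUE)
    # IQR / 1.349 is approximately the SD when the data are normal
    spread <- stats::IQR(x, na.rm = TRUE) / 1.349
  } else {
    centre <- mean(x, na.rm = TRUE)
    spread <- stats::sd(x, na.rm = TRUE)
  }
  # TRUE marks a value for manual verification; nothing is removed
  abs(x - centre) > threshold * spread
}

# Toy data (made up for this example) with one extreme value
x <- c(10, 11, 9, 10, 12, 11, 10, 50)
flag_outliers_sketch(x)                                # default: mean/SD, 2 units
flag_outliers_sketch(x, threshold = 3)                 # stricter cutoff, mean/SD
flag_outliers_sketch(x, threshold = 3, robust = TRUE)  # stricter cutoff, median/IQR
```

With this toy vector, the mean/SD version flags the value 50 at a threshold of 2 but not at 3, because the extreme value inflates the SD itself. The median/IQR version still flags it at 3, which is the kind of behaviour that motivates offering a robust option.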

seabbs commented 4 years ago

Agree with all of above.

Flagging outliers can be very risky and users need to know what they are getting into.

For this review: openjournals/joss-reviews/issues/1740

m-maheu-giroux commented 4 years ago

We chose 2 SD as a conservative threshold for flagging outliers, but we agree with both of you that this should be user-supplied. We have now added an extra argument to the function.

The goal of this function is for users to have the capacity to quickly identify potential outliers. It is then up to them, based on their expert knowledge, to verify and correct the data (if applicable). As such, I don't view this as being "risky". We are not being prescriptive about exclusion of observations, we are simply recommending additional verification of data integrity.

ellessenne commented 4 years ago

Great that you added that option, thanks. Could you include the considerations above in the documentation, so that users know that expert knowledge is still required to fully deal with (potential) outliers?