AntoineSoetewey / statsandr

A blog on statistics and R aiming at helping academics and professionals working with data to grasp important concepts in statistics and to apply them in R. See www.statsandr.com

blog/outliers-detection-in-r/ #38

Closed utterances-bot closed 3 years ago

utterances-bot commented 3 years ago

Outliers detection in R - Stats and R

Learn how to detect outliers in R thanks to descriptive statistics and via the Hampel filter, the Grubbs, the Dixon and the Rosner tests for outliers

https://statsandr.com/blog/outliers-detection-in-r/

AntoineSoetewey commented 3 years ago

"Comment written by Felix Kluxen on August 17, 2020 09:27:12:

Dear Antoine,

thank you for this helpful post.

Just my two cents: I think it sometimes makes sense to formally distinguish two classes of outliers: extreme values and mistakes. Extreme values are statistically and philosophically more interesting, because they are possible but unlikely responses -- such as in your height example. Hawkins considers outliers as values that deviate so much from other observations one might suppose a different underlying sampling mechanism - which is another interesting take on this.

Cheers, Felix

Hawkins, D. M., 1980. Identification of outliers. Chapman and Hall, London ; New York."

Comment written by Antoine Soetewey on August 17, 2020 10:32:36:

Dear Felix,

Thanks for your comment. The article has been updated accordingly (see the first and fourth paragraphs of the introduction). Feel free to let me know if there is any inconsistency.

Regards,
Antoine

Comment written by Felix Kluxen on August 17, 2020 11:30:30:

Excellent! The elephant in the room with statistically identified outliers (here, values that are probably not mistakes) is obviously that you cannot solve the issue of what researchers should do with the information, as you write. This really depends on the research question (e.g. subsets, responder/non-responder, etc.) and usually involves a surprising amount of reflection on the researcher's side... or the willingness to think the model assumptions through. If a statistical test result relies on a single influential value, this should caution the researcher against making overambitious claims.

Cheers, Felix
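
One way to check whether a result relies on a single influential value is to rerun the analysis without it and compare. The following is only a minimal sketch in base R: the data are made up for illustration, and the choice of `cor.test()` and of dropping the tenth observation are assumptions, not something taken from the article.

```r
# Hypothetical data: nine unremarkable points plus one extreme point (the 10th)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
y <- c(5, 3, 6, 4, 5, 3, 6, 4, 5, 30)

# Correlation test on all observations
p_all <- cor.test(x, y)$p.value

# Same test after dropping the single extreme observation
p_without <- cor.test(x[-10], y[-10])$p.value

# Compare the two p-values side by side
c(with_extreme = p_all, without_extreme = p_without)
```

If the conclusion changes once the point is dropped, as it does with these made-up numbers, the result hinges on that one observation and should be reported with caution.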

Comment written by Antoine Soetewey on August 17, 2020 12:15:18:

You're totally right, outliers require thoughtful reflection and caution for many statistical analyses!

DLAtatem commented 3 years ago

Dear Antoine, this is very helpful indeed. I just found a key to detecting outliers formally for my project, thanks to this write-up. Many thanks, Duncan

AntoineSoetewey commented 3 years ago

Glad you find it useful!

DLAtatem commented 3 years ago

Hi Antoine, It's been. Actually, I am looking for more on winsorizing outliers in R, that is, replacing them rather than deleting them. Any guidance would be very helpful. Kind regards

AntoineSoetewey commented 3 years ago

Comment written by vijayarajamanickam on December 03, 2020 12:26:17:

Dear Antoine,

I tried to detect outliers using this script:

out <- boxplot.stats(dat$hwy)$out
out_ind <- which(dat$hwy %in% c(out))
out_ind

In most cases it works well, but in some cases it returns integer(0).
Could you please help me with this?

Many thanks
vijay

Comment written by Antoine Soetewey on December 03, 2020 18:00:30:

Dear,

When you have the result:
integer(0)

it simply means that there is no outlier according to this method.

If you run boxplot(dat$hwy), you will see that there are no potential outliers as defined by this method.

Hope this helps.

Regards,
Antoine
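
To illustrate the point above, here is a small sketch in base R. The vectors `x` and `y` are made up for this example (they are not the `dat$hwy` data from the question): the first has no value beyond the boxplot fences, the second has one.

```r
# Hypothetical data without any value beyond the boxplot fences
x <- c(10, 12, 11, 13, 12, 11, 10, 12)
out_x <- boxplot.stats(x)$out
out_ind_x <- which(x %in% out_x)
out_ind_x          # empty result: no potential outlier according to this method

# Same data with one extreme value appended
y <- c(x, 40)
out_y <- boxplot.stats(y)$out
out_ind_y <- which(y %in% out_y)
out_ind_y          # position of the extreme value in y

# A simple guard before using the indices further
if (length(out_ind_x) == 0) {
  message("No potential outliers detected in x")
}
```

Checking `length(out_ind) == 0` before subsetting or reporting is a convenient way to handle variables for which the method flags nothing.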

AntoineSoetewey commented 3 years ago

If you do not want to simply remove outliers, you can indeed use "Winsorization" which is a technique to replace extreme data values with less extreme values.

See for instance the Winsorize() function in R, or this article.

Hope this helps.

Regards, Antoine
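
For readers who want to see the idea without an extra package, below is a minimal base R sketch of winsorization. The helper name `winsorize()`, the 5th and 95th percentile cut-offs and the example data are assumptions chosen for illustration; this is not the `Winsorize()` function mentioned above, whose arguments may differ.

```r
# Hypothetical helper: replace values below/above the chosen quantiles
# with those quantiles (values in between are left untouched)
winsorize <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Made-up data with one extreme value
x <- c(1:19, 100)
w <- winsorize(x)

range(x)   # original range, including the extreme value
range(w)   # range after winsorization, pulled in towards the bulk of the data
```

The vector keeps its length, so no observation is deleted; only the most extreme values are replaced. The cut-off percentiles are a judgment call and should be reported alongside the results.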

DLAtatem commented 3 years ago

Antoine, many thanks. This is helpful.

Regards, Duncan