EHWUSF / HS68_2018_Project_1


Data pre-processing: Outlier Detection #14

Open hhan14 opened 6 years ago

hhan14 commented 6 years ago

Proposal of OUTLIER DETECTION

nirveshk commented 6 years ago

Very interesting idea. Having a function in place to detect outliers, and possibly remove them, can greatly affect the accuracy of the model. I can't wait to see how you tackle the coding part of this. I believe you will have to set a cutoff at both the lower and the higher end, retain the data that fall between them, and ignore or flag everything else as possible outliers. Please share your idea of how you would accomplish this.
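To make the cutoff idea concrete, here is a minimal sketch, not a proposed implementation for the project. The function name `split_by_cutoffs`, the sample `ages` column, and the cutoffs 0 and 120 are all hypothetical and chosen only for illustration.

```python
def split_by_cutoffs(values, lower, upper):
    """Split values into (kept, flagged) using inclusive cutoffs.

    kept    -> values inside [lower, upper]
    flagged -> values below lower or above upper (possible outliers)
    """
    kept = [v for v in values if lower <= v <= upper]
    flagged = [v for v in values if v < lower or v > upper]
    return kept, flagged


# Hypothetical column of ages; the cutoffs 0 and 120 are assumptions.
ages = [23, 31, 27, 35, 29, 140, -4]
kept, flagged = split_by_cutoffs(ages, lower=0, upper=120)
print("kept:", kept)
print("flagged:", flagged)
```

Returning the flagged values instead of silently dropping them leaves the decision to ignore or keep them with the user.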

choikwun commented 6 years ago

This might be helpful alongside general descriptive stats on the dataset. For example, what about reporting the mean, median, and quartiles with and without the outliers, so you can see the difference between having and not having them?
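A rough sketch of that side-by-side report, using only Python's standard `statistics` module. The helper `summarize`, the sample values, and the cutoffs used to define "without outliers" are assumptions made for illustration.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical data; 140 plays the role of the outlier.
values = [23, 27, 29, 31, 35, 140]
# Assumed cutoffs (0 and 120) define what counts as an outlier here.
cleaned = [v for v in values if 0 <= v <= 120]

with_outliers = summarize(values)
without_outliers = summarize(cleaned)
for name in with_outliers:
    print(name, with_outliers[name], without_outliers[name])
```

In this example the mean moves much more than the median once the extreme value is removed, which is the kind of difference the report is meant to surface.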

nitieaj commented 6 years ago

I imagine comparing the largest data point with the mode of the column. If it is greater than a certain multiple of the mode, then that data point may be an outlier!
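One way this comparison could look, as a sketch only. The function name `max_exceeds_mode` and the default `factor=3.0` are hypothetical. The check assumes a column of positive values with repeats, since the mode is not very meaningful when every value is distinct.

```python
import statistics


def max_exceeds_mode(values, factor=3.0):
    """Check whether the largest value is more than `factor` times the mode.

    Returns (is_suspect, largest, mode).
    """
    # In Python 3.8+, statistics.mode returns the first mode encountered
    # when several values are tied for most common.
    mode = statistics.mode(values)
    if mode <= 0:
        raise ValueError("a ratio test against the mode needs a positive mode")
    largest = max(values)
    return largest > factor * mode, largest, mode


# Hypothetical column: 3 is the mode, 40 is the largest value.
column = [2, 3, 3, 3, 4, 5, 40]
result = max_exceeds_mode(column)
print(result)
```

This only inspects the single largest value, so it would miss low-end outliers and any second extreme value.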

choikwun commented 6 years ago

How about a more statistical approach? For example, a value more than 2 SD from the mean could be considered an outlier, because it would fall pretty much outside the bulk of the distribution (assuming the data are normal).
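A minimal sketch of the 2 SD rule with the standard library. The function name `flag_beyond_k_sd`, the parameter `k`, and the sample data are assumptions for illustration.

```python
import statistics


def flag_beyond_k_sd(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD; needs at least two values
    return [v for v in values if abs(v - mean) > k * sd]


# Hypothetical measurements with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 50]
flagged = flag_beyond_k_sd(data)
print(flagged)
```

Two caveats. Even for perfectly normal data, roughly 5% of values fall outside 2 SD, so some flagged points will be ordinary. The outliers themselves also inflate the mean and SD used for the test, which can hide them in small samples.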