Open hhan14 opened 6 years ago
Very interesting idea. Having a function or syntax that detects outliers, and possibly removes them, could greatly affect the accuracy of the model. I can't wait to see how you tackle the coding part of this. I believe you will have to set cut-offs at both the lower and the higher end, retain the data that fall within them, and ignore or flag everything else as possible outliers. Please share your ideas on how you would accomplish this.
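To make the cut-off idea concrete, here is a minimal sketch in Python. The function name `split_by_cutoffs` and the example cut-offs are made up for illustration; how the cut-offs should actually be chosen is still the open question for the proposal.

```python
def split_by_cutoffs(values, lower, upper):
    """Split values into (kept, flagged) using inclusive lower/upper cut-offs.

    Hypothetical helper: the cut-offs are supplied by the caller, not derived
    from the data.
    """
    kept, flagged = [], []
    for v in values:
        if lower <= v <= upper:
            kept.append(v)      # inside the accepted range
        else:
            flagged.append(v)   # possible outlier, flagged rather than deleted
    return kept, flagged


kept, flagged = split_by_cutoffs([3, 5, 4, 120, 6, -40], lower=0, upper=10)
print(kept)     # values inside [0, 10]
print(flagged)  # values below 0 or above 10
```

Returning the flagged values separately, instead of silently dropping them, leaves the decision to ignore them with the user.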
This might also be helpful for general descriptive stats on the dataset. For example, what about reporting the mean, median, and quartiles with and without the outliers, so you can see the difference they make?
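A rough sketch of that side-by-side report, using only Python's standard `statistics` module. The names `summarize` and `compare_summaries` and the `is_outlier` callback are assumptions for illustration; any detection rule could be plugged in.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics (needs at least two values)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


def compare_summaries(values, is_outlier):
    """Summarize the data with and without the points flagged by is_outlier."""
    cleaned = [v for v in values if not is_outlier(v)]
    return {
        "with_outliers": summarize(values),
        "without_outliers": summarize(cleaned),
    }


# Example rule (an assumption): anything above 10 counts as an outlier.
report = compare_summaries([3, 5, 4, 120, 6, 4], is_outlier=lambda v: v > 10)
for label, stats in report.items():
    print(label, stats)
```

In this example the mean moves a lot when the large value is dropped while the median barely changes, which is the kind of difference such a report would make visible.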
I imagine comparing the largest data point with the mode of the column. If it is more than a certain multiple of the mode, then that data point may be an outlier!
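One possible reading of the mode comparison, as a sketch. The function name `max_exceeds_mode` and the default `factor` of 10 are arbitrary assumptions, not values anyone in the thread proposed.

```python
import statistics


def max_exceeds_mode(values, factor=10):
    """Check whether the largest value is more than `factor` times the mode.

    Returns (is_suspect, largest, mode). Hypothetical helper for illustration.
    """
    mode = statistics.mode(values)  # most common value; first one wins on ties
    largest = max(values)
    # abs() keeps the threshold non-negative when the mode is negative.
    is_suspect = largest > factor * abs(mode)
    return is_suspect, largest, mode


print(max_exceeds_mode([4, 4, 5, 6, 120]))  # largest is 120, mode is 4
print(max_exceeds_mode([4, 4, 5, 6, 7]))    # largest is 7, mode is 4
```

Two caveats with this reading: a mode of 0 makes the threshold 0, so any positive maximum gets flagged, and continuous data with no repeated values has no meaningful mode.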
How about a more statistical approach? For example, a value more than 2 SD from the mean could be considered an outlier, because it would fall pretty much outside the bulk of the distribution (assuming the data are normal).
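The 2 SD rule could be sketched like this. The function name `flag_by_sd` and the parameter `k` are assumptions for illustration; the sample standard deviation is used here, though the population version would also be a reasonable choice.

```python
import statistics


def flag_by_sd(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean.

    Hypothetical helper; needs at least two values.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - mean) > k * sd]


data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
print(flag_by_sd(data))  # only 50 lies more than 2 SD from the mean here
```

Note that under a normal distribution roughly 5% of values fall outside 2 SD even when nothing is wrong, and an extreme value inflates the mean and SD it is judged against, so in small samples it can partly hide itself.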
Proposal of OUTLIER DETECTION