Data pre-processing: Outlier Detection

hhan14 commented 6 years ago

Proposal of OUTLIER DETECTION

Concept: After loading/importing a dataset, the Outlier Detection function would report the following without creating graphs: (1) whether the dataset contains outliers, (2) which features contain outliers, and (3) the proportion of outliers in the dataset as a whole and in each feature.
Objective: (1) To better understand the results of analyses/machine learning models that could be polluted or distorted by outliers. (2) To adjust data pre-processing to account for potential bias in the dataset.
Why valuable: (1) Reduces the time and resources spent interpreting results. (2) When new observations are added, the function makes it easier to grasp how much impact the new data have, and in what way.
Features of software function: ranging, filtering, scaling, regularizing, etc. *This is an initial stage of idea generation at the moment; feel free to add/suggest any thoughts/opinions.
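
To make the three proposed outputs concrete, here is a minimal sketch in Python. The proposal does not say how an outlier is defined, so this assumes Tukey's 1.5 × IQR rule as a stand-in. The names `iqr_bounds` and `outlier_report`, the dict-of-lists input, and the sample data are all hypothetical. "Overall share" is read here as flagged values divided by all values across features, which is only one possible interpretation.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences using the assumed 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def outlier_report(dataset, k=1.5):
    """Summarise outliers per feature for a dict of {feature: [numbers]}."""
    share_per_feature = {}
    flagged_total = 0
    value_total = 0
    for name, values in dataset.items():
        low, high = iqr_bounds(values, k)
        n_flagged = sum(1 for v in values if v < low or v > high)
        share_per_feature[name] = n_flagged / len(values)
        flagged_total += n_flagged
        value_total += len(values)
    features = [n for n, share in share_per_feature.items() if share > 0]
    return {
        "has_outliers": bool(features),            # output (1)
        "features_with_outliers": features,        # output (2)
        "share_per_feature": share_per_feature,    # output (3), per feature
        "share_overall": flagged_total / value_total,  # output (3), overall
    }


# Made-up sample data for illustration only.
report = outlier_report({
    "age": [23, 25, 27, 29, 31, 33, 35, 120],
    "height_cm": [160, 165, 170, 172, 175, 178, 180, 182],
})
print(report)
```

With this sample, only `age` is flagged, because 120 lies above the upper fence for that column.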

nirveshk commented 6 years ago

Very interesting idea. Having a function/syntax in place to determine the presence of outliers, and possibly to get rid of such values, can greatly impact the accuracy of the model. I can't wait to see how you will tackle the coding part of this. I believe you will have to set a cutoff at both the lower and the higher end, retain the data that fall within those cutoffs, and ignore/flag the other data points as possible outliers. Please share your ideas on how you would accomplish this.
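
The lower/upper cutoff idea could be sketched roughly as below. The function name `split_by_cutoffs` and the example cutoffs are hypothetical. How the cutoffs should be chosen is left open in the thread, so here the caller simply passes them in.

```python
def split_by_cutoffs(values, lower, upper):
    """Split values into (kept, flagged) using inclusive lower/upper cutoffs."""
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    kept, flagged = [], []
    for v in values:
        # Values inside [lower, upper] are retained; the rest are flagged
        # as possible outliers rather than silently dropped.
        if lower <= v <= upper:
            kept.append(v)
        else:
            flagged.append(v)
    return kept, flagged


# Illustrative cutoffs of 0 and 50 on made-up readings.
kept, flagged = split_by_cutoffs([-40, 12, 15, 18, 21, 95], lower=0, upper=50)
print(kept, flagged)
```

Returning the flagged values separately keeps the choice between ignoring and merely flagging them with the caller.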

choikwun commented 6 years ago

This might be helpful with general descriptive stats on the dataset. For example, what about reporting the mean, median, and quartiles with and without the outliers, so you can see the difference between having and not having them?
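
A rough sketch of that side-by-side comparison, using only Python's `statistics` module. The helper names `describe` and `compare_with_and_without` are hypothetical, and the outlier test is passed in as a function because the thread has not settled on a rule; the `v > 50` threshold below is purely illustrative.

```python
from statistics import mean, median, quantiles


def describe(values):
    """Mean, median, and first/third quartiles of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    return {"mean": mean(values), "median": median(values), "q1": q1, "q3": q3}


def compare_with_and_without(values, is_outlier):
    """Describe the data twice: as given, and with flagged values removed."""
    cleaned = [v for v in values if not is_outlier(v)]
    return {
        "with_outliers": describe(values),
        "without_outliers": describe(cleaned),
    }


# Made-up data; the threshold of 50 is an arbitrary stand-in rule.
summary = compare_with_and_without(
    [10, 12, 11, 13, 12, 100], is_outlier=lambda v: v > 50
)
for label, stats in summary.items():
    print(label, stats)
```

In this sample the median stays at 12 either way while the mean drops sharply once 100 is removed, which is the kind of difference the report would surface.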

nitieaj commented 6 years ago

I imagine comparing the largest data point with the mode of the column. If it is greater than the mode by more than a certain magnitude, then that data point may be an outlier!
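
One way to read this suggestion is as a ratio test: flag the maximum when it exceeds some multiple of the mode. The sketch below assumes that reading. The name `max_exceeds_mode` and the default factor of 3 are hypothetical, and the check only makes sense for columns with repeated values and a non-zero mode.

```python
from statistics import mode


def max_exceeds_mode(values, factor=3.0):
    """Return (is_suspect, largest, most_common) for one column.

    The largest value is treated as a possible outlier when it is more
    than `factor` times the absolute value of the mode (assumed rule).
    """
    most_common = mode(values)  # first mode if several values tie
    largest = max(values)
    return largest > factor * abs(most_common), largest, most_common


# Made-up columns: one with an extreme maximum, one without.
suspect = max_exceeds_mode([4, 5, 5, 5, 6, 7, 40])
normal = max_exceeds_mode([4, 5, 5, 5, 6, 7])
print(suspect, normal)
```

This only looks at the high end, so unusually small values would need a matching check against the minimum.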

choikwun commented 6 years ago

How about a more statistical approach? Like maybe if a value is more than 2 SD from the mean, it's considered an outlier? Because that would be outside roughly 95% of the distribution (assuming it's normal).
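
The 2 SD rule is short to write down. This sketch uses the sample standard deviation from Python's `statistics` module; the function name `flag_beyond_sd` and the sample values are hypothetical, and the rule assumes roughly normal data, as noted above.

```python
from statistics import mean, stdev


def flag_beyond_sd(values, n_sd=2.0):
    """Return the values lying more than n_sd sample SDs from the mean."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - centre) > n_sd * spread]


# Made-up data with one obvious extreme value.
flagged_values = flag_beyond_sd([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 50])
print(flagged_values)
```

One caveat: the extreme values themselves inflate both the mean and the SD, so on small samples a large outlier can hide smaller ones.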

EHWUSF / HS68_2018_Project_1, issue #14