RoboticsClubIITJ / ML-DL-implementation

An implementation of ML and DL algorithms from scratch in python using nothing but NumPy and Matplotlib.
BSD 3-Clause "New" or "Revised" License
48 stars, 69 forks

Implement "Numerical outlier method" to detect anomaly/outlier points in a dataset #89

Closed Halix267 closed 3 years ago

Halix267 commented 3 years ago

@agrawalshubham01 Can I work on this?

agrawalshubham01 commented 3 years ago

@Halix267 Which method are you going to implement?

Halix267 commented 3 years ago

The IQR method, @agrawalshubham01.

Halix267 commented 3 years ago

For detecting outliers.

agrawalshubham01 commented 3 years ago

Kindly draft your API here.

Halix267 commented 3 years ago

Could you elaborate? I don't quite follow.

agrawalshubham01 commented 3 years ago

@Halix267 Draft your model code here.

Halix267 commented 3 years ago

import random

import matplotlib.pyplot as plt


def dataset_developer(n, probability=[0.05, 0.05, 0.15, 0.5, 0.15]):
    # Draw n values in [0, 1); each branch picks one band of width 0.2.
    X = []
    while len(X) < n:
        t = random.random()
        if t < probability[0]:
            X.append(random.random()*0.2)
        elif t >= probability[0] and t < probability[1]+probability[0]:
            X.append(random.random()*0.2+0.2)
        elif t >= probability[1]+probability[0] and t < probability[0]+probability[1]+probability[2]:
            X.append(random.random()*0.2+0.2*2)
        elif t >= probability[0]+probability[1]+probability[2] and t < probability[0]+probability[1]+probability[2]+probability[3]:
            X.append(random.random()*0.2+0.2*3)
        else:
            X.append(random.random()*0.2+0.2*4)
    return X


def numeric_outlier(X, k):
    # X must already be sorted, because the quartiles are read by index.
    Q1 = X[int((len(X)+1)/4)]
    Q3 = X[int(((len(X)+1)*3)/4)]
    IQR = Q3 - Q1
    outliers_index = [i for i in range(len(X)) if X[i] < Q1-k*IQR or X[i] > Q3+k*IQR]
    return outliers_index


def imager(X, Y, outlier_X, outlier_Y, l):
    # Shuffle the x positions, then plot inliers in green and outliers in red.
    X = random.sample(X, len(X))
    outlier_X = random.sample(outlier_X, l)
    plt.scatter(X, Y, color='green', marker='.')
    plt.scatter(outlier_X, outlier_Y, color='red', marker='*')
    plt.show()


def accuracy(outliers_index, X):
    # A point counts as correct if it is flagged and lies outside [0.5, 0.9],
    # or is not flagged and lies inside [0.5, 0.9].
    correct = 0
    for j in range(len(X)):
        if j in outliers_index:
            if X[j] < 0.5 or X[j] > 0.9:
                correct += 1
        else:
            if X[j] >= 0.5 and X[j] <= 0.9:
                correct += 1
    return correct/len(X)


def main(X, k):
    outliers_index = numeric_outlier(X, k)
    outliers = [X[i] for i in outliers_index]
    X_c = [X[i] for i in range(len(X)) if i not in outliers_index]
    imager(range(len(X_c)), X_c, range(len(X_c)), outliers, len(outliers_index))
    return accuracy(outliers_index, X)


k = 0.45
random.seed(0)
X = dataset_developer(1000)
X.sort()
acc = main(X, k)
print(acc)

Halix267 commented 3 years ago

@agrawalshubham01 Done. What should I do next?

agrawalshubham01 commented 3 years ago

What I am asking is how you would structure this as a class: what its methods and attributes would be, and how someone would use it. This is a Python package, so related functions need to be defined in a class whose properties can be inherited. How would an end user access it? I understand the logic; what I am asking is how you would draft your code so that someone else can use it.

@rohansingh9001 The logic looks fine to me, but please help him draft an API for it. We can keep this in a folder such as Preprocessors.

Halix267 commented 3 years ago

Yes, please guide me in drafting the API, @agrawalshubham01 @rohansingh9001.

rohansingh9001 commented 3 years ago

@Halix267 Sorry for the delay, I was busy with other commitments.

However, in an issue, try to be as descriptive as you can. No resources have been given for us to understand what the Numerical Outlier method is or what it can do. Even if we know this algorithm from our own background, we still do not know how robust your solution will be or what it is capable of.

Secondly, I recommend going through the code in the Examples directory. It contains examples of how an end user can use our library.

A user should be able to import your code and apply it to their own dataset.

For example, the Linear Regression model has a .fit() method that trains the model on the data given to it, along with various other methods. While your code may be logically correct, please describe what sort of data this model needs for training, what kind of predictions it can make, and which functions the end user will call to train and run it.

That is what @agrawalshubham01 meant by discussing an API.
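To make the request concrete, here is a minimal sketch of what a class-based API for the IQR method could look like. It is not the library's actual API or the code that was eventually merged: the class name `IQROutlierDetector`, the methods `fit`, `predict`, and `transform`, and the attributes `lower_` and `upper_` are all hypothetical. It also assumes quartiles come from `np.percentile` with its default linear interpolation, which differs from the index-based quartiles in the draft above.

```python
import numpy as np


class IQROutlierDetector:
    """Hypothetical preprocessor sketch; names are illustrative only."""

    def __init__(self, k=1.5):
        # k scales the IQR to set the fences; 1.5 is the conventional choice.
        self.k = k
        self.lower_ = None
        self.upper_ = None

    def fit(self, X):
        # Learn the lower and upper fences from a 1-D array of values.
        X = np.asarray(X, dtype=float)
        q1, q3 = np.percentile(X, [25, 75])
        iqr = q3 - q1
        self.lower_ = q1 - self.k * iqr
        self.upper_ = q3 + self.k * iqr
        return self

    def predict(self, X):
        # Return a boolean mask: True where a value falls outside the fences.
        if self.lower_ is None:
            raise RuntimeError("call fit() before predict()")
        X = np.asarray(X, dtype=float)
        return (X < self.lower_) | (X > self.upper_)

    def transform(self, X):
        # Return only the values that are not flagged as outliers.
        X = np.asarray(X, dtype=float)
        return X[~self.predict(X)]


if __name__ == "__main__":
    data = [10, 11, 12, 13, 14, 15, 100]
    detector = IQROutlierDetector(k=1.5).fit(data)
    print(detector.predict(data))    # mask marking the flagged points
    print(detector.transform(data))  # data with flagged points removed
```

With this shape, the end user only needs to know the input (a one-dimensional numeric column), the tuning parameter `k`, and the three calls above, which is the kind of description being asked for.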

Halix267 commented 3 years ago

OK @rohansingh9001, I will provide the dataset, but first can you guide me on how to proceed with this issue?

Halix267 commented 3 years ago

@rohansingh9001

As for your question: imagine a user is preprocessing a dataset that has a height column in feet,

and the values are 1.2, 1.3, 4, 5.1, 5.6, 6.3, 41.2, 50, ...

Clearly 1.2, 1.3, 41.2 and 50 are outliers, and they degrade model performance.

So, to improve model performance, the user can find the outliers and exclude them.
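As a rough check of this example, the sketch below applies the IQR rule to only the eight heights listed (the elided values are not filled in). The function name `iqr_outliers`, the default `k=1.5`, and the choice of `statistics.quantiles` with `method="inclusive"` are assumptions made for illustration, not part of the proposed code.

```python
from statistics import quantiles

# Only the values explicitly listed in the comment; the rest are unknown.
heights = [1.2, 1.3, 4, 5.1, 5.6, 6.3, 41.2, 50]


def iqr_outliers(values, k=1.5):
    # Quartiles with linear interpolation between closest ranks.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep the values that fall outside the fences.
    return [v for v in values if v < lower or v > upper]


print(iqr_outliers(heights))  # → [41.2, 50]
```

On this small sample the rule flags 41.2 and 50 but not 1.2 or 1.3: the two very large values stretch the IQR so far that the lower fence ends up below zero. Whether the low values get caught therefore depends on the rest of the column and on the choice of `k`.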

Halix267 commented 3 years ago

@rohansingh9001 @agrawalshubham01 Please take a look at this. Can I start working on this issue?

kwanit1142 commented 3 years ago

Resources:

https://naysan.ca/2020/06/28/interquartile-range-iqr-to-detect-outliers/ (see this implementation first)

https://towardsdatascience.com/why-1-5-in-iqr-method-of-outlier-detection-5d07fdc82097 (logic behind the implementation)

ssiddharth27 commented 3 years ago

I would like to work on this issue

kwanit1142 commented 3 years ago

Sure 👍

kwanit1142 commented 3 years ago

Thanks to @Siddharth-Singh27, the issue has been solved. Closing it.