RoboticsClubIITJ / ML-DL-implementation

An implementation of ML and DL algorithms from scratch in Python using nothing but NumPy and Matplotlib.
BSD 3-Clause "New" or "Revised" License
49 stars 69 forks

Implement One hot encoding #41

Closed SaiSrichandra closed 3 years ago

SaiSrichandra commented 4 years ago

Hello, I would like to add one-hot encoding to utils.

rohansingh9001 commented 4 years ago

Please specify exactly what you are trying to do and which functions and classes you are adding. One-hot encoding is already used by some models.

Player0109 commented 3 years ago

Can I work on this?

Player0109 commented 3 years ago

Here is the sample code of what I will do:

import numpy as np


class OneHotEncoder():

    def fit(self, X, thresh):
        n = X.shape[0]
        m = X.shape[1]

        self.arr_dic = []
        self.arr_nunique = []
        self.encode = []
        for i in range(m):
            ls = np.unique(X[:, i])
            n_unique = len(ls)
            dic = dict()

            # Leave the column as-is if its share of unique values
            # exceeds thresh; otherwise mark it for encoding.
            if n_unique / n > thresh:
                self.encode.append(0)
            else:
                self.encode.append(1)
                # Map each unique value to its one-hot column index.
                for j, val in enumerate(ls):
                    dic[val] = j

            self.arr_dic.append(dic)
            self.arr_nunique.append(n_unique)

    def transform(self, X):
        n = X.shape[0]
        m = X.shape[1]

        self.out_X = []
        for i in range(m):
            # Columns not marked for encoding are passed through unchanged.
            if self.encode[i] == 0:
                self.out_X.append(X[:, i].reshape(n, 1))
                continue
            n_unique = self.arr_nunique[i]
            dic = self.arr_dic[i]
            col = np.zeros((n, n_unique))
            enc_col = [dic[x] for x in X[:, i]]
            # In each row, set the position of that row's value to 1.
            col[np.arange(n), enc_col] = 1
            self.out_X.append(col)

        self.out_X = np.concatenate(self.out_X, axis=1)
        return self.out_X

    def fit_transform(self, X, thresh):
        self.fit(X, thresh)
        return self.transform(X)

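
For readers following the thread, the core of transform is the indexing step that writes a single 1 into each row. Below is a minimal sketch of that step for one column. The helper name one_hot_column is hypothetical and not part of the repository, and it swaps the dictionary lookup used above for np.unique with return_inverse=True, which yields the same integer codes.

```python
import numpy as np


def one_hot_column(column):
    """One-hot encode a 1-D array; returns (categories, encoded matrix)."""
    # np.unique sorts the distinct values; return_inverse gives, for each
    # element, the index of its value in that sorted list.
    categories, codes = np.unique(column, return_inverse=True)
    codes = codes.reshape(-1)
    encoded = np.zeros((column.shape[0], categories.shape[0]))
    # Fancy indexing: in row r, set the column given by codes[r] to 1.
    encoded[np.arange(column.shape[0]), codes] = 1
    return categories, encoded


colors = np.array(["red", "green", "red", "blue"])
categories, encoded = one_hot_column(colors)
print(categories)
print(encoded)
```

Because the categories come back sorted, the output columns here correspond to blue, green, and red in that order.
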
bhanuexcalibur commented 3 years ago

Can I work on this?

rohansingh9001 commented 3 years ago

On a first-come, first-served basis, @Player0109, you can work on this. Make sure your implementation is mathematically correct. If something in the issue is not clear, you can ask for more detail in the Gitter channel.

Player0109 commented 3 years ago

@rohansingh9001 In which file and folder do I have to implement this class?

rohansingh9001 commented 3 years ago

@Player0109 In the MLlib/utils/misc_utils.py file. If required, we will move your code to a more suitable place in the future.

However, I would like to discuss the API of your class. Can you explain your class and its methods, including their parameters and what they output?

Player0109 commented 3 years ago

import numpy as np


class OneHotEncoder():

    # FUNCTIONS
    # (1) FIT(INPUT_X, THRESHOLD) --- Counts the unique values in each
    #     column and decides whether that column should be encoded.
    # (2) CHECK_TRANSFORM(INPUT_X) --- Checks that the data being
    #     transformed has the same values as the data used to fit.
    # (3) TRANSFORM(INPUT_X) --- One-hot encodes the data, based on the
    #     data that was used to fit.
    # (4) FIT_TRANSFORM(INPUT_X, THRESHOLD) --- A combination of the fit
    #     and transform functions.

    # INPUTS
    # X - A NumPy array of size n x m.
    # thresh - A threshold on the ratio
    #     (NUMBER OF UNIQUE VALUES IN A COLUMN) / (LENGTH OF COLUMN).
    #     A column whose ratio is at or below thresh is encoded;
    #     otherwise it is left unchanged.

    # VARIABLES
    # ncols - The number of columns in the fit data.
    # arr_dic - A list of dictionaries, one per column, each mapping a
    #     column's unique values to their label-encoded indices.
    # arr_nunique - A list storing the number of unique values in each
    #     column.
    # encode - A list with one entry per column of the fit data: 1 if
    #     the column is to be encoded, otherwise 0.

    def fit(self, X, thresh):
        n = X.shape[0]
        m = X.shape[1]

        self.ncols = m
        self.arr_dic = []
        self.arr_nunique = []
        self.encode = []
        for i in range(m):
            ls = np.unique(X[:, i])
            n_unique = len(ls)
            dic = dict()

            if n_unique / n > thresh:
                self.encode.append(0)
            else:
                self.encode.append(1)
                for j, val in enumerate(ls):
                    dic[val] = j

            self.arr_dic.append(dic)
            self.arr_nunique.append(n_unique)

    def check_transform(self, X):
        m = X.shape[1]

        if self.ncols != m:
            print("Number of columns in the fit and transform data differ")
            return False

        for i in range(m):
            # Unencoded columns have an empty mapping, so there is
            # nothing to validate; checking them would always fail.
            if self.encode[i] == 0:
                continue

            ls = np.unique(X[:, i])
            n_unique = len(ls)

            if self.arr_nunique[i] != n_unique:
                print('Mismatch in the number of unique values in the '
                      + str(i) + 'th column')
                return False
            for val in ls:
                if val not in self.arr_dic[i]:
                    print(str(i) + 'th column contains a value which '
                          'was not in the data used to fit')
                    return False

        return True

    def transform(self, X):
        if not self.check_transform(X):
            return None

        n = X.shape[0]
        m = X.shape[1]

        self.out_X = []
        for i in range(m):
            # Columns not marked for encoding are passed through unchanged.
            if self.encode[i] == 0:
                self.out_X.append(X[:, i].reshape(n, 1))
                continue
            n_unique = self.arr_nunique[i]
            dic = self.arr_dic[i]
            col = np.zeros((n, n_unique))
            enc_col = [dic[x] for x in X[:, i]]
            # In each row, set the position of that row's value to 1.
            col[np.arange(n), enc_col] = 1
            self.out_X.append(col)

        self.out_X = np.concatenate(self.out_X, axis=1)
        return self.out_X

    def fit_transform(self, X, thresh):
        self.fit(X, thresh)
        return self.transform(X)

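
To make the thresh parameter concrete, here is a small sketch of the decision rule on its own, separated from the encoder. The function name columns_to_encode and the sample array are hypothetical and used only for illustration; the comparison mirrors the one in fit, where a column is skipped only when its unique-value ratio is strictly greater than thresh.

```python
import numpy as np


def columns_to_encode(X, thresh):
    """Return one boolean per column: True if it would be one-hot encoded."""
    n_rows = X.shape[0]
    decisions = []
    for i in range(X.shape[1]):
        # Share of distinct values in this column, between 1/n_rows and 1.
        ratio = len(np.unique(X[:, i])) / n_rows
        # Same rule as fit(): encode unless the ratio exceeds thresh.
        decisions.append(ratio <= thresh)
    return decisions


# Column 0 looks categorical (2 distinct values in 4 rows, ratio 0.5).
# Column 1 looks continuous (4 distinct values in 4 rows, ratio 1.0).
X = np.array([
    [0, 10.5],
    [1, 11.2],
    [0, 12.9],
    [1, 13.1],
])
decisions = columns_to_encode(X, thresh=0.5)
print(decisions)
```

With thresh=0.5 only the first column is marked for encoding, so a low threshold keeps near-unique columns such as continuous measurements out of the one-hot expansion.
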
Player0109 commented 3 years ago

@rohansingh9001 Can you please review the above code?

rohansingh9001 commented 3 years ago

@Player0109 I like your approach, but pasting code in a comment is not the proper way to ask for a review. Therefore, I request that you submit a Pull Request with these changes. That is how proper reviews are made on GitHub.

There are a few small changes I want to request. I will do so once you submit a proper PR. Also, make sure that you follow the PEP 8 style guidelines. To check your code for style violations, run a tool like flake8 on it. Otherwise, your PR will fail the automated tests we have in place.