SaiSrichandra closed this 3 years ago
Please specify what exactly you are trying to do and which functions and classes you are adding. One-hot encoding is already used by some models.
Can I work on this?
Here is the sample code of what I will do:
import numpy as np


class OneHotEncoder():
    def fit(self, X, thresh):
        n = X.shape[0]
        m = X.shape[1]
        self.arr_dic = []
        self.arr_nunique = []
        self.encode = []
        for i in range(m):
            ls = np.unique(X[:, i])
            n_unique = len(ls)
            dic = dict()
            # Leave the column as-is when it has too many distinct values
            if n_unique / n > thresh:
                self.encode.append(0)
            else:
                self.encode.append(1)
                # Map each unique value to its one-hot column index
                for j, val in enumerate(ls):
                    dic[val] = j
            self.arr_dic.append(dic)
            self.arr_nunique.append(n_unique)

    def transform(self, X):
        n = X.shape[0]
        m = X.shape[1]
        self.out_X = []
        for i in range(m):
            # Pass non-encoded columns through unchanged
            if self.encode[i] == 0:
                self.out_X.append(X[:, i].reshape(n, 1))
                continue
            n_unique = self.arr_nunique[i]
            dic = self.arr_dic[i]
            col = np.zeros((n, n_unique))
            enc_col = [dic[x] for x in X[:, i]]
            # Set a single 1 per row at the index of that row's value
            col[np.arange(n), enc_col] = 1
            self.out_X.append(col)
        self.out_X = np.concatenate(self.out_X, axis=1)
        return self.out_X

    def fit_transform(self, X, thresh):
        self.fit(X, thresh)
        return self.transform(X)
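To make the thresh rule in fit() concrete, here is a minimal sketch of the per-column decision on its own. The toy array and the cut-off of 0.5 are assumptions picked only for illustration; they are not taken from the repository.

```python
import numpy as np

# Hypothetical toy data: column 0 is categorical (3 distinct values),
# column 1 is effectively continuous (every value is distinct).
X = np.array([
    [0, 1.5],
    [1, 2.5],
    [2, 3.5],
    [0, 4.5],
    [1, 5.5],
    [2, 6.5],
])

thresh = 0.5  # assumed cut-off, chosen only for this illustration
n = X.shape[0]

# Ratio of distinct values to rows, computed per column.
ratios = [len(np.unique(X[:, i])) / n for i in range(X.shape[1])]

# Same rule as fit(): encode only when the ratio does not exceed thresh.
encode = [0 if r > thresh else 1 for r in ratios]

print(ratios)  # [0.5, 1.0]
print(encode)  # [1, 0]
```

With these numbers, column 0 would be one-hot encoded and column 1 would be passed through unchanged.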
Can I work on this?
On a first-come, first-served (FCFS) basis, @Player0109, you can work on this. Make sure your implementation is mathematically correct. If something is not clear in the issue, you can ask for more detail in the Gitter channel.
@rohansingh9001 In which file and folder do I have to implement this class?
@Player0109 In the MLlib/utils/misc_utils.py file. If required, we will move your code to a more suitable place in the future.
However, I would like to discuss the API of your class. Can you explain your class and its methods, including their parameters and what each one returns?
import numpy as np


class OneHotEncoder():
    # FUNCTIONS
    # (1) FIT(INPUT_X, THRESHOLD) --- Counts the unique values in each column and decides whether that column should be encoded.
    # (2) CHECK_TRANSFORM(INPUT_X) --- Checks whether the data being transformed has the same values as the data that was used to fit.
    # (3) TRANSFORM(INPUT_X) --- One-hot encodes the data based on the data that was used to fit.
    # (4) FIT_TRANSFORM(INPUT_X, THRESHOLD) --- Combines the fit and transform functions.
    # INPUTS
    # X - A numpy array of size n x m.
    # thresh - A threshold on the ratio (NUMBER OF UNIQUE VALUES IN A COLUMN)/(LENGTH OF COLUMN). A column whose ratio is at or below thresh is encoded; otherwise it is left unchanged.
    # VARIABLES
    # ncols - Stores the number of columns in the fit data.
    # arr_dic - An array of dictionaries, where each dictionary holds the label-encoded values of one column.
    # arr_nunique - An array that stores the number of unique values in each column.
    # encode - An array with one entry per column of the fit data: 1 if the column is to be encoded, otherwise 0.
    def fit(self, X, thresh):
        n = X.shape[0]
        m = X.shape[1]
        self.ncols = m
        self.arr_dic = []
        self.arr_nunique = []
        self.encode = []
        for i in range(m):
            ls = np.unique(X[:, i])
            n_unique = len(ls)
            dic = dict()
            if n_unique / n > thresh:
                self.encode.append(0)
            else:
                self.encode.append(1)
                # Map each unique value to its one-hot column index
                for j, val in enumerate(ls):
                    dic[val] = j
            self.arr_dic.append(dic)
            self.arr_nunique.append(n_unique)

    def check_transform(self, X):
        m = X.shape[1]
        if self.ncols != m:
            print("Number of columns in input data of fit and transform are different")
            return False
        for i in range(m):
            # Non-encoded columns are passed through, so their values need not match
            if self.encode[i] == 0:
                continue
            ls = np.unique(X[:, i])
            n_unique = len(ls)
            if self.arr_nunique[i] != n_unique:
                print('Mismatch in the number of unique values in the '+str(i)+'th column')
                return False
            for val in ls:
                if val not in self.arr_dic[i]:
                    print(str(i)+'th column contains a value which was not in the data used to fit')
                    return False
        return True

    def transform(self, X):
        # Refuse to transform data that does not match the fit data
        if not self.check_transform(X):
            return None
        n = X.shape[0]
        m = X.shape[1]
        self.out_X = []
        for i in range(m):
            if self.encode[i] == 0:
                self.out_X.append(X[:, i].reshape(n, 1))
                continue
            n_unique = self.arr_nunique[i]
            dic = self.arr_dic[i]
            col = np.zeros((n, n_unique))
            enc_col = [dic[x] for x in X[:, i]]
            # Set a single 1 per row at the index of that row's value
            col[np.arange(n), enc_col] = 1
            self.out_X.append(col)
        self.out_X = np.concatenate(self.out_X, axis=1)
        return self.out_X

    def fit_transform(self, X, thresh):
        self.fit(X, thresh)
        return self.transform(X)
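For reference, the per-column encoding step and the unseen-value check can be sketched in isolation as below. The helper name one_hot_column is hypothetical and not part of the class above, and raising ValueError instead of printing and returning None is an assumed alternative, not what the posted code does.

```python
import numpy as np


def one_hot_column(values, categories):
    """One-hot encode a 1-D array against a fixed list of categories.

    Hypothetical helper for illustration only.
    """
    # Map each known category to its output column index.
    index = {val: j for j, val in enumerate(categories)}

    # Reject values that were not present when the categories were collected.
    unseen = [v for v in values if v not in index]
    if unseen:
        raise ValueError(f"values not seen during fit: {unseen}")

    out = np.zeros((len(values), len(categories)))
    # Set a single 1 per row, at the column of that row's category.
    out[np.arange(len(values)), [index[v] for v in values]] = 1
    return out


train = np.array([2, 0, 1, 2])
categories = np.unique(train)  # sorted unique values: 0, 1, 2
encoded = one_hot_column(train, categories)
print(encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Raising an exception makes a mismatch hard to ignore, whereas returning None leaves it to the caller to check the result; which one fits the library better is a design choice for the review.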
@rohansingh9001 can you please review the OneHotEncoder class above?
@Player0109 I like your approach, but as far as a proper review is concerned, pasting code in a comment is not the proper way to ask for reviews. Therefore, I request you to submit a Pull Request with these changes. That is how proper reviews are made on GitHub.
There are a few small changes I want to request. I will do so once you submit a proper PR. Also, make sure that you follow the PEP8 programming standards. To check that your code complies, run a tool like flake8 on it. Otherwise, your PR will fail the automated tests we have in place.
Hello, I would like to add one-hot encoding to utils.