abhishekkrthakur / approachingalmost

Approaching (Almost) Any Machine Learning Problem

A small bug on page 28. #23

Open deepansh96 opened 4 years ago

deepansh96 commented 4 years ago

In the code where you've shown how to apply stratified k-fold cross-validation to a regression problem, I noticed a small bug.

    # we create a new column called kfold and fill it with -1
    data["kfold"] = -1

    # the next step is to randomize the rows of the data
    data = data.sample(frac=1).reset_index(drop=True)

    # calculate the number of bins by Sturge's rule
    # I take the floor of the value, you can also
    # just round it
    num_bins = np.floor(1 + np.log2(len(data)))

    # bin targets
    data.loc[:, "bins"] = pd.cut(
        data["target"], bins=num_bins, labels=False
    )

    # initiate the kfold class from model_selection module
    kf = model_selection.StratifiedKFold(n_splits=5)

    # fill the new kfold column
    # note that, instead of targets, we use bins!
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, "kfold"] = f

    # drop the bins column
    data = data.drop("bins", axis=1)
    # return dataframe with folds
    return data

The bug is in this line: `num_bins = np.floor(1 + np.log2(len(data)))`

`num_bins` is of type `numpy.float64`.

When this value is used to segregate the targets into bins (in the next part of the code), it throws an error: `TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer`.

Proposed solution:

    num_bins = num_bins.astype(int)
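As a quick sanity check, the cast does resolve the type error. A minimal sketch (the DataFrame and its `target` column here are synthetic placeholders, not data from the book):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a regression dataset
data = pd.DataFrame({"target": np.random.rand(1000)})

# Sturge's rule returns a numpy.float64 scalar (here: 10.0)
num_bins = np.floor(1 + np.log2(len(data)))

# the proposed fix: numpy scalars support .astype, so this
# yields a numpy integer that pd.cut accepts as a bin count
num_bins = num_bins.astype(int)

# no TypeError: bins are labeled 0 .. num_bins - 1
data["bins"] = pd.cut(data["target"], bins=num_bins, labels=False)
```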
DarkArcZ commented 3 years ago

The code in the pdf and the book for that line is written as `num_bins = int(np.floor(1 + np.log2(len(data))))`. By adding the `int()`, you won't encounter that issue either.
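For completeness, here is the whole snippet with that `int()` cast applied, wrapped in a function so it runs standalone (the function name and the synthetic `target` data are my own additions; `model_selection` comes from scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

def create_folds(data):
    # new column for fold ids, initialized to -1
    data["kfold"] = -1
    # randomize the rows of the data
    data = data.sample(frac=1).reset_index(drop=True)
    # number of bins by Sturge's rule, cast to int so pd.cut accepts it
    num_bins = int(np.floor(1 + np.log2(len(data))))
    # bin the continuous targets
    data.loc[:, "bins"] = pd.cut(
        data["target"], bins=num_bins, labels=False
    )
    # stratify on the bins rather than the raw targets
    kf = model_selection.StratifiedKFold(n_splits=5)
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, "kfold"] = f
    # drop the helper column and return the dataframe with folds
    return data.drop("bins", axis=1)

# usage with placeholder data
df = create_folds(pd.DataFrame({"target": np.random.rand(1000)}))
```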