In the code where you've shown how to apply stratified k-fold cross validation to a regression problem, I noticed a small bug.
def create_folds(data):
    # we create a new column called kfold and fill it with -1
    data["kfold"] = -1
    # the next step is to randomize the rows of the data
    data = data.sample(frac=1).reset_index(drop=True)
    # calculate the number of bins by Sturge's rule
    # I take the floor of the value, you can also
    # just round it
    num_bins = np.floor(1 + np.log2(len(data)))
    # bin targets
    data.loc[:, "bins"] = pd.cut(
        data["target"], bins=num_bins, labels=False
    )
    # initiate the kfold class from model_selection module
    kf = model_selection.StratifiedKFold(n_splits=5)
    # fill the new kfold column
    # note that, instead of targets, we use bins!
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, "kfold"] = f
    # drop the bins column
    data = data.drop("bins", axis=1)
    # return dataframe with folds
    return data
The bug is in this line:
num_bins = np.floor(1 + np.log2(len(data)))
Here, num_bins is of type numpy.float64, and when it is used to segregate the targets into bins (in the next part of the code), it throws an error:
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
The code in the PDF and the book for that line is written as:
num_bins = int(np.floor(1 + np.log2(len(data))))
By adding the int cast, you won't encounter that issue.
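To confirm, here is a minimal runnable sketch of the function with the int cast applied. The synthetic dataframe, the seed values, and the uniform targets are made up for illustration (uniform targets keep every bin populated well enough for stratification); the function body otherwise follows the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

def create_folds(data):
    # create the kfold column and shuffle the rows
    data["kfold"] = -1
    data = data.sample(frac=1, random_state=42).reset_index(drop=True)
    # int() is the fix: pd.cut expects an integer number of bins
    num_bins = int(np.floor(1 + np.log2(len(data))))
    # bin the continuous targets
    data.loc[:, "bins"] = pd.cut(
        data["target"], bins=num_bins, labels=False
    )
    # stratify on the bins, not the raw targets
    kf = model_selection.StratifiedKFold(n_splits=5)
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, "kfold"] = f
    return data.drop("bins", axis=1)

rng = np.random.default_rng(0)
df = create_folds(pd.DataFrame({"target": rng.uniform(size=1000)}))
print(sorted(df["kfold"].unique().tolist()))  # [0, 1, 2, 3, 4]
```

Without the int cast, the pd.cut call raises the TypeError above on older pandas/numpy versions; with it, each row is assigned one of five folds.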