Closed marta-sd closed 7 years ago
Minor:
[x] You shouldn't take absolute values of correlation coefficients - you lose information this way. The most common approach is to square R (R^2)
[x] Pearson R is not always the best, e.g. it can overestimate clustered groups of points. Check Spearman's R and Kendall's Tau
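For illustration, the three coefficients can be compared on a monotonic but non-linear relation (the data here is made up; the `scipy.stats` functions are the standard ones):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x ** 3 + rng.normal(scale=0.5, size=200)  # monotonic, but not linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

# R^2 keeps the strength of the relation without discarding information the
# way abs() does; the rank-based coefficients handle the non-linearity better
print(f'R^2 = {pearson_r ** 2:.3f}, Spearman = {spearman_r:.3f}, Kendall = {kendall_tau:.3f}')
```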
[x] All imports should be at the top of notebook.
[x] The headlines need some more comments, especially at the end: explain why you train a LogisticRegression model
[x] PEP8 (spaces and commas, assignment), e.g.

```python
conc = pd.concat([continuous, data['class of diagnosis']], axis=1)
for col in continuous:
    g=sns.FacetGrid(size=5,data=conc, col='class of diagnosis', hue='class of diagnosis')
    g.map(sns.distplot, col ,rug=True, kde=True)
```
[x] `encoded_column_names` should be created automatically from `categoricals_BIN_names`, `categoricals_non_BIN_names` and `categoricals_names`
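Sketched below - the list contents and the `_encoded` suffix are hypothetical, only the list names come from the notebook:

```python
# hypothetical contents - the real lists live in the notebook
categoricals_BIN_names = ['smoker', 'hypertension']
categoricals_non_BIN_names = ['blood group']
categoricals_names = categoricals_BIN_names + categoricals_non_BIN_names

# derive the encoded names instead of hard-coding them
encoded_column_names = [f'{name}_encoded' for name in categoricals_names]
print(encoded_column_names)
```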
[x] You don't need another `train_test_split` here - you already created `train` and `test` in the last section
[x] Fit the `scaler` on the training data only (`scaler.fit(train_continuous)` and then `scaler.transform(continuous)`)
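A minimal sketch of the fit-on-train-only pattern (the array values are made up; `scaler` and `train_continuous` reuse the notebook's names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up continuous features, already split into train and test
train_continuous = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_continuous = np.array([[2.5, 15.0]])

scaler = StandardScaler()
scaler.fit(train_continuous)                      # statistics come from train only
train_scaled = scaler.transform(train_continuous)
test_scaled = scaler.transform(test_continuous)   # same statistics, no refitting
```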
Major:
[x] Use `np.concatenate` or `np.hstack` to join `onehot_X`, `categorical_binary` and `scaled_cont`. Also never hard-code the shapes and indices - what if you get more data or decide to remove one of the variables? Use `onehot_X.shape[1]` etc. (but this time just stack the arrays and don't worry about the shape)
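The stacking advice above, sketched with made-up shapes (only the array names come from the notebook):

```python
import numpy as np

# made-up placeholders with the notebook's array names
onehot_X = np.zeros((100, 7))
categorical_binary = np.ones((100, 3))
scaled_cont = np.full((100, 4), 0.5)

# join column-wise without hard-coding any widths
X = np.hstack([onehot_X, categorical_binary, scaled_cont])

# if a width is ever needed later, read it from the array itself
n_onehot = onehot_X.shape[1]
print(X.shape, n_onehot)
```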
Minor:
[x] Don't call `categoricals.groupby(['class of diagnosis'])` in each iteration - you can create the groups once before the loop
[x] Move `plt.legend(title=col, loc='upper right', frameon=True)` to the end of the loop
[x] In `sns.barplot(x='class of diagnosis', y="percentage", hue=col, data=counts)`, use consistent quoting in the whole repo (I prefer single quotes)
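The groupby point can be sketched like this (the frame and its columns are made up; only `class of diagnosis` comes from the notebook):

```python
import pandas as pd

categoricals = pd.DataFrame({
    'class of diagnosis': ['A', 'B', 'A', 'B'],
    'smoker': ['yes', 'no', 'no', 'yes'],
})

# build the groups once, outside the loop...
groups = categoricals.groupby(['class of diagnosis'])

# ...then reuse them for every column
for col in ['smoker']:
    counts = groups[col].value_counts()
    print(counts)
```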