Closed marta-sd closed 7 years ago
Minor:
[x] You shouldn't take absolute values of correlation coefficients - you lose information this way. The most common approach is to square R (R^2)
[x] Pearson R is not always the best, e.g. it can overestimate clustered groups of points. Check Spearman's R and Kendall's Tau
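For illustration, the three coefficients can be compared on a monotonic but non-linear relation (the data here is made up; the `scipy.stats` functions are the standard ones):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x ** 3 + rng.normal(scale=0.5, size=200)  # monotonic, but not linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

# R^2 keeps the strength of the relation without discarding information the
# way abs() does; the rank-based coefficients handle the non-linearity better
print(f'R^2 = {pearson_r ** 2:.3f}, Spearman = {spearman_r:.3f}, Kendall = {kendall_tau:.3f}')
```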
[x] All imports should be at the top of notebook.
[x] The headlines need some more comments, especially at the end: explain why you train a LogisticRegression model
[x] PEP8 (spaces and commas, assignment), e.g.

```python
conc = pd.concat([continuous, data['class of diagnosis']], axis=1)
for col in continuous:
    g=sns.FacetGrid(size=5,data=conc, col='class of diagnosis', hue='class of diagnosis')
    g.map(sns.distplot, col ,rug=True, kde=True)
```
[x] `encoded_column_names` should be created automatically from `categoricals_BIN_names`, `categoricals_non_BIN_names` and `categoricals_names`
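Sketched below - the list contents and the `_encoded` suffix are hypothetical, only the list names come from the notebook:

```python
# hypothetical contents - the real lists live in the notebook
categoricals_BIN_names = ['smoker', 'hypertension']
categoricals_non_BIN_names = ['blood group']
categoricals_names = categoricals_BIN_names + categoricals_non_BIN_names

# derive the encoded names instead of hard-coding them
encoded_column_names = [f'{name}_encoded' for name in categoricals_names]
print(encoded_column_names)
```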
[x] You don't need another `train_test_split` here - you already created `train` and `test` in the last section
[x] Fit the `scaler` on the training data only (`scaler.fit(train_continuous)` and then `scaler.transform(continuous)`)
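A minimal sketch of the fit-on-train-only pattern (the array values are made up; `scaler` and `train_continuous` reuse the notebook's names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up continuous features, already split into train and test
train_continuous = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_continuous = np.array([[2.5, 15.0]])

scaler = StandardScaler()
scaler.fit(train_continuous)                      # statistics come from train only
train_scaled = scaler.transform(train_continuous)
test_scaled = scaler.transform(test_continuous)   # same statistics, no refitting
```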
Major:
[x] Use `np.concatenate` or `np.hstack` to join `onehot_X`, `categorical_binary` and `scaled_cont`. Also never hard-code the shapes and indices - what if you get more data or decide to remove one of the variables? Use `onehot_X.shape[1]` etc. (but this time just stack the arrays and don't worry about the shape)
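The stacking advice above, sketched with made-up shapes (only the array names come from the notebook):

```python
import numpy as np

# made-up placeholders with the notebook's array names
onehot_X = np.zeros((100, 7))
categorical_binary = np.ones((100, 3))
scaled_cont = np.full((100, 4), 0.5)

# join column-wise without hard-coding any widths
X = np.hstack([onehot_X, categorical_binary, scaled_cont])

# if a width is ever needed later, read it from the array itself
n_onehot = onehot_X.shape[1]
print(X.shape, n_onehot)
```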
Minor:
[x] Don't call `categoricals.groupby(['class of diagnosis'])` in each iteration - you can create the groups once before the loop
[x] Move `plt.legend(title=col, loc='upper right', frameon=True)` to the end of the loop
[x] In `sns.barplot(x='class of diagnosis', y="percentage", hue=col, data=counts)`, use consistent quoting in the whole repo (I prefer single quotes)
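The groupby point can be sketched like this (the frame and its columns are made up; only `class of diagnosis` comes from the notebook):

```python
import pandas as pd

categoricals = pd.DataFrame({
    'class of diagnosis': ['A', 'B', 'A', 'B'],
    'smoker': ['yes', 'no', 'no', 'yes'],
})

# build the groups once, outside the loop...
groups = categoricals.groupby(['class of diagnosis'])

# ...then reuse them for every column
for col in ['smoker']:
    counts = groups[col].value_counts()
    print(counts)
```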