Use a proportional feature graph to show the differences between the feature comparison scores/rankings; this tells a better story about the importance/predictive quality of each feature.
Breaking up dplyr code in data cleaning with comments: does it make readability more difficult?
Use case_when instead of nested if_else statements.
Add a prediction approach/methods section to the final report that lays out the whole plan of attack for the analysis in a few quick bullet points.
Cross validation: we shouldn't use the test data in cross validation, only the training data, because CV is only used to pick a max depth. As it stands right now, our test data is influencing our fit because we used the test data depth score, not the training data score. The cross validation function splits the training data into train and validation groups internally, so we don't need to pass any test data into it. We need to remove the line marked REMOVE below, check whether that changes max depth, and pick max depth based on the training data CV score. Code is in script file 3:
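import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

depth_range = range(1, 10)  # candidate depths 1 through 9
train_cv = []
test_cv = []
for d in depth_range:
    model = DecisionTreeClassifier(max_depth=d)
    # 10-fold CV on the training data only
    train_cv.append(np.mean(cross_val_score(model, X_train, y_train, cv=10)))
    # REMOVE: test_cv.append(np.mean(cross_val_score(model, X_test, y_test, cv=10)))
max_cv = max(train_cv)
# index() is 0-based, so map the position back to the actual depth value
opt_d = depth_range[train_cv.index(max_cv)]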
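As a follow-up to the loop above, here is a minimal sketch of how the test data might be used once the depth has been chosen: it is scored a single time at the end and never enters the CV loop. The synthetic data from make_classification, the 75/25 split, and the random_state values are assumptions for illustration only; the project's own X_train/X_test would replace them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data; the project data would replace this.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

depth_range = range(1, 10)
train_cv = []
for d in depth_range:
    model = DecisionTreeClassifier(max_depth=d, random_state=0)
    # cross_val_score splits X_train into train/validation folds internally
    train_cv.append(np.mean(cross_val_score(model, X_train, y_train, cv=10)))

# Map the best position back to its depth (positions are 0-based)
opt_d = depth_range[int(np.argmax(train_cv))]

# Refit on all training data at the chosen depth, then score the test set once
final_model = DecisionTreeClassifier(max_depth=opt_d, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"chosen max_depth={opt_d}, test accuracy={test_accuracy:.3f}")
```

Because the test set is only touched in the last two lines, the reported accuracy is not influenced by the depth selection.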
Feedback from Tony