Dlux804 / McQuade-Chem-ML

Development of easy to use and reproducible ML scripts for chemistry.
5 stars 1 forks source link

Added sklearn classification pipeline to main workflow. #48

Closed dickeygh closed 4 years ago

dickeygh commented 4 years ago

Created new file called Classifiers.py, works similarly to regressors.py.

Added a new function called classification_metrics (line 316) to analysis.py. This function computes and returns several different evaluation metrics of the model's performance on the test set.

Added classification algorithms svc and kneighborsclassifier to the list of models that need feature normalization in features.py (line 23).

Changed main.py to run either the classification models or the regression models given a user input.

Changed line 17 of name.py to fit my directory.

Added a classification_run function to models.py that runs the classification models and prints evaluation results.

Changed ingest.py line 4 so that drop is not always=true. This is because some of the classification models run better on some data sets without dropping the extra columns. (I made drop=True when running regression models, see main.py line 64.)

Added sider data set to dataFiles.

Currently have 4 evaluation methods (accuracy score, confusion matrix, roc auc score, and classification report), 1 data set (sider), and 2 different models (SVC and Kneighborsclassifier). I plan on adding more of all of these.

pep8speaks commented 4 years ago

Hello @dickeygh! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 6:80: E501 line too long (128 > 79 characters) Line 314:1: E303 too many blank lines (3) Line 314:80: E501 line too long (115 > 79 characters) Line 315:1: E302 expected 2 blank lines, found 3 Line 322:56: W292 no newline at end of file

Line 5:1: E302 expected 2 blank lines, found 1

Line 4:1: E302 expected 2 blank lines, found 1

Line 5:80: E501 line too long (80 > 79 characters) Line 116:9: E265 block comment should start with '# ' Line 161:5: E303 too many blank lines (2) Line 165:80: E501 line too long (158 > 79 characters) Line 165:93: E231 missing whitespace after ',' Line 165:106: E261 at least two spaces before inline comment Line 171:80: E501 line too long (168 > 79 characters) Line 171:120: E261 at least two spaces before inline comment

Line 15:80: E501 line too long (90 > 79 characters) Line 25:80: E501 line too long (86 > 79 characters) Line 27:80: E501 line too long (91 > 79 characters) Line 38:80: E501 line too long (86 > 79 characters) Line 40:80: E501 line too long (87 > 79 characters) Line 49:24: E261 at least two spaces before inline comment Line 51:78: E261 at least two spaces before inline comment Line 51:80: E501 line too long (183 > 79 characters) Line 53:80: E501 line too long (183 > 79 characters) Line 53:91: E261 at least two spaces before inline comment Line 58:80: E501 line too long (80 > 79 characters) Line 64:80: E501 line too long (81 > 79 characters) Line 81:17: E303 too many blank lines (3) Line 81:80: E501 line too long (84 > 79 characters) Line 87:80: E501 line too long (81 > 79 characters) Line 88:80: E262 inline comment should start with '# ' Line 88:80: E501 line too long (158 > 79 characters)

Comment last updated at 2020-06-03 20:58:34 UTC
qle2 commented 4 years ago

It's well put together overall. Just some minor naming convention stuff (This guide is pretty helpful: https://www.python.org/dev/peps/pep-0008/#naming-conventions). I like how well documented your functions are scripts are.

dickeygh commented 4 years ago

I made the following changes to this pull request on 6/3: Moved importing of classification evaluation metrics to the top of the analysis.py file.

Added an extra classifier (RandomForestClassifier) to classifiers.py. This also resulted in small changes on line 23 of features.py and line 21 of main.py.

Changed classifiers.py to be lowercase.

Changed the variable "c" to be lowercase in main.py. Changed the input statement to prompt for c and r instead of classification and regression. (Line 16)

An extra note about the circleci tests: The first set of commits failed the circleci test because classifiers.py was still uppercase on GitHub (pushing filenames to GitHub doesn't seem to be case sensitive). After manually changing it to be lowercase on GitHub, the circleci test was passed. Changed line 17 of name.py to work across different machines.