Dlux804 / McQuade-Chem-ML

Development of easy to use and reproducible ML scripts for chemistry.

Main.py cleanup, storage for classification workflow, graphical evaluation for single-label classification #76

Closed dickeygh closed 4 years ago

dickeygh commented 4 years ago

Get_Classification.py:

  1. Added elif and an extra else so that the workflow can run for regression and classification with the new main.py.

Analysis.py:

  1. Added plt.show() at line 59 to reset the matplotlib plot, so that the single-label classification workflow can plot its graph successfully when also using impgraph().

Classifiers.py:

  1. Changed 'rf' to 'rfc'

Features.py:

  1. Added declaration of self.selected_feat_string (line 32) to use in train.py.

Models.py:

  1. Changed multi_label_classification_datasets to be an instance attribute (accessed through self).
  2. Added if statement at line 117 so that the workflow in main.py could be simplified.

Storage.py:

  1. Added extra if statement at line 81 so that storage can run properly for multi-label classification.

Train.py:

  1. Added new function to create roc_curve (lines 177-185)
  2. Added lines at 134-141 that create and save a roc graph for single-label classification.
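For readers unfamiliar with what a ROC graph plots, here is a minimal pure-Python sketch of how the curve's points and the area under it can be computed for a single-label (binary) problem. It is not the function added in train.py; the names roc_points and auc are hypothetical, and the labels are assumed to be 0/1 with higher scores meaning "more likely positive".

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists for binary 0/1 labels and real-valued scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative")

    # Walk through the samples from the highest score to the lowest,
    # lowering the decision threshold one score value at a time.
    ranked = sorted(zip(y_score, y_true), key=lambda p: p[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
area = auc(fpr, tpr)
```

The resulting fpr and tpr lists are what would be passed to a plotting call to draw the curve.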

Main.py:

  1. Major changes were made to main.py in this PR. They allow the entire workflow (classification and regression) to be run together without prompting the user to select a model type. The workflow has been tested extensively, and both the regression and classification models seem to be running fine. There are several commented-out blocks of code (lines 23-29, 45-60, 64) that can be used to run specific models/data sets from either the regression or classification workflow.

Overall, this PR adds storage for all classification models and a graphical evaluation for single-label classification models. In addition, the PR simplifies main.py by removing if statements and allowing more continuity between the regression and classification workflows.

pep8speaks commented 4 years ago

Hello @dickeygh! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1:80: E501 line too long (106 > 79 characters)
Line 3:80: E501 line too long (87 > 79 characters)
Line 5:80: E501 line too long (107 > 79 characters)
Line 8:80: E501 line too long (97 > 79 characters)
Line 10:1: E302 expected 2 blank lines, found 0
Line 11:80: E501 line too long (119 > 79 characters)
Line 14:80: E501 line too long (88 > 79 characters)
Line 15:80: E501 line too long (102 > 79 characters)
Line 18:80: E501 line too long (89 > 79 characters)
Line 19:80: E501 line too long (121 > 79 characters)
Line 22:10: E261 at least two spaces before inline comment
Line 25:80: E501 line too long (91 > 79 characters)
Line 26:80: E501 line too long (112 > 79 characters)
Line 33:80: E501 line too long (107 > 79 characters)
Line 36:30: W292 no newline at end of file

Line 231:80: E501 line too long (82 > 79 characters)
Line 240:9: E303 too many blank lines (2)
Line 240:80: E501 line too long (100 > 79 characters)
Line 250:80: E501 line too long (105 > 79 characters)
Line 253:80: E501 line too long (87 > 79 characters)
Line 265:80: E501 line too long (84 > 79 characters)
Line 270:28: W292 no newline at end of file

Line 33:56: E261 at least two spaces before inline comment
Line 33:80: E501 line too long (158 > 79 characters)

Line 100:80: E501 line too long (91 > 79 characters)
Line 114:80: E501 line too long (91 > 79 characters)
Line 119:17: E117 over-indented (comment)
Line 128:80: E501 line too long (91 > 79 characters)

Line 68:80: E501 line too long (103 > 79 characters)

Line 76:80: E501 line too long (112 > 79 characters)
Line 81:80: E501 line too long (112 > 79 characters)

Line 36:80: E501 line too long (89 > 79 characters)
Line 106:59: E261 at least two spaces before inline comment
Line 106:80: E501 line too long (186 > 79 characters)
Line 112:80: E501 line too long (83 > 79 characters)
Line 126:13: E303 too many blank lines (2)
Line 126:80: E501 line too long (85 > 79 characters)
Line 135:13: E303 too many blank lines (2)
Line 135:80: E501 line too long (93 > 79 characters)
Line 184:80: E501 line too long (99 > 79 characters)

Line 26:80: E501 line too long (80 > 79 characters)
Line 27:5: E265 block comment should start with '# '
Line 39:10: E131 continuation line unaligned for hanging indent
Line 49:9: E116 unexpected indentation (comment)
Line 50:9: E116 unexpected indentation (comment)
Line 51:9: E116 unexpected indentation (comment)
Line 63:24: E261 at least two spaces before inline comment
Line 64:14: E225 missing whitespace around operator
Line 64:23: E231 missing whitespace after ','
Line 65:19: E127 continuation line over-indented for visual indent
Line 65:28: E261 at least two spaces before inline comment
Line 70:80: E501 line too long (150 > 79 characters)
Line 71:80: E501 line too long (80 > 79 characters)
Line 72:80: E501 line too long (80 > 79 characters)
Line 74:80: E501 line too long (119 > 79 characters)
Line 75:58: E231 missing whitespace after ','
Line 89:80: E501 line too long (108 > 79 characters)
Line 101:80: E501 line too long (88 > 79 characters)
Line 203:7: E111 indentation is not a multiple of four
Line 203:7: E117 over-indented

Comment last updated at 2020-08-05 21:04:37 UTC

dickeygh commented 4 years ago

I just pushed several changes to this PR that address many of the issues with the previous iteration. The Get_Task_Type file was added to clean up main.py by moving its if statements into a separate file. The file lets main.py skip over algorithm/data set combinations that are not compatible.

I was unable to accomplish this in a different manner, which is why I created this new file.
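As a rough illustration of the idea (not the actual contents of Get_Task_Type.py), the compatibility check could be sketched as a lookup that returns a task type for a valid pair and None otherwise. The tables, data set file names, and function names below are hypothetical, as is the assumption about which algorithm supports which task.

```python
from itertools import product

# Hypothetical table: the tasks each algorithm alias is assumed to support.
ALGORITHM_TASKS = {
    "rf": {"regression"},
    "rfc": {"single_label_classification", "multi_label_classification"},
    "knn": {"regression", "single_label_classification",
            "multi_label_classification"},
    "svm": {"regression", "single_label_classification"},
}

# Hypothetical table: the task each data set poses.
DATASET_TASK = {
    "example_regression.csv": "regression",
    "example_single_label.csv": "single_label_classification",
    "example_multi_label.csv": "multi_label_classification",
}


def get_task_type(algorithm, dataset):
    """Return the task type for a compatible pair, or None otherwise."""
    task = DATASET_TASK.get(dataset)
    if task is not None and task in ALGORITHM_TASKS.get(algorithm, set()):
        return task
    return None


def compatible_runs(algorithms, datasets):
    """Yield (algorithm, dataset, task) for every compatible combination."""
    for algorithm, dataset in product(algorithms, datasets):
        task = get_task_type(algorithm, dataset)
        if task is None:
            continue  # incompatible combination: skip it
        yield algorithm, dataset, task


runs = list(compatible_runs(
    ["rf", "rfc"],
    ["example_regression.csv", "example_multi_label.csv"],
))
```

With this shape, the main loop only needs a single `if task is None: continue` rather than a chain of if statements.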

Furthermore, I was unable to put lists of targets directly into the dictionary for classification, as this caused errors I was unable to solve later in the workflow. As a result, Get_Classification.py is still being used in order to get the target columns for classification data sets.

These changes also included changing self.regressor to self.estimator, moving the graphing functions for classification from train.py to analysis.py, and improving upon the declaration of self.task_type in models.py, as well as using self.task_type instead of searching through lists of datasets in several parts of the workflow.

These changes also allowed several TODOs to be removed from the code.

Specific changes can be seen within the above commits.

dickeygh commented 4 years ago

I just pushed several commits to this PR:

Analysis.py:

  1. I added comments explaining the purpose of the new graphing functions.
  2. I changed lines 250-256 to improve the formatting of the generated graph.

Get_Task_Type.py:

  1. Added comments explaining the purpose of each if statement, the checker variable, and the function as a whole.
  2. Updated this file to reflect 'knc' being renamed to 'knn', consistent with the regression workflow.
  3. Adjusted this file to match 'svr/svc' being changed to 'svm'.
  4. Added extra if statements and task_type variable that is used in main.py to print out the ML task being performed.

Classifiers.py:

  1. Changed 'knc' to 'knn' to match regression workflow.
  2. Changed 'svc' to 'svm' to match new svm workflow.

Grid.py:

  1. Changed line 259 to be 'svm' to match other changes.

Name.py:

  1. Added 'svm' to algorithm_list (line 68)

Regressors.py:

  1. Changed 'svr' to 'svm' to match new workflow.
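One way to picture the shared 'svm' and 'knn' aliases is a lookup keyed on both the task type and the alias, so the same short name resolves to a different estimator in each workflow. This is only a sketch under that assumption: the ESTIMATORS table and resolve_estimator are hypothetical, and the estimator names are plain strings (scikit-learn class names) rather than the objects the repository actually builds.

```python
# Hypothetical lookup: (task type, alias) -> estimator class name.
ESTIMATORS = {
    ("regression", "svm"): "SVR",
    ("classification", "svm"): "SVC",
    ("regression", "knn"): "KNeighborsRegressor",
    ("classification", "knn"): "KNeighborsClassifier",
}


def resolve_estimator(task_type, algorithm):
    """Return the estimator name registered for this task and alias."""
    try:
        return ESTIMATORS[(task_type, algorithm)]
    except KeyError:
        raise ValueError(
            f"No {task_type} estimator registered for {algorithm!r}"
        ) from None


regression_svm = resolve_estimator("regression", "svm")
classification_svm = resolve_estimator("classification", "svm")
```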

Main.py:

  1. Added a call that retrieves task_type (line 75) and used the result at line 84.

dickeygh commented 4 years ago

Just pushed a commit to this PR:

analysis.py:

  1. Removed name variable, as it is not needed for saving the graphs.

name.py:

  1. Removed outdated names from algorithm_list (line 68)