Dlux804 / McQuade-Chem-ML

Development of easy to use and reproducible ML scripts for chemistry.

Json neo4j #72

Closed qle2 closed 4 years ago

qle2 commented 4 years ago

This PR contains all the files I created to import data into Neo4j, either directly from our core pipeline or from files in the zip folders located in output, based on our current ontology (which is subject to change later on). There are a few things from the ontology that I haven't included yet, so please let me know if you find anything major missing. I also slightly modified some of our core files for my own needs, so if I broke anything, please let me know.

[Image: Screenshot_2020-07-15 Arrow Tool Current Ontology]

Above is the picture of our current ontology.

Steps for test runs:

1. Open Neo4j Desktop and start up a local graph database.
2. Set your password to 1234. This is just how I set up my password for quick testing. We can come up with a universal password for all of our graphs later on. I think there's a way to disable passwords, but I don't know how.
3. Go to plug-ins and install APOC.
4. For test runs:
   - If you want to create graphs directly from the pipeline, run main.py. It can be run from main and from example_model. I suggest you create a small CSV file for a quick run and review instead of waiting for 1000+ SMILES and an even larger number of features. Once this PR is done, you can run it on the whole file and I'll address any problems that occur when running on large CSVs.
   - If you want to create graphs from your zip files, go to output_to_neo4j.py, uncomment the last line and run it. Again, I suggest you use a small working example for the reasons mentioned above.
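The suggestion above to test on a small CSV can be sketched as follows. This is only an illustration, not part of the PR: `make_sample_csv` is a hypothetical helper, and the file paths in the usage note are placeholders. It copies the header and the first few data rows of a dataset into a new file using only the standard library.

```python
import csv


def make_sample_csv(source, destination, n_rows=20):
    """Copy the header and the first n_rows data rows of source to destination.

    Returns the number of data rows written (hypothetical helper, not in the PR).
    """
    with open(source, newline="") as src, open(destination, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)

        header = next(reader, None)
        if header is None:
            # Empty input file: nothing to copy.
            return 0
        writer.writerow(header)

        written = 0
        for row in reader:
            if written >= n_rows:
                break
            writer.writerow(row)
            written += 1
    return written
```

For example, `make_sample_csv("water-energy.csv", "water-energy-small.csv", n_rows=20)` would produce a 20-row file that can be passed as the dataset for a quick run; the output file name here is made up, and the paths depend on where the datasets live in your checkout.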

In the files I created, I included documentation that explains: the objective of the script, the objective of the function and intent behind the function for each file.

Also @dickeygh, while I was designing this script, I only tested it on regression models. Please run classification models on CSVs that don't have any irregular SMILES and let me know if anything breaks.

If you have any questions or concerns, I'll be happy to answer them.

pep8speaks commented 4 years ago

Hello @qle2! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 222:80: E501 line too long (83 > 79 characters) Line 222:84: W292 no newline at end of file

Line 2:80: E501 line too long (92 > 79 characters) Line 20:80: E501 line too long (109 > 79 characters) Line 30:80: E501 line too long (116 > 79 characters) Line 31:80: E501 line too long (114 > 79 characters) Line 32:80: E501 line too long (118 > 79 characters) Line 36:80: E501 line too long (105 > 79 characters) Line 41:80: E501 line too long (80 > 79 characters) Line 42:80: E501 line too long (112 > 79 characters) Line 43:80: E501 line too long (118 > 79 characters) Line 44:80: E501 line too long (108 > 79 characters) Line 45:80: E501 line too long (117 > 79 characters) Line 60:80: E501 line too long (91 > 79 characters) Line 62:80: E501 line too long (103 > 79 characters) Line 63:80: E501 line too long (86 > 79 characters) Line 70:80: E501 line too long (87 > 79 characters) Line 75:68: W292 no newline at end of file

Line 126:79: W292 no newline at end of file

Line 84:80: E501 line too long (86 > 79 characters) Line 85:80: E501 line too long (90 > 79 characters) Line 89:80: E501 line too long (83 > 79 characters) Line 90:80: E501 line too long (84 > 79 characters) Line 93:80: E501 line too long (122 > 79 characters) Line 96:80: E501 line too long (88 > 79 characters) Line 98:80: E501 line too long (95 > 79 characters) Line 111:20: W292 no newline at end of file

Line 2:80: E501 line too long (111 > 79 characters) Line 20:80: E501 line too long (84 > 79 characters) Line 21:80: E501 line too long (111 > 79 characters) Line 22:80: E501 line too long (112 > 79 characters) Line 28:80: E501 line too long (85 > 79 characters) Line 32:80: E501 line too long (100 > 79 characters) Line 33:80: E501 line too long (91 > 79 characters) Line 39:80: E501 line too long (113 > 79 characters) Line 51:80: E501 line too long (93 > 79 characters) Line 54:1: W293 blank line contains whitespace Line 60:80: E501 line too long (100 > 79 characters) Line 69:80: E501 line too long (83 > 79 characters) Line 70:80: E501 line too long (88 > 79 characters) Line 72:80: E501 line too long (105 > 79 characters) Line 73:80: E501 line too long (119 > 79 characters) Line 74:80: E501 line too long (119 > 79 characters) Line 75:80: E501 line too long (118 > 79 characters) Line 77:80: E501 line too long (103 > 79 characters) Line 78:80: E501 line too long (114 > 79 characters) Line 87:80: E501 line too long (88 > 79 characters) Line 90:80: E501 line too long (90 > 79 characters) Line 92:1: W293 blank line contains whitespace Line 95:80: E501 line too long (98 > 79 characters) Line 102:80: E501 line too long (116 > 79 characters) Line 107:80: E501 line too long (104 > 79 characters) Line 112:80: E501 line too long (103 > 79 characters) Line 115:80: E501 line too long (96 > 79 characters) Line 116:80: E501 line too long (89 > 79 characters) Line 117:80: E501 line too long (86 > 79 characters) Line 122:80: E501 line too long (94 > 79 characters) Line 127:80: E501 line too long (101 > 79 characters) Line 133:80: E501 line too long (85 > 79 characters) Line 138:80: E501 line too long (89 > 79 characters) Line 146:80: E501 line too long (106 > 79 characters) Line 156:80: E501 line too long (110 > 79 characters) Line 157:80: E501 line too long (85 > 79 characters) Line 158:80: E501 line too long (117 > 79 characters) Line 163:80: E501 line too long (102 > 79 characters) 
Line 169:80: E501 line too long (115 > 79 characters) Line 179:80: E501 line too long (116 > 79 characters) Line 181:1: W391 blank line at end of file

Line 2:80: E501 line too long (115 > 79 characters) Line 20:80: E501 line too long (117 > 79 characters) Line 21:80: E501 line too long (119 > 79 characters) Line 22:80: E501 line too long (88 > 79 characters) Line 42:80: E501 line too long (115 > 79 characters) Line 43:80: E501 line too long (118 > 79 characters) Line 44:80: E501 line too long (120 > 79 characters) Line 45:80: E501 line too long (107 > 79 characters) Line 56:80: E501 line too long (89 > 79 characters) Line 57:80: E501 line too long (85 > 79 characters) Line 58:80: E501 line too long (89 > 79 characters) Line 59:80: E501 line too long (110 > 79 characters) Line 60:80: E501 line too long (120 > 79 characters) Line 62:80: E501 line too long (82 > 79 characters) Line 63:80: E501 line too long (99 > 79 characters) Line 65:80: E501 line too long (89 > 79 characters) Line 66:80: E501 line too long (106 > 79 characters) Line 73:80: E501 line too long (118 > 79 characters) Line 79:80: E501 line too long (87 > 79 characters) Line 86:80: E501 line too long (92 > 79 characters) Line 89:80: E501 line too long (112 > 79 characters)

Line 2:80: E501 line too long (104 > 79 characters) Line 16:80: E501 line too long (87 > 79 characters) Line 22:80: E501 line too long (106 > 79 characters) Line 23:80: E501 line too long (117 > 79 characters) Line 24:80: E501 line too long (109 > 79 characters) Line 26:80: E501 line too long (101 > 79 characters) Line 27:80: E501 line too long (116 > 79 characters) Line 28:80: E501 line too long (115 > 79 characters) Line 29:80: E501 line too long (120 > 79 characters) Line 30:80: E501 line too long (117 > 79 characters) Line 35:80: E501 line too long (109 > 79 characters) Line 36:80: E501 line too long (123 > 79 characters) Line 40:80: E501 line too long (80 > 79 characters) Line 46:80: E501 line too long (81 > 79 characters) Line 51:80: E501 line too long (81 > 79 characters) Line 56:80: E501 line too long (88 > 79 characters) Line 62:80: E501 line too long (83 > 79 characters) Line 63:80: E501 line too long (94 > 79 characters) Line 64:80: E501 line too long (92 > 79 characters) Line 66:80: E501 line too long (109 > 79 characters) Line 67:80: E501 line too long (100 > 79 characters) Line 72:80: E501 line too long (86 > 79 characters) Line 93:41: W292 no newline at end of file

Line 28:80: E501 line too long (90 > 79 characters)

Line 2:80: E501 line too long (119 > 79 characters) Line 17:80: E501 line too long (94 > 79 characters) Line 18:80: E501 line too long (116 > 79 characters) Line 19:80: E501 line too long (114 > 79 characters) Line 20:80: E501 line too long (113 > 79 characters) Line 21:80: E501 line too long (114 > 79 characters) Line 28:80: E501 line too long (117 > 79 characters) Line 30:80: E501 line too long (80 > 79 characters) Line 31:1: W293 blank line contains whitespace Line 33:80: E501 line too long (90 > 79 characters) Line 35:80: E501 line too long (80 > 79 characters) Line 38:80: E501 line too long (84 > 79 characters) Line 40:80: E501 line too long (80 > 79 characters) Line 41:1: W293 blank line contains whitespace Line 45:80: E501 line too long (85 > 79 characters) Line 45:86: W291 trailing whitespace Line 46:80: E501 line too long (119 > 79 characters) Line 49:80: E501 line too long (111 > 79 characters) Line 50:80: E501 line too long (91 > 79 characters) Line 54:80: E501 line too long (98 > 79 characters) Line 56:80: E501 line too long (90 > 79 characters) Line 63:80: E501 line too long (103 > 79 characters) Line 64:80: E501 line too long (100 > 79 characters) Line 65:80: E501 line too long (103 > 79 characters) Line 67:80: E501 line too long (99 > 79 characters) Line 69:80: E501 line too long (87 > 79 characters) Line 73:80: E501 line too long (89 > 79 characters) Line 73:90: W291 trailing whitespace Line 75:80: E501 line too long (85 > 79 characters) Line 76:80: E501 line too long (110 > 79 characters) Line 79:80: E501 line too long (115 > 79 characters) Line 80:80: E501 line too long (111 > 79 characters) Line 81:80: E501 line too long (87 > 79 characters) Line 82:10: E261 at least two spaces before inline comment Line 83:80: E501 line too long (87 > 79 characters) Line 87:80: E501 line too long (93 > 79 characters) Line 92:80: E501 line too long (104 > 79 characters) Line 94:80: E501 line too long (90 > 79 characters) Line 97:80: E501 line too long (103 > 79 
characters) Line 97:104: W291 trailing whitespace Line 98:80: E501 line too long (118 > 79 characters) Line 99:80: E501 line too long (98 > 79 characters) Line 100:80: E501 line too long (87 > 79 characters) Line 103:80: E501 line too long (99 > 79 characters) Line 103:100: W291 trailing whitespace Line 104:80: E501 line too long (88 > 79 characters) Line 105:80: E501 line too long (118 > 79 characters) Line 108:80: E501 line too long (112 > 79 characters) Line 110:80: E501 line too long (94 > 79 characters) Line 115:80: E501 line too long (103 > 79 characters) Line 115:104: W291 trailing whitespace Line 117:80: E501 line too long (90 > 79 characters) Line 120:80: E501 line too long (103 > 79 characters) Line 120:104: W291 trailing whitespace Line 121:80: E501 line too long (98 > 79 characters) Line 121:99: W291 trailing whitespace Line 122:80: E501 line too long (89 > 79 characters) Line 123:80: E501 line too long (85 > 79 characters) Line 124:80: E501 line too long (98 > 79 characters) Line 125:80: E501 line too long (97 > 79 characters) Line 130:80: E501 line too long (96 > 79 characters) Line 130:97: W291 trailing whitespace Line 131:80: E501 line too long (90 > 79 characters) Line 133:31: E231 missing whitespace after ':' Line 133:80: E501 line too long (108 > 79 characters) Line 139:80: E501 line too long (103 > 79 characters) Line 139:104: W291 trailing whitespace Line 142:20: E127 continuation line over-indented for visual indent Line 142:80: E501 line too long (105 > 79 characters) Line 143:31: E128 continuation line under-indented for visual indent Line 148:80: E501 line too long (116 > 79 characters) Line 148:117: W291 trailing whitespace Line 149:80: E501 line too long (121 > 79 characters) Line 150:80: E501 line too long (94 > 79 characters) Line 154:80: E501 line too long (107 > 79 characters) Line 154:108: W291 trailing whitespace Line 155:80: E501 line too long (113 > 79 characters) Line 155:114: W291 trailing whitespace Line 157:80: E501 line too 
long (102 > 79 characters) Line 158:80: E501 line too long (88 > 79 characters) Line 161:80: E501 line too long (94 > 79 characters) Line 161:95: W291 trailing whitespace Line 162:80: E501 line too long (94 > 79 characters) Line 163:80: E501 line too long (115 > 79 characters) Line 168:80: E501 line too long (100 > 79 characters) Line 168:101: W291 trailing whitespace Line 169:80: E501 line too long (104 > 79 characters) Line 170:80: E501 line too long (94 > 79 characters) Line 174:80: E501 line too long (94 > 79 characters) Line 174:95: W291 trailing whitespace Line 177:59: E231 missing whitespace after ':' Line 177:80: E501 line too long (106 > 79 characters) Line 181:77: W292 no newline at end of file

Line 31:80: E501 line too long (80 > 79 characters)

Line 80:80: E501 line too long (80 > 79 characters) Line 93:80: E501 line too long (85 > 79 characters) Line 153:80: E501 line too long (106 > 79 characters) Line 198:6: E114 indentation is not a multiple of four (comment) Line 198:6: E117 over-indented (comment)

Comment last updated at 2020-07-18 13:48:15 UTC

andreshyer commented 4 years ago

Also, would it be possible to put all the Neo4j stuff into a separate directory, outside of core? It would likely help with clutter, so we don't have one directory with all the files.

andreshyer commented 4 years ago

Also, hey Quang, I ran into a pretty serious bug. I was running the model model1 = models.MlModel(algorithm='ada', dataset='water-energy.csv', target='expt', feat_meth=[4], tune=True, cv=2, opt_iter=2) and the following error came up:

KeyError: 'qed' Bayesian Parameter Optimization: 100%|██████████| 2/2 [00:38<00:00, 19.42s/it]

Following the error leads to df_rdkit2d_features = self.data.loc[:, 'smiles':'qed']

The script expects the feature methods to include rdkit2d featurization, because it works fine if I use the model model1 = models.MlModel(algorithm='ada', dataset='water-energy.csv', target='expt', feat_meth=[0, 4], tune=True, cv=2, opt_iter=2)

So simply changing feat_meth from [4] to [0, 4] resolves the issue. I don't think this should be that hard of a bug to fix.
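One possible way to guard against this, sketched here as an assumption rather than the fix that was actually applied: check that both boundary columns exist before slicing, and fall back to an empty selection when rdkit2d features were not generated. `columns_between` is a hypothetical helper; only the 'smiles' and 'qed' column names come from the failing line.

```python
def columns_between(columns, start, end):
    """Return the labels from start to end (inclusive), or [] if either is missing.

    Hypothetical helper: avoids the KeyError raised by a label slice such as
    data.loc[:, 'smiles':'qed'] when the 'qed' column does not exist.
    """
    columns = list(columns)
    if start not in columns or end not in columns:
        return []
    first = columns.index(start)
    last = columns.index(end)
    if first > last:
        # Boundaries are in the wrong order, so there is no valid range.
        return []
    return columns[first:last + 1]


# Sketch of how prep() in nodes_to_neo4j.py might use it (untested assumption):
#   cols = columns_between(self.data.columns, 'smiles', 'qed')
#   df_rdkit2d_features = self.data.loc[:, cols]
```

With this approach a run with feat_meth=[4] would get an empty feature frame instead of crashing, so the code that creates the rdkit2d nodes would still need to handle that empty case.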

Full error:

`Traceback (most recent call last):
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4845, in get_slice_bound
    return self._searchsorted_monotonic(label, side)
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4806, in _searchsorted_monotonic
    raise ValueError("index must be monotonic increasing or decreasing")
ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/PycharmProjects/McQuade-Chem-ML/main.py", line 199, in <module>
    single_model()
  File "/home/user/PycharmProjects/McQuade-Chem-ML/main.py", line 173, in single_model
    model1.to_neo4j()
  File "/home/user/PycharmProjects/McQuade-Chem-ML/core/models.py", line 123, in to_neo4j
    nodes(self)  # Create nodes
  File "/home/user/PycharmProjects/McQuade-Chem-ML/core/nodes_to_neo4j.py", line 87, in nodes
    r2, mse, rmse, canonical_smiles, df_smiles, df_rdkit2d_features, test_mol_dict = prep(self)
  File "/home/user/PycharmProjects/McQuade-Chem-ML/core/nodes_to_neo4j.py", line 32, in prep
    df_rdkit2d_features = self.data.loc[:, 'smiles':'qed']
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexing.py", line 1289, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexing.py", line 1912, in _getitem_axis
    return self._get_slice_axis(key, axis=axis)
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexing.py", line 1797, in _get_slice_axis
    slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4713, in slice_indexer
    start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4932, in slice_locs
    end_slice = self.get_slice_bound(end, "right", kind)
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4848, in get_slice_bound
    raise err
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4842, in get_slice_bound
    slc = self.get_loc(label)
  File "/home/user/miniconda3/envs/mlapp/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'qed'
Bayesian Parameter Optimization: 100%|██████████| 2/2 [00:38<00:00, 19.42s/it]`