choderalab / fah-xchem

Tools and infrastructure for automated compound discovery using Folding@home
MIT License

Jupyter notebook: predicting the energy of a sprint's compounds before they are simulated #78

Open MyGithubNotYours opened 3 years ago

MyGithubNotYours commented 3 years ago

Hi guys

@jchodera referred me here. I told him that I think it might be possible to predict energies before compounds are simulated, which could be used to prioritize the order in which compounds are simulated; in other words, to get good results sooner (hopefully). I'm here to share a proof of concept based on a simple, unsophisticated first effort. If I'm interpreting the results correctly, the regression MSE is ~3 and the classification accuracy for flagging bad compounds is ~70%. With better models, I'm sure the results could be improved.

Here's a Jupyter notebook of my findings: https://github.com/MyGithubNotYours/FAH_stuff/blob/master/FAH-rank-molecules.ipynb

Here's a datafile containing some data from sprint 4: https://github.com/MyGithubNotYours/FAH_stuff/blob/master/FAH_results_s4.csv

If you enter the location of the data file at the beginning of the notebook, the rest of the notebook should run without you needing to change anything. The only choices you might want to make are between regression mode and classification mode, and which featurizer to use. In the notebook, I've created four versions of the training data, each generated with a different featurizer; a rough sketch of that step follows.
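
For concreteness, here's roughly what the featurization step looks like (a minimal sketch, not the exact notebook code; the column names `SMILES` and `binding_energy` are placeholders for whatever the data file actually uses):

    import deepchem as dc
    import pandas as pd

    # Load the sprint results table (column names here are placeholders).
    df = pd.read_csv("FAH_results_s4.csv")
    smiles = df["SMILES"].tolist()
    targets = df["binding_energy"].to_numpy()

    # Two of the featurizer options: circular (Morgan) fingerprints for
    # classical models, and graph features for GraphConvModel.
    # Recent DeepChem featurizers accept SMILES strings directly.
    X_fp = dc.feat.CircularFingerprint(size=2048).featurize(smiles)
    X_graph = dc.feat.ConvMolFeaturizer().featurize(smiles)

    # Wrap features and targets in a DeepChem dataset for training.
    train_dataset = dc.data.NumpyDataset(X=X_graph, y=targets)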

Questions:

  1. What do you think is the best way to featurize Mpro or its binding pockets/sites? Predictive ability might improve if such features could be included as input alongside the features of the simulated compounds (together with an indication of which pocket each compound binds to).
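
     As one purely hypothetical illustration (the pocket labels and counts below are made up), pocket identity could be appended to each compound's feature vector as a one-hot indicator:

          import numpy as np

          # Hypothetical: each compound annotated with one of three Mpro
          # pockets it binds to, encoded as an integer in [0, 3).
          n_pockets = 3
          pocket_ids = np.array([0, 2, 1])    # one (made-up) label per compound
          X_ligand = np.random.rand(3, 2048)  # e.g. circular fingerprints

          # One-hot encode the pocket and concatenate with ligand features.
          pocket_onehot = np.eye(n_pockets)[pocket_ids]
          X_combined = np.hstack([X_ligand, pocket_onehot])  # shape (3, 2051)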

  2. DeepChem's GraphConvModel uses a lot of GPU memory. I used a training loop from this DeepChem tutorial, and GPU memory usage grows after every iteration; eventually, after enough iterations, training crashes with a memory-exhaustion error. Any ideas on what I'm doing incorrectly? https://github.com/deepchem/deepchem/blob/master/examples/tutorials/04_Introduction_to_Graph_Convolutions.ipynb Here's the loop in question:

          # Train one epoch at a time so the per-epoch loss can be recorded.
          num_epochs = 10
          losses = []
          for i in range(num_epochs):
              loss = model.fit_generator(data_generator(train_dataset))
              print("Epoch %d loss: %f" % (i, loss))
              losses.append(loss)
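
     (As an aside, DeepChem models also accept an epoch count in a single fit call, which avoids re-entering fit_generator on every iteration; a sketch, assuming model and train_dataset are the same objects as above:)

          # Train for all epochs in one call; fit returns an average loss
          # rather than one loss per epoch.
          avg_loss = model.fit(train_dataset, nb_epoch=num_epochs)
          print("Average loss over %d epochs: %f" % (num_epochs, avg_loss))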

EDIT 2020-12-06: The DeepChem issue went away when I upgraded to DeepChem 2.2 and TensorFlow 1.12.

Let me know what you think!

mcwitt commented 3 years ago

Whoa, this looks awesome, thanks for sharing @MyGithubNotYours! :smiley:

One quick suggestion from first skim: you can use the analysis.json file linked on the dashboard to avoid needing to parse HTML (I'm assuming that's what you did here?), and get access to all of the data coming from the analysis (including some raw simulation data).
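
Something like this should work (a sketch; the URL below is a placeholder for the analysis.json link on the dashboard, and the exact schema may differ):

    import json
    import urllib.request

    # Placeholder URL: substitute the analysis.json link from the dashboard.
    url = "https://example.org/sprint-4/analysis.json"
    with urllib.request.urlopen(url) as f:
        analysis = json.load(f)

    # Inspect the top-level keys first, since the exact layout may vary.
    print(list(analysis.keys()))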

MyGithubNotYours commented 3 years ago

@mcwitt ahh 'analysis.json' was sitting there under my nose this whole time! Oops haha thanks for pointing that out to me.

No, I didn't parse HTML. I used complete brute force:

  1. I highlighted the table,
  2. right-clicked the highlighted table,
  3. clicked 'copy', and
  4. pasted into a text editor and saved the file.

Luckily, copying and pasting preserved the table formatting, so pandas had no problem reading it as a CSV without any extra work from me.
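
In other words, nothing fancier than this (the filename is the one linked above; depending on how the paste came out, `sep="\t"` might be needed):

    import pandas as pd

    # The pasted table loads directly into a DataFrame.
    df = pd.read_csv("FAH_results_s4.csv")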
