🦠 Model Request: ImageMol - Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework

DhanshreeA commented 1 year ago

Model Name

ImageMol

Model Description

A representation learning framework that encodes molecules, rendered as images, into machine-readable vectors for downstream tasks such as bioactivity prediction, drug metabolism analysis, and drug toxicity prediction. The approach uses transfer learning: the model is pre-trained on massive unlabeled datasets so that it learns general-purpose feature extraction, and is then fine-tuned on specific tasks.

Slug

image-mol

Publication

Original Paper: https://www.nature.com/articles/s42256-022-00557-6

Supplementary Materials: https://static-content.springer.com/esm/art%3A10.1038%2Fs42256-022-00557-6/MediaObjects/42256_2022_557_MOESM1_ESM.pdf

Code

https://github.com/HongxinXiang/ImageMol

License

MIT License

DhanshreeA commented 1 year ago

Updates: I am currently trying to reproduce the ImageMol results using the pre-trained models the authors have provided for eight different benchmark datasets. While setting up an environment as per the instructions, I am running into issues installing the following packages: torch-cluster, torch-scatter, and torch-sparse. Pip fails while building wheels or running setup.py for these packages.

GemmaTuron commented 1 year ago

Hi @DhanshreeA! Could this be due to your system setup (using CPU instead of CUDA, for example)? What is the pip command you are using? And which version of PyTorch are you using? Have you tried manually installing the packages one by one from their binary source (as indicated, for example, in https://pypi.org/project/torch-cluster/)?

DhanshreeA commented 1 year ago

Hi @GemmaTuron thanks for your suggestion about using CPU instead of GPU. I recall that it is one of the constraints for making ersilia accessible in low resource settings. To address your other points:

I had set up a Python 3.7 conda environment with the latest pip version (22.3.1); the PyTorch version I had installed was 1.4.0, as per ImageMol's instructions.
The error that I got for torch-cluster, torch-scatter, and torch-sparse was similar to the error described in this issue: https://github.com/rusty1s/pytorch_cluster/issues/146
I had tried installing the packages one by one, but at that point I had not tried building them from source.

However, after inspecting that issue and reading the author's suggestion to upgrade to Python 3.8, I went through the release history to find versions of these packages that worked with Python 3.7. That is when I also realized that some of the later versions require torch > 1.4.0. In the ImageMol repo, these three dependencies in particular are not pinned to any specific version, so I found the last release supporting torch 1.4.0 for each of the three packages and installed those versions. This got the setup to work, and I was able to run the pretrain script on the toy dataset.
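Since the working versions were found by hand, it may help to record them once the environment works. The following is a minimal sketch, not part of the ImageMol repo: the helper name `pinned_requirements` is hypothetical, and it only reports whatever versions happen to be installed in the current environment.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:  # Python 3.7: assumes the `importlib_metadata` backport is installed
    import importlib_metadata as metadata

# The packages discussed in this thread.
PACKAGES = ["torch", "torch-cluster", "torch-scatter", "torch-sparse"]


def pinned_requirements(packages=PACKAGES):
    """Return requirements-style lines pinning the currently installed versions."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker instead of silently skipping the package.
            lines.append(f"# {name}: not installed")
    return lines


if __name__ == "__main__":
    print("\n".join(pinned_requirements()))
```

Running this inside the working conda environment and saving the output to a requirements file would make the setup reproducible for others.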

GemmaTuron commented 1 year ago

ImageMol contains several pretrained models as well as datasets for fine-tuning the pretrained network on different activities. We will select models or groups of models and incorporate each one independently into the Hub (opening a new issue for each). Let's complete this list:

SARS-CoV-2 (13 models): needs to be fine-tuned
...

DhanshreeA commented 1 year ago

Continuing @GemmaTuron's comment above.

ImageMol authors provide pretrained models for eight benchmark datasets (binary classification tasks) from MoleculeNet:

The following table contains basic statistical information about these datasets: [image: Screenshot from 2023-01-11 20-40-16]

DhanshreeA commented 1 year ago

The datasets provided include:

Drug Metabolism: Cytochrome P450 inhibitor binary classification task. They use PubChem Data Sets I and II from https://pubmed.ncbi.nlm.nih.gov/21491913/. From the supplementary doc:

PubChem Data Sets I and II are two-category datasets and both of them include CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4 isoforms. In addition, we also combine the five separate tasks (1A2, 2C9, 2C19, 2D6, and 3A4) of PubChem Data Set I into a multi-labeled classification problem to evaluate the performance of ImageMol in multi-labeled scenario.

Statistical information on CYP450: [image: Screenshot from 2023-01-11 20-47-04]
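To illustrate what combining the five separate tasks into one multi-label problem could look like, here is a minimal sketch. It is an assumption about the data layout, not the authors' preprocessing: the function `merge_tasks` is hypothetical, the SMILES strings and labels are made-up toy values, and missing labels are represented as `None`.

```python
from typing import Dict, List, Optional

# Isoform names taken from the quoted supplementary text.
ISOFORMS = ["1A2", "2C9", "2C19", "2D6", "3A4"]


def merge_tasks(
    per_task: Dict[str, Dict[str, int]],
    tasks: List[str] = ISOFORMS,
) -> Dict[str, List[Optional[int]]]:
    """Merge {task: {smiles: label}} into {smiles: [label per task]}.

    A compound that was not measured for a task keeps None in that column.
    """
    merged: Dict[str, List[Optional[int]]] = {}
    for col, task in enumerate(tasks):
        for smiles, label in per_task.get(task, {}).items():
            row = merged.setdefault(smiles, [None] * len(tasks))
            row[col] = label
    return merged


# Toy input (hypothetical labels, for illustration only).
toy = {
    "1A2": {"CCO": 0, "c1ccccc1": 1},
    "3A4": {"CCO": 1},
}
print(merge_tasks(toy))
```

With a layout like this, a loss function would need to mask the `None` entries so that unmeasured compound-task pairs do not contribute to training.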

DhanshreeA commented 1 year ago

Datasets of Compound-Protein Binding Prediction

Top 10 GPCR datasets with the largest number of reported ligands from ChEMBL database:

[image: Screenshot from 2023-01-11 20-50-49]

Ten KinomeScan datasets from Library of Integrated network-based cellular signatures (LINCS):

[image: Screenshot from 2023-01-11 20-51-30]

DhanshreeA commented 1 year ago

Datasets on SARS-CoV-2 assays from NCATS OpenData, as linked here: [image: Screenshot from 2023-01-11 20-59-44]

DhanshreeA commented 1 year ago

A few notes: when running pre-training or fine-tuning in a PyTorch 1.13.0 environment, the code raises a RuntimeError:

/home/dee/miniconda3/envs/imagemol2/lib/python3.7/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
  0%|                                                                                         | 0/60 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "pretrain.py", line 483, in <module>
    main(args)
  File "pretrain.py", line 395, in main
    loss.backward()
  File "/home/dee/miniconda3/envs/imagemol2/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/home/dee/miniconda3/envs/imagemol2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [512]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Interestingly, this does not happen when running fine-tuning or pre-training in a PyTorch 1.4.0 environment, which is the version the authors originally trained with. Finding the tensor operation responsible would require going through the code in depth, for example with anomaly detection enabled as the error message hints.
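Separately from the RuntimeError, the UserWarning at the top of the log above says the DataLoader asked for 16 workers on a machine where 8 is the suggested maximum. A minimal sketch of capping the worker count follows; the helper names are hypothetical and how the value is passed to the ImageMol scripts is left open.

```python
import os


def available_cpus() -> int:
    """CPUs usable by this process (respects CPU affinity where the OS exposes it)."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def safe_num_workers(requested: int) -> int:
    """Clamp a requested DataLoader worker count to the CPUs actually available."""
    return max(0, min(requested, available_cpus()))


if __name__ == "__main__":
    # On the machine from the log above this would be expected to print 8.
    print(safe_num_workers(16))
```

The returned value could then be used as the `num_workers` argument of `torch.utils.data.DataLoader`.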

For now I have only modified the code to work on CPU on this fork: https://github.com/DhanshreeA/ImageMol

DhanshreeA commented 1 year ago

To fine-tune the model on the provided MPP/sider toy dataset, run:

python finetune.py --gpu 0 \
                   --save_finetune_ckpt 1 \
                   --log_dir ./logs/toxcast \
                   --dataroot ./datasets/toy/finetuning/MPP \
                   --dataset sider \
                   --task_type classification \
                   --resume ./ckpts/ImageMol.pth.tar \
                   --image_aug \
                   --lr 0.5 \
                   --batch 64 \
                   --epochs 20
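When the same command is repeated for several datasets, it can be assembled programmatically. The sketch below only reuses the flags shown in the command above; the helper `finetune_command` is hypothetical, and deriving the log directory from the dataset name (so that sider logs go to `./logs/sider` rather than `./logs/toxcast`) is an assumption about intent, not something the ImageMol scripts require.

```python
import shlex


def finetune_command(dataset, dataroot, resume, log_root="./logs", **overrides):
    """Build the argv list for finetune.py using the flags from the command above."""
    options = {
        "gpu": 0,
        "save_finetune_ckpt": 1,
        "log_dir": f"{log_root}/{dataset}",  # assumption: one log folder per dataset
        "dataroot": dataroot,
        "dataset": dataset,
        "task_type": "classification",
        "resume": resume,
        "lr": 0.5,
        "batch": 64,
        "epochs": 20,
    }
    options.update(overrides)  # e.g. epochs=5 for a quick test run
    argv = ["python", "finetune.py"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    argv.append("--image_aug")  # boolean switch, takes no value
    return argv


if __name__ == "__main__":
    cmd = finetune_command(
        "sider", "./datasets/toy/finetuning/MPP", "./ckpts/ImageMol.pth.tar"
    )
    # Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```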

GemmaTuron commented 1 year ago

Hi @DhanshreeA !

Great start, thanks. So, what we suggest is:

Adding the HIV and the BACE models from MoleculeNet (these are the only datasets from MoleculeNet for which we don't have any model)
Working on the SARS-CoV-2 datasets. It would be ideal to combine the different models and provide the result as a multi-output classification. This will be quite a lot of work; once you know how much compute training a toy model requires, we can try to allocate GPUs for you, either through Colab or through our workstation
Adding the GPCR models

DhanshreeA commented 1 year ago

Thanks @GemmaTuron. That sounds good. I shall go ahead and create separate issues for HIV and BACE and get started on those.

DhanshreeA commented 1 year ago

@GemmaTuron Issue for HIV model https://github.com/ersilia-os/ersilia/issues/532 and BACE model: https://github.com/ersilia-os/ersilia/issues/533

DhanshreeA commented 1 year ago

Fine-tuning on the toy SIDER dataset (1427 input compounds) took ~48 minutes on my machine. The pre-trained ImageMol checkpoint I used was also trained on a toy dataset, which explains why the ROC-AUC values look like this:

final results: highest_valid: 0.544, final_train: 0.590, final_test: 0.531
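For context, a ROC-AUC of 0.5 corresponds to random ranking, so values between 0.53 and 0.59 are barely above chance. The sketch below computes ROC-AUC from its pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half); it is an illustration only and not the evaluation code used by ImageMol.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    # Count correctly ordered pairs; a tie contributes half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly
    print(roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # uninformative scores
```

A model that assigns the same score to every compound gets exactly 0.5 under this definition, which is the baseline the toy results should be compared against.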

GemmaTuron commented 1 year ago

Ok interesting, do you think Image Mol would be easy to run on Colab? we could grant you GPUs there. I think we can even do a test for free, they do give some gpu time, let me know if this sounds feasible or a lot of work for a test!

DhanshreeA commented 1 year ago

> Ok interesting, do you think Image Mol would be easy to run on Colab? we could grant you GPUs there. I think we can even do a test for free, they do give some gpu time, let me know if this sounds feasible or a lot of work for a test!

I can try it out and let you know. So far I don't think it should be difficult.

DhanshreeA commented 1 year ago

Updates: HIV and BACE models are incorporated and ready to be tested.

Next steps: the GPCR and SARS-CoV-2 assays, both of which are being formulated as multi-task problems.

DhanshreeA commented 1 year ago

Issues tracking the GPCR model (#571) and the SARS-CoV-2 model (#572)

GemmaTuron commented 1 year ago

I'll close this issue as model development is being tracked in the linked issues
