Closed DhanshreeA closed 1 year ago
Updates: I am currently trying to reproduce the results of ImageMol for the pre-trained models the authors have provided for eight different benchmark datasets. While setting up an environment as per the instructions, I am running into issues installing the following packages: torch-cluster, torch-scatter, and torch-sparse. Pip fails while building wheels or running setup.py for these packages.
Hi @DhanshreeA! Could this be due to your system setup (using CPU instead of CUDA, for example)? What is the pip command you are using, and which version of PyTorch? Have you tried manually installing the packages one by one from their binary source (as indicated, for example, in https://pypi.org/project/torch-cluster/)?
Hi @GemmaTuron thanks for your suggestion about using CPU instead of GPU. I recall that it is one of the constraints for making ersilia accessible in low resource settings. To address your other points:
The error with torch-cluster, torch-scatter, and torch-sparse was similar to the error described in this issue: https://github.com/rusty1s/pytorch_cluster/issues/146
However, the author's suggestion in that issue to upgrade to Python 3.8 prompted me to go through the release history to find versions of these packages that worked with Python 3.7, which is when I also realized that some of the later versions required torch >1.4.0. In the ImageMol repo, these three dependencies are not pinned to any specific version, so I found the last release supporting torch 1.4.0 for each of the three packages and installed those versions. This got the setup to work, and I was able to run the pretrain script on the toy dataset.
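Since the fix came down to matching package versions against the installed torch, a quick way to see what an environment actually contains is to query each module. This is a minimal sketch, not part of the ImageMol repo: the helper name `installed_versions` is made up for illustration, and it assumes the usual import names `torch_cluster`, `torch_scatter`, and `torch_sparse`. It deliberately avoids `importlib.metadata` so that it also runs on Python 3.7.

```python
import importlib
import importlib.util
from typing import Dict, Iterable, Optional


def installed_versions(module_names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each top-level module name to its version string.

    None means the module is not installed. A module that is installed but
    fails to import (for example, a binary built against a different torch)
    is reported as "import failed: <reason>".
    """
    versions: Dict[str, Optional[str]] = {}
    for name in module_names:
        # find_spec returns None for a top-level module that is not installed.
        if importlib.util.find_spec(name) is None:
            versions[name] = None
            continue
        try:
            module = importlib.import_module(name)
        except Exception as err:  # broken binary wheels often raise OSError
            versions[name] = "import failed: {}".format(err)
            continue
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


if __name__ == "__main__":
    names = ["torch", "torch_cluster", "torch_scatter", "torch_sparse"]
    for name, version in installed_versions(names).items():
        print("{}: {}".format(name, version or "not installed"))
```

Running this in both the working and the failing environment makes it easy to compare which torch release each extension package was installed alongside.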
ImageMol contains several pretrained models as well as datasets to fine-tune the pretrained network for different activities. We will select models or groups of models and incorporate each one independently in the Hub (opening a new issue). Let's complete this list:
The datasets provided include:
PubChem Data Sets I and II are two-category datasets, and both include the CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4 isoforms. In addition, we also combine the five separate tasks (1A2, 2C9, 2C19, 2D6, and 3A4) of PubChem Data Set I into a multi-labeled classification problem to evaluate the performance of ImageMol in a multi-labeled scenario.
Statistical information on CYP450:
Datasets of Compound-Protein Binding Prediction
Datasets on SARS-CoV-2 assays from NCATS OpenData as linked here
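For the CYP450 data above, combining the five separate isoform tasks into one multi-labeled problem amounts to building one label row per compound. The sketch below illustrates the idea only; it is not the authors' preprocessing code. The function name `combine_tasks`, the input layout (one SMILES-to-label dictionary per isoform), and the use of -1 for "no measurement" are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Column order of the combined label matrix.
ISOFORMS = ["1A2", "2C9", "2C19", "2D6", "3A4"]

# Assumption: -1 marks a compound with no measurement for an isoform,
# so that a training loop could mask it out of the loss.
MISSING = -1


def combine_tasks(
    per_task: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[List[int]]]:
    """Merge per-isoform binary labels into one multi-label matrix.

    per_task maps an isoform name to a {SMILES: 0 or 1} dictionary.
    Returns the sorted compound list and one row of labels per compound,
    with columns ordered as in ISOFORMS.
    """
    compounds = sorted({smi for labels in per_task.values() for smi in labels})
    matrix = [
        [per_task.get(iso, {}).get(smi, MISSING) for iso in ISOFORMS]
        for smi in compounds
    ]
    return compounds, matrix


if __name__ == "__main__":
    # Toy labels invented for illustration, not taken from the datasets.
    toy = {"1A2": {"CCO": 1, "c1ccccc1": 0}, "2C9": {"CCO": 0}}
    for smi, row in zip(*combine_tasks(toy)):
        print(smi, row)
```

Whether unmeasured compound/isoform pairs should be masked or dropped is a design choice that depends on how the fine-tuning loss is written.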
A few notes: While trying to run pre-training or fine-tuning in a PyTorch 1.13.0 environment, the code raises a RuntimeError:
/home/dee/miniconda3/envs/imagemol2/lib/python3.7/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
0%| | 0/60 [00:02<?, ?it/s]
Traceback (most recent call last):
File "pretrain.py", line 483, in <module>
main(args)
File "pretrain.py", line 395, in main
loss.backward()
File "/home/dee/miniconda3/envs/imagemol2/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
self, gradient, retain_graph, create_graph, inputs=inputs
File "/home/dee/miniconda3/envs/imagemol2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [512]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Interestingly, this does not happen when running fine-tuning or pre-training in a PyTorch 1.4.0 environment, which is the version the authors originally trained with. To find which tensor operation is responsible, we would have to go through the code in depth, for example with anomaly detection enabled as the hint in the error message suggests.
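To make the error class concrete, here is a toy reproduction of the same RuntimeError. It is not the ImageMol code and says nothing about which operation in pretrain.py is at fault; it only shows how an in-place update of a tensor that autograd saved for the backward pass triggers the message, and how the out-of-place form avoids it. The function names are made up for this sketch.

```python
import torch


def inplace_failure_message():
    """Trigger the in-place autograd error and return its message."""
    w = torch.ones(3, requires_grad=True)
    # Anomaly detection records the forward traceback of the failing op,
    # which is what the hint in the error message refers to.
    with torch.autograd.detect_anomaly():
        y = w.exp()   # exp saves its output for the backward pass
        y.add_(1)     # in-place update bumps the saved tensor's version
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return None


def out_of_place_gradient():
    """Same computation with an out-of-place add; backward succeeds."""
    w = torch.ones(3, requires_grad=True)
    y = w.exp()
    y = y + 1         # creates a new tensor, the saved output is untouched
    y.sum().backward()
    return w.grad     # d/dw sum(exp(w) + 1) = exp(w)


if __name__ == "__main__":
    print(inplace_failure_message())
    print(out_of_place_gradient())
```

In the real training script the modification may come from somewhere less obvious than an explicit `add_`, so the forward traceback printed under anomaly detection is the more reliable way to locate it.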
For now I have only modified the code to work on CPU on this fork: https://github.com/DhanshreeA/ImageMol
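On a CPU-only machine, the DataLoader warning in the log above (16 workers requested, 8 suggested) can be avoided by capping the worker count at the number of usable CPUs. This is a minimal sketch under that assumption; `available_cpus` and `cap_workers` are hypothetical helper names and are not part of ImageMol or the fork.

```python
import os
from typing import Optional


def available_cpus() -> int:
    """Number of CPUs this process may use (at least 1)."""
    # sched_getaffinity respects CPU pinning but exists only on some platforms.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def cap_workers(requested: int, available: Optional[int] = None) -> int:
    """Clamp a requested worker count to the range [0, available]."""
    if available is None:
        available = available_cpus()
    return max(0, min(requested, available))


if __name__ == "__main__":
    print(cap_workers(16))
```

The returned value could then be passed as the `num_workers` argument of `torch.utils.data.DataLoader`; a value of 0 means the data is loaded in the main process.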
To fine-tune the model on the provided MPP/sider toy dataset, run:
python finetune.py --gpu 0 \
--save_finetune_ckpt 1 \
--log_dir ./logs/toxcast \
--dataroot ./datasets/toy/finetuning/MPP \
--dataset sider \
--task_type classification \
--resume ./ckpts/ImageMol.pth.tar \
--image_aug \
--lr 0.5 \
--batch 64 \
--epochs 20
Hi @DhanshreeA !
Great start, thanks. So, what we suggest is:
Thanks @GemmaTuron. That sounds good. I shall go ahead and create separate issues for HIV and BACE and get started on those.
@GemmaTuron Issue for HIV model https://github.com/ersilia-os/ersilia/issues/532 and BACE model: https://github.com/ersilia-os/ersilia/issues/533
Running fine-tuning on my machine on the toy SIDER dataset (1427 input compounds) took ~48 minutes. The pre-trained ImageMol checkpoint I used was itself trained on a toy dataset, which explains why the ROC-AUC values are close to chance level (0.5):
final results: highest_valid: 0.544, final_train: 0.590, final_test: 0.531
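For reference, ROC-AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so 0.5 corresponds to random ranking. The sketch below computes it directly from that definition for a single binary task; it is for illustration only and is not the evaluation code used by finetune.py.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC for one binary task via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # perfect ranking: 1.0
    print(roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))  # uninformative: 0.5
```

The pairwise loop is quadratic in the number of examples, which is fine for a toy check but slower than a rank-based implementation on large datasets.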
OK, interesting. Do you think ImageMol would be easy to run on Colab? We could grant you GPUs there. I think we can even do a test for free, since they do give some GPU time. Let me know if this sounds feasible or like a lot of work for a test!
I can try it out and let you know. So far I don't think it should be difficult.
Updates: HIV and BACE models are incorporated and ready to be tested.
Next steps: GPCR and SARS-CoV-2 assays, which are both being formulated as multi-task problems.
Issues tracking GPCR model #571 and SARS-CoV-2 model #572
I'll close this issue, as model development is being tracked in the linked issues.
Model Name
ImageMol
Model Description
Representation learning framework that uses molecule images to encode molecular inputs as machine-readable vectors for downstream tasks such as bioactivity prediction, drug metabolism analysis, or drug toxicity prediction. The approach relies on transfer learning: the model is pre-trained on massive unlabeled datasets to help it generalize feature extraction and is then fine-tuned on specific tasks.
Slug
image-mol
Tags
transfer-learning, pre-training, self-supervision, classification
Publication
Original Paper: https://www.nature.com/articles/s42256-022-00557-6
Supplementary Materials: https://static-content.springer.com/esm/art%3A10.1038%2Fs42256-022-00557-6/MediaObjects/42256_2022_557_MOESM1_ESM.pdf
Code
https://github.com/HongxinXiang/ImageMol
License
MIT License