ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: ImageMol Embeddings #567

Closed DhanshreeA closed 1 year ago

DhanshreeA commented 1 year ago

Model Name

Molecular representation learning

Model Description

Representation Learning Framework that utilizes molecule images for encoding molecular inputs as machine readable vectors for downstream tasks such as bio-activity prediction, drug metabolism analysis, or drug toxicity prediction. The approach utilizes transfer learning, that is, pre-training the model on massive unlabeled datasets to help it in generalizing feature extraction and then fine tuning on specific tasks.

Slug

image-mol-embeddings

Tag

Embedding

Publication

https://www.nature.com/articles/s42256-022-00557-6

Source code

https://github.com/HongxinXiang/ImageMol

License

MIT License

DhanshreeA commented 1 year ago

Related to #518

GemmaTuron commented 1 year ago

/approve

github-actions[bot] commented 1 year ago

New Model Repository Created! 🎉

@DhanshreeA Ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos4avb

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

DhanshreeA commented 1 year ago

Updates:

I got the embeddings for ImageMol by loading the pretrained model provided by the authors, and then creating a `Sequential` block from the model's children, excluding the last fully connected classification layer:

```python
import torch

# `model` is the pretrained ImageMol network loaded from the authors' checkpoint
model_blocks = list(model.children())
print(f"Total blocks in model: {len(model_blocks)}")
model_embeddings = torch.nn.Sequential(*model_blocks[:-1])  # drop the final classifier head
print(f"Total blocks in embeddings: {len(list(model_embeddings.children()))}")
```

```
Total blocks in model: 10
Total blocks in embeddings: 9
```
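For anyone reproducing this, here is a minimal, self-contained sketch of the head-removal pattern. The network below is a toy stand-in, not the actual ImageMol architecture; its layer names and sizes are illustrative only. The point is that the truncated model returns the pooled feature vector instead of class logits:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained image backbone (NOT the real ImageMol network;
# layer sizes here are illustrative assumptions).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global average pooling
    nn.Flatten(),
    nn.Linear(8, 4),           # final classifier head we want to drop
)

# Keep every child block except the last fully connected layer
embedder = nn.Sequential(*list(model.children())[:-1])

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)  # a preprocessed molecule image
    embedding = embedder(image)

print(embedding.shape)  # torch.Size([1, 8]): a feature vector, not logits
```

The same `list(model.children())[:-1]` slice works for any model whose classifier is its last top-level child; models with nested heads need a different cut point.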

To further validate whether these embeddings are actually meaningful, as suggested by Miquel and Gemma, I carried out some experiments on similar and dissimilar molecules.

Experimental setup

I took the following molecule (input_mol): `Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1` (abacavir, an antiviral administered for HIV) and generated molecules similar to it using the Similarity search in ChEMBL, DrugBank and UNPD model from the Ersilia Model Hub (https://github.com/ersilia-os/eos9c7k). For dissimilar molecules, I took 100 SMILES from eml_canonical.csv, since this list has significant overlap with FDA-approved drugs that may not be similar to the molecule under consideration.

For both the similar and dissimilar molecules, I generated their Morgan fingerprints and ImageMol embeddings. I ran Tanimoto similarity between the input_mol's Morgan fingerprint and that of every molecule in the similar and dissimilar sets. Then I generated ImageMol embeddings for the input_mol as well as every molecule in both sets and calculated distance metrics for measuring similarity, namely Euclidean and Manhattan distances. Furthermore, since the scales of these scores vary widely (Tanimoto scores lie in [0, 1], while distance scores lie in [0, N)), I normalized the distance scores with min-max normalization into the [0, 1] range so that trends, if any, are clearly visible.
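The scoring pipeline above can be sketched as follows. This uses toy binary fingerprints and toy embedding vectors purely for illustration; in the actual experiment the fingerprints are RDKit Morgan fingerprints and the embeddings come from the truncated ImageMol model:

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    # Tanimoto similarity on binary fingerprint vectors: |A & B| / |A | B|
    inter = np.logical_and(fp_a, fp_b).sum()
    union = np.logical_or(fp_a, fp_b).sum()
    return inter / union if union else 0.0

def min_max_normalize(x):
    # Rescale distances into [0, 1] so they are comparable to Tanimoto scores
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

# Toy fingerprints (stand-ins for Morgan fingerprints)
query = np.array([1, 1, 0, 1, 0, 1, 0, 0])
candidates = np.array([
    [1, 1, 0, 1, 0, 1, 0, 1],   # similar to query
    [0, 0, 1, 0, 1, 0, 1, 1],   # dissimilar
])
sims = np.array([tanimoto(query, c) for c in candidates])

# Toy embedding vectors (stand-ins for ImageMol embeddings)
emb_query = np.array([0.1, 0.4, 0.9])
emb_cands = np.array([[0.2, 0.5, 0.8], [0.9, 0.0, 0.1]])
euclidean = np.linalg.norm(emb_cands - emb_query, axis=1)
manhattan = np.abs(emb_cands - emb_query).sum(axis=1)

norm_euc = min_max_normalize(euclidean)
norm_man = min_max_normalize(manhattan)
print(sims)      # high Tanimoto for the similar candidate, low for the dissimilar one
print(norm_euc)  # low normalized distance for similar, high for dissimilar
print(norm_man)
```

After normalization, both distance metrics share the Tanimoto score's [0, 1] range, which is what makes the overlaid distributions in the plots directly comparable.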

Attached below are the results:

Dissimilar Molecules: [plot: Tanimoto similarity vs. normalized Euclidean/Manhattan distances for the dissimilar set]

Similar Molecules: [plot: Tanimoto similarity vs. normalized Euclidean/Manhattan distances for the similar set]

These two kinds of measures behave in opposite directions: the Tanimoto score increases with increasing similarity, whereas the distance measures decrease with increasing similarity. Therefore, for the similar set I expected the Tanimoto distribution to lie above the normalized distance distributions, and for the dissimilar set to lie below them.

This does appear to be the case: for the Similar case, the Tanimoto score distribution roughly lies above the Manhattan and Euclidean distributions, whereas for the Dissimilar case, the Tanimoto distribution lies roughly below them.

Also linking the relevant Colab notebook for reproducing these experiments: https://colab.research.google.com/drive/1GozQNfpYOLWkedfomdu8lb2hoG8REos9?usp=sharing

DhanshreeA commented 1 year ago

PR: https://github.com/ersilia-os/eos4avb/pull/1

DhanshreeA commented 1 year ago

Currently resolving: https://github.com/ersilia-os/eos4avb/issues/3 https://github.com/ersilia-os/eos4avb/issues/4

DhanshreeA commented 1 year ago

@GemmaTuron should we close this issue?