Closed DhanshreeA closed 1 year ago
Related to #518
/approve
@DhanshreeA ersilia model repository has been successfully created and is available at:
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
README.md file to accurately describe your model
If you have any questions, please feel free to open an issue and get support from the community!
Updates:
I got the embeddings for ImageMol by loading the pretrained model provided by the authors, and then creating a Sequential block out of the model's children except the last fully connected classification layer.
```python
import torch

# `model` is the pretrained ImageMol network loaded from the authors' checkpoint.
model_blocks = list(model.children())
print(f"Total blocks in model: {len(model_blocks)}")
# Keep every child except the final fully connected classifier head.
model_embeddings = torch.nn.Sequential(*model_blocks[:-1])
print(f"Total blocks in embeddings: {len(list(model_embeddings.children()))}")
```
Output:
```
Total blocks in model: 10
Total blocks in embeddings: 9
```
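For anyone who wants to try the head-removal step without downloading the ImageMol weights, here is a minimal, self-contained sketch of the same idea on a toy network. `ToyClassifier`, its layer sizes, and the random input are hypothetical stand-ins, not the ImageMol architecture. One caveat worth checking on the real model: `children()` only returns registered modules, so any functional call inside `forward` (for example a flatten before the classifier) is lost when the children are chained in a `Sequential`, and may need to be added back explicitly.

```python
import torch
from torch import nn


class ToyClassifier(nn.Module):
    """Hypothetical stand-in for a CNN whose last child is a classifier head."""

    def __init__(self, embedding_dim=8, num_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(3, embedding_dim, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        x = self.avgpool(self.relu(self.conv(x)))
        # This flatten is a functional call, so it is NOT one of the children.
        return self.fc(torch.flatten(x, 1))


model = ToyClassifier().eval()
blocks = list(model.children())  # conv, relu, avgpool, fc

# Drop the classifier head and re-add the flatten step as a module,
# so the output has shape (batch, embedding_dim) instead of (batch, C, 1, 1).
embedder = nn.Sequential(*blocks[:-1], nn.Flatten())

with torch.no_grad():
    images = torch.rand(4, 3, 32, 32)  # random stand-in for molecule images
    embeddings = embedder(images)

print(f"Total blocks in model: {len(blocks)}")
print(f"Embedding shape: {tuple(embeddings.shape)}")
```

Without the extra `nn.Flatten()`, the toy embedder would return a 4-D tensor, which is easy to miss when the embeddings are later passed to a distance function.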
To further validate whether these embeddings are actually meaningful, as suggested by Miquel and Gemma, I carried out some experiments on similar and dissimilar molecules.
Experimental setup
I took the following molecule (input_mol): Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1
(abacavir, an antiviral administered for HIV) and generated molecules similar to it using the "Similarity search in ChEMBL, DrugBank and UNPD" model from the Ersilia Model Hub (https://github.com/ersilia-os/eos9c7k).
For the dissimilar molecules, I took 100 SMILES from eml_canonical.csv, since this list overlaps significantly with FDA-approved drugs, which may not be similar to the molecule under consideration.
For both the similar and dissimilar molecules, I generated Morgan fingerprints and ImageMol embeddings. I computed the Tanimoto similarity between the input_mol's Morgan fingerprint and that of every molecule in the similar and dissimilar sets. I then generated ImageMol embeddings for the input_mol and for every molecule in both sets, and calculated two distance metrics between them, namely Euclidean and Manhattan distance. Since Tanimoto scores lie in [0, 1] whereas distances lie in [0, N), the scales differ widely, so I min-max normalized the distances to the [0, 1] range such that trends, if any, are clearly visible.
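The comparison described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the notebook: fingerprints are represented as sets of on-bit indices, embeddings as short lists of floats, and all values in `candidates` are made up. In practice the fingerprints would come from a cheminformatics toolkit and the embeddings from the truncated ImageMol model.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy data: (fingerprint on-bits, embedding) per molecule.
query_bits, query_emb = {1, 4, 7, 9}, [0.0, 0.0]
candidates = {
    "close": ({1, 4, 7}, [1.0, 0.0]),
    "mid": ({1, 4, 12, 15}, [3.0, 4.0]),
    "far": ({20, 21}, [6.0, 8.0]),
}

tanimoto_scores = [tanimoto(query_bits, bits) for bits, _ in candidates.values()]
euclidean_raw = [euclidean(query_emb, emb) for _, emb in candidates.values()]
manhattan_raw = [manhattan(query_emb, emb) for _, emb in candidates.values()]

# Only the distances need rescaling; Tanimoto is already in [0, 1].
euclidean_norm = min_max(euclidean_raw)
manhattan_norm = min_max(manhattan_raw)

for name, t, e, m in zip(candidates, tanimoto_scores, euclidean_norm, manhattan_norm):
    print(f"{name}: tanimoto={t:.2f} euclidean={e:.2f} manhattan={m:.2f}")
```

Note that min-max normalization is relative to the set it is applied to: the smallest distance in a set always maps to 0 and the largest to 1, so normalized values from two separately normalized sets are not on a shared absolute scale.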
Attached below are the results:
Dissimilar Molecules:
Similar Molecules:
These similarity measures run in opposite directions: the Tanimoto score increases with increasing similarity, whereas the distance measures decrease with increasing similarity. Therefore, I expected the Tanimoto scores to sit above the normalized distances for the similar molecules, and below them for the dissimilar molecules.
This does appear to be the case, i.e., for the Similar case, the Tanimoto score distribution roughly lies above the Manhattan and Euclidean score distributions, whereas for the Dissimilar case, the Tanimoto distribution lies roughly below the Manhattan and Euclidean distributions.
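One way to make "roughly lies above" concrete, instead of reading it off the plots, is to compare the medians of the two distributions. The sketch below is only a suggestion: `tanimoto_above_distance` is a hypothetical helper, and the numbers are made up for illustration and are not the results from this issue.

```python
from statistics import median


def tanimoto_above_distance(tanimoto_scores, normalized_distances):
    """True if the median Tanimoto score exceeds the median normalized distance."""
    return median(tanimoto_scores) > median(normalized_distances)


# Hypothetical values for illustration only (NOT the results reported here).
similar_tanimoto = [0.82, 0.75, 0.91, 0.68]
similar_distance = [0.10, 0.22, 0.05, 0.31]
dissimilar_tanimoto = [0.08, 0.12, 0.05, 0.15]
dissimilar_distance = [0.71, 0.64, 0.88, 0.59]

# Expected trend: Tanimoto above distance for similar molecules,
# and below it for dissimilar molecules.
similar_ok = tanimoto_above_distance(similar_tanimoto, similar_distance)
dissimilar_ok = not tanimoto_above_distance(dissimilar_tanimoto, dissimilar_distance)

print(f"Similar set follows expected trend: {similar_ok}")
print(f"Dissimilar set follows expected trend: {dissimilar_ok}")
```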
Also linking the relevant Colab notebook for reproducing these experiments: https://colab.research.google.com/drive/1GozQNfpYOLWkedfomdu8lb2hoG8REos9?usp=sharing
@GemmaTuron should we close this issue?
Model Name
Molecular representation learning
Model Description
A representation learning framework that uses molecule images to encode molecular inputs as machine-readable vectors for downstream tasks such as bioactivity prediction, drug metabolism analysis, or drug toxicity prediction. The approach uses transfer learning: the model is pre-trained on massive unlabeled datasets to help it generalize feature extraction, and is then fine-tuned on specific tasks.
Slug
image-mol-embeddings
Tag
Embedding
Publication
https://www.nature.com/articles/s42256-022-00557-6
Source code
https://github.com/HongxinXiang/ImageMol
License
MIT License