Closed DhanshreeA closed 1 year ago
/approve
@DhanshreeA ersilia model repository has been successfully created and is available at:
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
README.md
file to accurately describe your model

If you have any questions, please feel free to open an issue and get support from the community!
I have gotten the model to run with Ersilia by copying the pre-processing steps and model inference from the original codebase into the main.py file within the model repo. An "issue" I was facing earlier while running this was misclassification of positive examples. However, I was only testing this with a few (~20) samples, and it's worth keeping in mind that the original dataset has ~41k samples and is also highly imbalanced. Moreover, the model only has 80% ROC-AUC. It just so happened that the few positive samples I was running it on were being misclassified. Here are a few things I did to confirm whether the incorporated model works:
- I tried running their eval script from the model codebase on the pre-processed data they have provided for a subset of 200 samples and still got all-negative classifications (so some false negatives, essentially).
- I increased this to 500 samples and finally got a few true-positive classified samples.
- I then ran the model code I had extracted into the model repo on the same 500 samples (but now directly on the raw dataset, to confirm I got the pre-processing right) and got exactly the same result.
These are the outputs produced on raw dataset (+ incorporated model) and processed dataset (+original model).
If you do a simple file diff on these two files, you will find there is no difference between them. It's the same when comparing them as dataframes (`DataFrame.compare`).
raw_output.csv
proc_output.csv
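The file-level check described above can be sketched with pandas; the frames here are tiny stand-ins (with assumed column names), since the real outputs are in the attached raw_output.csv and proc_output.csv:

```python
import pandas as pd

# In the real comparison the two frames come from raw_output.csv and
# proc_output.csv; here we build small stand-ins so the snippet runs
# on its own.
raw = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"], "HIV_active": [0, 1]})
proc = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"], "HIV_active": [0, 1]})

# DataFrame.compare returns only the cells that differ; an empty frame
# means the two outputs match exactly.
diff = raw.compare(proc)
print("identical" if diff.empty else diff)
```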
I believe it's the same case with BACE as well (#533); I will validate this and confirm.
@GemmaTuron This model is ready to be tested by others in the team.
@DhanshreeA what is the output of this model? is this a multiclassification (CA, CI, CM) or simply 0 and 1? Are you giving a probability? See the original dataset: https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#hiv-datasets
Hi @GemmaTuron this is binary classification. The authors have provided a readme with the processed version of this data and they're modelling it as CA+CM = 1 and CI = 0.
To quote the authors:
The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM).
The data file contains a csv table, in which the columns below are used:
- "smiles" - SMILES representation of the molecular structure
- "activity" - Three-class labels for screening results: CI/CM/CA
- "HIV_active" - Binary labels for screening results: 1 (CA/CM) and 0 (CI)
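The labelling the authors describe can be sketched in a few lines; the column names come from the dataset README quoted above, while the example molecules are arbitrary:

```python
import pandas as pd

# CA and CM are merged into the positive class; CI is negative.
df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "activity": ["CI", "CA", "CM"],
})
df["HIV_active"] = df["activity"].isin(["CA", "CM"]).astype(int)
print(df["HIV_active"].tolist())  # → [0, 1, 1]
```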
Thanks @DhanshreeA
This kind of information should be in the interpretation of the models otherwise end users will not know how to understand the models ;)
That makes sense @GemmaTuron I'll update the README/Issue description with this info. :grimacing:
Hi @DhanshreeA !
The workflow is failing at the same point as BACE (#533) when testing the model automatically (see Action)
Hi @GemmaTuron, same as with BACE, I am able to fetch this locally (without the repo_path flag). Here are the logs: eos6hy3.log. I went through the action but I am not sure why this error is occurring there. The only immediate difference I see between the action and my local setup is in the Info from BentoML during fetch:
Action:
Info {'name': 'eos6hy3', 'version': '20230124081353_373659', 'created_at': '2023-01-24T08:13:53.422948Z', 'env': {'conda_env': 'name: bentoml-default-conda-env\nchannels:\n- defaults\ndependencies: []\n', 'python_version': '3.7.16', 'docker_base_image': 'bentoml/model-server:0.11.0-py37', 'pip_packages': ['bentoml==0.11.0']}, 'artifacts': [{'name': 'model', 'artifact_type': 'Artifact', 'metadata': {}}], 'apis': [{'name': 'predict', 'input_type': 'JsonInput', 'docs': "BentoService inference API 'predict', input: 'JsonInput', output: 'DefaultOutput'", 'output_config': {'cors': '*'}, 'output_type': 'DefaultOutput', 'mb_max_latency': 10000, 'mb_max_batch_size': 2000, 'batch': True}]}
Local:
Info {'name': 'eos6hy3', 'version': '20230124150531_7DD0E8', 'created_at': '2023-01-24T09:35:32.592455Z', 'env': {'conda_env': 'name: bentoml-default-conda-env\nchannels:\n- defaults\ndependencies: []\n', 'python_version': '3.7.13', 'docker_base_image': 'bentoml/model-server:0.11.0-py37', 'pip_packages': ['bentoml==0.11.0']}, 'artifacts': [{'name': 'model', 'artifact_type': 'Artifact', 'metadata': {}}], 'apis': [{'name': 'predict', 'input_type': 'JsonInput', 'docs': "BentoService inference API 'predict', input: 'JsonInput', output: 'DefaultOutput'", 'output_config': {'cors': '*'}, 'output_type': 'DefaultOutput', 'mb_max_latency': 10000, 'mb_max_batch_size': 2000, 'batch': True}]}
So the Python version is 3.7.16 on the action, while on my local it is 3.7.13.
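Rather than eyeballing the two Info dicts above, every differing field can be found programmatically; the dicts below are shortened stand-ins containing just the fields from the thread:

```python
# Shortened stand-ins for the two BentoML Info dicts pasted above.
action = {"version": "20230124081353_373659", "python_version": "3.7.16"}
local = {"version": "20230124150531_7DD0E8", "python_version": "3.7.13"}

# Collect every shared key whose values disagree.
diffs = {k: (action[k], local[k])
         for k in action.keys() & local.keys()
         if action[k] != local[k]}
print(diffs)
```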
@GemmaTuron I propose we have these models tested by at least one other contributor so we can pinpoint whether the issue is within the action?
Thanks for checking, the python version shouldn't be a problem. Let's see if the others encounter the errors at model testing or not.
> @GemmaTuron I propose we have these models tested by at least one other contributor so we can pinpoint whether the issue is within the action?
You read my mind! They are already assigned for testing so we will know soon ;)
@GemmaTuron The latest pull+install of Ersilia CLI flags a checksum discrepancy for me while fetching the HIV model:
15:19:34 | ERROR | ❌ Checksum discrepancy in file model/checkpoints/hiv.pth: expected a90582417ba1d64492b77c659b89ff03ef8f2133ab247bd8a6c8cdb5b8d48dfd, actual 23ef7d689d27e20be74c8c8a2cd4421767af9371135e7fd6ee09a86a7800ec0a
However, the model is still fetched. I am guessing this should be a cause for concern? Where is Ersilia getting the expected checksum from?
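For reference, the kind of check behind that error is just a SHA-256 digest of the checkpoint file compared against an expected value; a minimal sketch (the helper name is mine, not Ersilia's):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file incrementally so large checkpoints don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage against the file from the error message would look like:
# sha256sum("model/checkpoints/hiv.pth")
```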
If I'm not wrong, this is a new function that @miquelduranfrigola is implementing to fetch models from our AWS S3 instead of Git LFS (still in progress). If the model is not in S3, it simply indicates so and continues with the clone from Git LFS.
Model Name
imagemol-hiv
Model Description
A representation learning framework that utilizes molecule images to encode molecular inputs as machine-readable vectors for downstream tasks such as bioactivity prediction, drug metabolism analysis, or drug toxicity prediction. The approach uses transfer learning: pre-training the model on massive unlabeled datasets to help it generalize feature extraction, and then fine-tuning it on specific tasks.
The HIV (Human Immunodeficiency Virus) dataset contains more than 40,000 records of whether a compound inhibits HIV replication, for binary classification between active and inactive compounds.
Slug
hiv-replication-prediction
Tags
classification
Publication
Original Paper: https://www.nature.com/articles/s42256-022-00557-6
Supplementary Materials: https://static-content.springer.com/esm/art%3A10.1038%2Fs42256-022-00557-6/MediaObjects/42256_2022_557_MOESM1_ESM.pdf
Code
https://github.com/HongxinXiang/ImageMol
Checkpoints: https://drive.google.com/file/d/1NOj3Hr36bbn6POdcFBciCbb5ilVz2440/view?usp=sharing
Parent Issue: https://github.com/ersilia-os/ersilia/issues/518
License
No response