Closed DhanshreeA closed 1 year ago
/approve
@DhanshreeA ersilia model repository has been successfully created and is available at:
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
README.md
file to accurately describe your model

If you have any questions, please feel free to open an issue and get support from the community!
I have gotten the model to run with Ersilia by copying the pre-processing steps and model inference from the original codebase into the main.py file within the model repo. An "issue" I was facing earlier while running this was misclassification of positive examples. However, I was only testing this with a few (~20) samples, and it's worth keeping in mind that the original dataset has ~41k samples and is also highly imbalanced. Moreover, the model only has 80% ROC-AUC. It just so happened that the few positive samples I was running it on were being misclassified. Here are a few things I did to confirm whether the incorporated model works:
- I tried running their eval script from the model codebase on the pre-processed data they have provided for a subset of 200 samples and still got all-negative classifications (so some false negatives, essentially).
- I increased this to 500 samples and finally got a few true-positive classified samples.
- I then ran the model code I had extracted into the model repo on the same 500 samples (but now directly on the raw dataset, to confirm I got the pre-processing right) and got exactly the same result.
These are the outputs produced on raw dataset (+ incorporated model) and processed dataset (+original model).
If you do a simple file diff on these two files, you will find there is no difference between them. It's the same when comparing them as dataframes (`DataFrame.compare`).
raw_output.csv
proc_output.csv
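The file-level check described above can be sketched with pandas; the frames here are tiny stand-ins (with assumed column names), since the real outputs are in the attached raw_output.csv and proc_output.csv:

```python
import pandas as pd

# In the real comparison the two frames come from raw_output.csv and
# proc_output.csv; here we build small stand-ins so the snippet runs
# on its own.
raw = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"], "HIV_active": [0, 1]})
proc = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"], "HIV_active": [0, 1]})

# DataFrame.compare returns only the cells that differ; an empty frame
# means the two outputs match exactly.
diff = raw.compare(proc)
print("identical" if diff.empty else diff)
```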
I believe it's the same case with BACE as well (#533); I will validate this and confirm.
@GemmaTuron This model is ready to be tested by others in the team.
@DhanshreeA what is the output of this model? is this a multiclassification (CA, CI, CM) or simply 0 and 1? Are you giving a probability? See the original dataset: https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#hiv-datasets
Hi @GemmaTuron this is binary classification. The authors have provided a readme with the processed version of this data and they're modelling it as CA+CM = 1 and CI = 0.
To quote the authors:
The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM).
The data file contains a csv table, in which the columns below are used:
- "smiles" - SMILES representation of the molecular structure
- "activity" - Three-class labels for screening results: CI/CM/CA
- "HIV_active" - Binary labels for screening results: 1 (CA/CM) and 0 (CI)
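The labelling the authors describe can be sketched in a few lines; the column names come from the dataset README quoted above, while the example molecules are arbitrary:

```python
import pandas as pd

# CA and CM are merged into the positive class; CI is negative.
df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "activity": ["CI", "CA", "CM"],
})
df["HIV_active"] = df["activity"].isin(["CA", "CM"]).astype(int)
print(df["HIV_active"].tolist())  # → [0, 1, 1]
```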
Thanks @DhanshreeA
This kind of information should be in the interpretation of the models otherwise end users will not know how to understand the models ;)
That makes sense @GemmaTuron I'll update the README/Issue description with this info. :grimacing:
Hi @DhanshreeA !
The workflow is failing at the same point as BACE (#533) when testing the model automatically (see Action)
Hi @GemmaTuron, same as with BACE, I am able to fetch this locally (without the repo_path flag). Here are the logs: eos6hy3.log. I went through the action but I am not sure why this error is occurring there. The only immediate difference I see between the action and my local setup is in the Info from BentoML during fetch:
Action:
Info {'name': 'eos6hy3', 'version': '20230124081353_373659', 'created_at': '2023-01-24T08:13:53.422948Z', 'env': {'conda_env': 'name: bentoml-default-conda-env\nchannels:\n- defaults\ndependencies: []\n', 'python_version': '3.7.16', 'docker_base_image': 'bentoml/model-server:0.11.0-py37', 'pip_packages': ['bentoml==0.11.0']}, 'artifacts': [{'name': 'model', 'artifact_type': 'Artifact', 'metadata': {}}], 'apis': [{'name': 'predict', 'input_type': 'JsonInput', 'docs': "BentoService inference API 'predict', input: 'JsonInput', output: 'DefaultOutput'", 'output_config': {'cors': '*'}, 'output_type': 'DefaultOutput', 'mb_max_latency': 10000, 'mb_max_batch_size': 2000, 'batch': True}]}
Local:
Info {'name': 'eos6hy3', 'version': '20230124150531_7DD0E8', 'created_at': '2023-01-24T09:35:32.592455Z', 'env': {'conda_env': 'name: bentoml-default-conda-env\nchannels:\n- defaults\ndependencies: []\n', 'python_version': '3.7.13', 'docker_base_image': 'bentoml/model-server:0.11.0-py37', 'pip_packages': ['bentoml==0.11.0']}, 'artifacts': [{'name': 'model', 'artifact_type': 'Artifact', 'metadata': {}}], 'apis': [{'name': 'predict', 'input_type': 'JsonInput', 'docs': "BentoService inference API 'predict', input: 'JsonInput', output: 'DefaultOutput'", 'output_config': {'cors': '*'}, 'output_type': 'DefaultOutput', 'mb_max_latency': 10000, 'mb_max_batch_size': 2000, 'batch': True}]}
So the Python version is 3.7.16 on the action, while on my local it is 3.7.13.
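Rather than eyeballing the two Info dicts above, every differing field can be found programmatically; the dicts below are shortened stand-ins containing just the fields from the thread:

```python
# Shortened stand-ins for the two BentoML Info dicts pasted above.
action = {"version": "20230124081353_373659", "python_version": "3.7.16"}
local = {"version": "20230124150531_7DD0E8", "python_version": "3.7.13"}

# Collect every shared key whose values disagree.
diffs = {k: (action[k], local[k])
         for k in action.keys() & local.keys()
         if action[k] != local[k]}
print(diffs)
```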
@GemmaTuron I propose we have these models tested by at least one other contributor so we can pinpoint whether the issue is within the action?
Thanks for checking, the python version shouldn't be a problem. Let's see if the others encounter the errors at model testing or not.
> @GemmaTuron I propose we have these models tested by at least one other contributor so we can pinpoint whether the issue is within the action?
You read my mind! They are already assigned for testing so we will know soon ;)
@GemmaTuron The latest pull+install of Ersilia CLI flags a checksum discrepancy for me while fetching the HIV model:
15:19:34 | ERROR | ❌ Checksum discrepancy in file model/checkpoints/hiv.pth: expected a90582417ba1d64492b77c659b89ff03ef8f2133ab247bd8a6c8cdb5b8d48dfd, actual 23ef7d689d27e20be74c8c8a2cd4421767af9371135e7fd6ee09a86a7800ec0a
However, the model is still fetched. I am guessing this should be a cause for concern? Where is Ersilia getting the expected checksum from?
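For reference, the kind of check behind that error is just a SHA-256 digest of the checkpoint file compared against an expected value; a minimal sketch (the helper name is mine, not Ersilia's):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file incrementally so large checkpoints don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage against the file from the error message would look like:
# sha256sum("model/checkpoints/hiv.pth")
```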
If I'm not wrong, this is a new function that @miquelduranfrigola is implementing to fetch models from our AWS S3 instead of Git LFS (still in progress). If the model is not in S3, it simply indicates so and continues with the clone from Git LFS.
Model Name
imagemol-hiv
Model Description
A representation learning framework that utilizes molecule images to encode molecular inputs as machine-readable vectors for downstream tasks such as bioactivity prediction, drug metabolism analysis, or drug toxicity prediction. The approach uses transfer learning: pre-training the model on massive unlabeled datasets to help it generalize feature extraction, and then fine-tuning it on specific tasks.
The HIV (Human Immunodeficiency Virus) dataset contains more than 40,000 records of whether a compound inhibits HIV replication, for binary classification between active and inactive compounds.
Slug
hiv-replication-prediction
Tags
classification
Publication
Original Paper: https://www.nature.com/articles/s42256-022-00557-6
Supplementary Materials: https://static-content.springer.com/esm/art%3A10.1038%2Fs42256-022-00557-6/MediaObjects/42256_2022_557_MOESM1_ESM.pdf
Code
https://github.com/HongxinXiang/ImageMol
Checkpoints: https://drive.google.com/file/d/1NOj3Hr36bbn6POdcFBciCbb5ilVz2440/view?usp=sharing
Parent Issue: https://github.com/ersilia-os/ersilia/issues/518
License
No response