ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: THEMBO JONATHAN #1000

jona42-ui closed this issue 3 months ago

jona42-ui commented 3 months ago

Week 1 - Get to know the community

Week 2 - Get Familiar with Machine Learning for Chemistry

Week 3 - Validate a Model in the Wild

Week 4 - Prepare your final application

jona42-ui commented 3 months ago

Motivation statement

I am Thembo Jonathan, a final-year (4th year) Software Engineering student at Makerere University, Kampala, Uganda. I am a learner at heart and passionate about contributing to solutions for the global good. The health sector is at the core of my career because I want to write code to save lives. That's me.

Ersilia is, without a shadow of a doubt, the best place to hone these skills, especially in this revolutionary age of AI/ML. AI/ML found wide adoption at the right time, when young minds like mine can focus on using this knowledge to change the story.

As a software engineering student, I have traditionally worked across the full range of technology stacks, but picking up AI for industrial work is now my goal and my aim for this amazing project, Ersilia.

I am not planning to leave this project after the internship, because it will be part of me.

I can't wait to expand the Model Hub for more use cases.

I write code to save lives

jona42-ui commented 3 months ago

This is Task 1 of Week 2: https://github.com/jona42-ui/eos4tcc-model-validation

@DhanshreeA

DhanshreeA commented 3 months ago

Hi @jona42-ui, good work so far; however, I am having trouble understanding your code. I see that you are using the model's predicted probabilities and a measure of aleatoric uncertainty, but I do not see the code where you actually generate the predictions and save them to your module_predictions.csv file. Could you please update the code so that it is easier to follow?

jona42-ui commented 3 months ago

Thanks for the timely feedback. Do you mean literally including how I ran the `run` Ersilia API after serving the model?

jona42-ui commented 3 months ago

Thanks @DhanshreeA, updated as requested here: https://github.com/jona42-ui/eos4tcc-model-validation/blob/main/notebooks/00_model_bias.ipynb

Running the models in the notebook is too cumbersome, so I have referenced how I ran the predictions from my host system in the terminal. I'm not sure if this suffices.

Thanks for the mentorship and time

jona42-ui commented 3 months ago

Hello @DhanshreeA, I have started on Task 2 of Week 2: https://github.com/jona42-ui/eos4tcc-model-validation/blob/main/notebooks/01_model_reproducibility.ipynb

However, I have a blocker: the BayeshERG model uses PyTorch builds that rely on NVIDIA's GPU toolkit, CUDA. Checking my system, I have an AMD GPU, which does not work with CUDA.

So on running predictions, I am facing an error that the libcuda.so.1 library is missing, even though all dependencies were seemingly installed, which confirms my concern above: `OSError: libcuda.so.1: cannot open shared object file: No such file or directory`

Is there a workaround you would suggest, or do I need to change the host system? cc: @GemmaTuron

Malikbadmus commented 3 months ago

@jona42-ui, how about trying a CPU workaround: let PyTorch use the CPU instead of attempting to use the GPU?

jona42-ui commented 3 months ago

Actually, CPU is the default without any flag, so the problem persists for whatever reason.

DhanshreeA commented 3 months ago

Hi @jona42-ui, all Ersilia models work on the CPU by default; if you check the dependencies for each model, they do not use GPU versions of ML libraries, e.g. PyTorch.

Can you explain to me what you're running, how you are running it, and what errors you are getting?

jona42-ui commented 3 months ago

Thanks @DhanshreeA, I am using this: https://github.com/GIST-CSBL/BayeshERG

Running this, after installing all dependencies and activating the virtual env:

```
python main.py -i data/External/EX1.csv -o EX1_pred -c cpu -t 30
```

The error (full terminal output is also on Pastebin: https://pastebin.com/bfrgVJuQ):

```
(BayeshERG) thembo@workspace:~/BayeshERG$ python main.py -i data/External/EX1.csv -o EX1_pred -c cpu -t 30
Using backend: pytorch
Traceback (most recent call last):
  File "main.py", line 7, in <module>
    import dgl
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/__init__.py", line 8, in <module>
    from .backend import load_backend, backend_name
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/backend/__init__.py", line 74, in <module>
    load_backend(get_preferred_backend())
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/backend/__init__.py", line 24, in load_backend
    mod = importlib.import_module('.%s' % mod_name, __name__)
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/backend/pytorch/__init__.py", line 1, in <module>
    from .tensor import *
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/backend/pytorch/tensor.py", line 10, in <module>
    from ... import ndarray as nd
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/ndarray.py", line 14, in <module>
    from ._ffi.object import register_object, ObjectBase
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/_ffi/object.py", line 8, in <module>
    from .object_generic import ObjectGeneric, convert_to_object
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/_ffi/object_generic.py", line 7, in <module>
    from .base import string_types
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/_ffi/base.py", line 42, in <module>
    _LIB, _LIB_NAME = _load_lib()
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/site-packages/dgl/_ffi/base.py", line 34, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
  File "/home/thembo/miniconda3/envs/BayeshERG/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcuda.so.1: cannot open shared object file: No such file or directory
```

I am doing all this in the terminal
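As a side note, a quick check with Python's standard library can confirm whether the NVIDIA driver library is discoverable at all on a host like this (a diagnostic sketch; `find_cuda_driver` is just an illustrative helper name):

```python
import ctypes.util


def find_cuda_driver():
    """Return the resolved name of the CUDA driver library, or None.

    On a machine without an NVIDIA driver (e.g. an AMD-GPU host),
    this returns None, which is consistent with the libcuda.so.1
    OSError above.
    """
    return ctypes.util.find_library("cuda")


if __name__ == "__main__":
    lib = find_cuda_driver()
    print("libcuda found:", lib if lib else "not present (CPU-only host)")
```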

DhanshreeA commented 3 months ago

Hi @jona42-ui is this issue still persisting? If yes, could you please share with me the dependencies in your environment?

jona42-ui commented 3 months ago

Hello @DhanshreeA, thanks for reaching out. Just a few minutes ago I was able to get past this issue; I was about to post an update here.

It turns out that the reference repository shared in the paper (for eos4tcc) has not been updated for about two years, and it is the one I cloned to experiment with the datasets that were used to train the model and make predictions.

Many of the classes were deprecated and out of date, yet I was meant to follow its README to make the predictions. I realised that the model source code within Ersilia's eos4tcc repository has up-to-date files, so when I updated the source files the error was resolved.

I am not sure if I was missing the workflow, because my aim was to reproduce the predictions using the original datasets and model, not the version hosted on the EMH (Ersilia Model Hub).

jona42-ui commented 3 months ago

For example, the main.py files: one uses dgl imports (dgl actually appears in the error above) that are no longer supported and were moved to dgllife.

Contains up-to-date imports: https://github.com/ersilia-os/eos4tcc/blob/main/model/framework/code/main.py#L16

Contains deprecated imports: https://github.com/GIST-CSBL/BayeshERG/blob/78f7654e480009df48a89fc78f6a2ef81d519e71/main.py#L12

jona42-ui commented 3 months ago

Not sure if this makes sense

jona42-ui commented 3 months ago

Update on Week 2, Task 2:

Update here: https://github.com/jona42-ui/eos4tcc-model-validation/blob/main/notebooks/01_model_reproducibility.ipynb

Basically, the authors collected hERG-related data from various sources and built two datasets for different tasks: pretraining and fine-tuning. The pretraining set is a regression task that predicts the hERG channel inhibition percentage at 10 μM, and the fine-tuning set is a classification task that predicts IC50-derived hERG channel blockers at a threshold of 10 μM.

But for generalisation of the model performance, they prepared two additional external test sets. The first external test set was from Ryu et al. [8], which contained 30 hERG positives and 14 hERG negatives based on an IC50 threshold of 10 μM. This is the dataset I have used throughout this task.

For the predictions and output, as described here, using:

```
# With CPU
$ python main.py -i data/External/EX1.csv -o EX1_pred -c cpu -t 30
```

The output (screenshot of the prediction table) is obtained with the metric measurements as described here.

jona42-ui commented 3 months ago

@DhanshreeA there is a piece missing from the predictions obtained from the Ersilia Model Hub; it's called the label. For example, the labels are binary (1 for active, 0 for inactive) and are used to assess the performance and accuracy of the model in distinguishing between active and inactive compounds. Tracing this back, it should have originated from the reference dataset used.

Looking at the external dataset from the BayeshERG model, the label is included, and it serves as ground-truth data for training and evaluating predictions.

(Screenshot of the external dataset, showing the label column.)

Examining this reproducibility notebook: while BayeshERG showed competitive performance metrics, the absence of labeled data in the Ersilia Model Hub hindered a direct performance comparison.

DhanshreeA commented 3 months ago

No, you were following the workflow correctly. Part of implementing models within the EMH is to update dependencies and make the models more in tune with current state of the relevant Python versions and libraries. In this case, I think it was okay to cross reference with the implementation in the EMH. We mainly want to see effort!

DhanshreeA commented 3 months ago

@jona42-ui the label is not 'missing' from the EMH. It is intentional. If you look at the interpretation in the README, it says "Interpretation: Probability of hERG channel blockade. The cut-off used in the training set to define hERG blockade was IC50 <= 10 μM". We want to provide only the probability because different users of the model might want to use custom probability thresholds for binarizing the outcome.
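To illustrate the point about custom thresholds, binarizing the returned probabilities downstream is a one-liner (a sketch; `binarize` is a hypothetical helper, not part of the EMH API):

```python
def binarize(probabilities, threshold=0.5):
    """Turn hERG-blockade probabilities into binary blocker labels.

    Different users can pick different cut-offs depending on how
    conservative they want the blocker call to be.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


probs = [0.12, 0.48, 0.51, 0.93]
print(binarize(probs))                  # default 0.5 cut-off -> [0, 0, 1, 1]
print(binarize(probs, threshold=0.8))   # stricter cut-off    -> [0, 0, 0, 1]
```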

DhanshreeA commented 3 months ago

@jona42-ui I don't understand in your plots from task 2, why is the bar width for the BayeshERG model so different from the EMH implementation? As your final task, can you explain (and/or fix this)?

I will be reviewing finally on Monday. If by then you cannot pick up task 3, that's okay. We will move on to the final application. Thank you for the work so far.

jona42-ui commented 3 months ago

My bad! Thanks for the catch @DhanshreeA. It turns out that I needed a more intuitive visualization. The main problem was in how plain histogram plots represent data; the data distribution is key here for a comparable task. Looking at seaborn's KDE (kernel density estimate): https://seaborn.pydata.org/generated/seaborn.kdeplot.html , it really addresses the challenge of estimating the underlying distribution of a dataset without making any assumptions about its functional form.

Technically, KDE works by placing a kernel (often a Gaussian) at each data point and summing these kernels to create a smooth estimate of the underlying distribution. This smooth estimate provides a continuous probability density function, which can be visualized as a smooth curve.

We can now take a look at the side-by-side fix in the reproducibility notebook.
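The mechanism described above can be sketched in a few lines of NumPy (an illustrative re-implementation of the idea, not seaborn's actual code; `gaussian_kde_1d` is a made-up name):

```python
import numpy as np


def gaussian_kde_1d(data, grid, bandwidth=0.1):
    """Average a Gaussian kernel centred at each data point.

    This is the mechanism described above: one kernel per observation,
    summed (here averaged) into a smooth density estimate over `grid`.
    """
    data = np.asarray(data)[:, None]              # shape (n, 1)
    diffs = (grid[None, :] - data) / bandwidth    # shape (n, len(grid))
    kernels = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=0)


rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=500)
xs = np.linspace(-4, 4, 801)
density = gaussian_kde_1d(sample, xs, bandwidth=0.3)

# A valid density estimate should integrate to roughly 1 over the grid.
area = float((density * (xs[1] - xs[0])).sum())
print("approximate area under the KDE curve:", round(area, 3))
```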

jona42-ui commented 3 months ago

Week 3 TASK UPDATE:

To increase its predictive power, the BayeshERG authors pretrained a Bayesian graph neural network on 300,000 molecules as a transfer-learning exercise.

As I searched for a dataset to validate eos4tcc, the hERG Central set, as described here, contains 306,893 compounds.

Justification for the experimental analysis

The human ether-à-go-go related gene (hERG) is crucial for coordinating the heart's beating. Thus, if a drug blocks hERG, it can lead to severe adverse effects. Therefore, reliable prediction of hERG liability in the early stages of drug design is important to reduce the risk of cardiotoxicity-related attrition in later development stages. There are three targets: hERG_at_1uM, hERG_at_10uM, and hERG_inhib.

I mainly used hERG_inhib: binary classification. Given a drug SMILES string, predict whether it blocks (1) or does not block (0) hERG. This is equivalent to whether hERG_at_10uM < -50, i.e. whether the compound has an IC50 of less than 10 µM.
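That equivalence can be written down directly (a sketch based on the TDC description above; `herg_inhib_label` is an illustrative helper, not a TDC function):

```python
def herg_inhib_label(herg_at_10um):
    """Binary blocker label derived from the hERG_at_10uM activity value.

    hERG_inhib = 1 when hERG_at_10uM < -50, i.e. more than 50% inhibition
    at 10 uM, which corresponds to an IC50 below 10 uM.
    """
    return 1 if herg_at_10um < -50 else 0


print(herg_inhib_label(-72.3))  # strong inhibition -> blocker (1)
print(herg_inhib_label(-10.0))  # weak inhibition   -> non-blocker (0)
```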

After settling on the experimental setup, I checked the 306,893 molecules for data leakage; there was no overlap with the training set used on the EMH.

making predictions

I split the large dataset into 70% training, 15% validation, and 15% test data. (Screenshot of the split.) On the 15% validation set, prediction ran endlessly, so I sampled 1,000 molecules from it to obtain a fair prediction run. (Screenshot of the sampled validation set.)
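The split described above might be sketched like this (illustrative only; the `mol_i` strings are stand-ins for the real SMILES, and `split_dataset` is a hypothetical helper):

```python
import random


def split_dataset(items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle and split a dataset into train/validation/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


smiles = [f"mol_{i}" for i in range(306_893)]
train, val, test = split_dataset(smiles)

# Capping the validation subset keeps prediction time manageable.
val_subset = val[:1000]
print(len(train), len(val), len(test), len(val_subset))
```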

Performance metrics

I mainly looked at the Receiver Operating Characteristic (ROC) curve of the EMH model on the external dataset, and at a PCA of molecular fingerprints on the training set and the external set, to evaluate the overlap in chemical space:

ROC-AUC

AUC = 0.491 ± 0.043. While the model shows some capability in predicting hERG channel blockade, its performance, as indicated by the ROC-AUC score, suggests there is room for improvement before it is reliable and useful for practical applications.
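A point estimate with an uncertainty of this form can be computed along these lines (a sketch of one common approach, bootstrap resampling over the Mann-Whitney formulation of AUC; not necessarily the notebook's exact method):

```python
import numpy as np


def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    """ROC-AUC via the rank statistic, with a bootstrap spread.

    AUC equals the probability that a randomly chosen positive is
    scored above a randomly chosen negative (ties count half).
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    def auc(t, s):
        pos, neg = s[t == 1], s[t == 0]
        greater = (pos[:, None] > neg[None, :]).mean()
        ties = (pos[:, None] == neg[None, :]).mean()
        return greater + 0.5 * ties

    rng = np.random.default_rng(seed)
    n = len(y_true)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        t = y_true[idx]
        if t.min() == t.max():   # resample must contain both classes
            continue
        samples.append(auc(t, y_score[idx]))
    return auc(y_true, y_score), float(np.std(samples))


y = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
point, spread = bootstrap_auc(y, scores)
print(f"AUC = {point:.3f} +/- {spread:.3f}")
```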

PCA

From the visualization there is an overlap between the training set and the external dataset, which suggests that the two sets have similar chemical properties or occupy similar regions of chemical space.
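The projection behind such a plot can be sketched with plain NumPy (the fingerprint arrays below are random stand-ins; in the notebook they would come from RDKit Morgan fingerprints, and `pca_2d` is an illustrative helper):

```python
import numpy as np


def pca_2d(X):
    """Project rows of X onto the first two principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # centre the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # 2-D coordinates


rng = np.random.default_rng(1)
# Stand-ins for binary fingerprint bit vectors (200 training, 50 external).
train_fps = rng.integers(0, 2, size=(200, 128))
external_fps = rng.integers(0, 2, size=(50, 128))

# Fit the PCA on both sets together so the axes are shared,
# then split the coordinates back apart for plotting.
coords = pca_2d(np.vstack([train_fps, external_fps]))
train_xy, external_xy = coords[:200], coords[200:]
print(train_xy.shape, external_xy.shape)
```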

The notebook with updated visualizations can be seen here: https://github.com/jona42-ui/eos4tcc-model-validation/blob/main/notebooks/02_external_validation.ipynb

GemmaTuron commented 3 months ago

Hello @jona42-ui

Where did you get the testing data from? It is not clear to me from your comment. Please comment on that and then move on to preparing the final application.

jona42-ui commented 3 months ago

Thanks so much @GemmaTuron, it was basically from the ML task called Toxicity, under the hERG Central dataset index of the Therapeutics Data Commons (TDC).

jona42-ui commented 3 months ago

It's a single-instance prediction. The documentation link I shared above provides a means of importing the dataset that I showed in the notebook.

Putting it here for posterity:

```python
from tdc.utils import retrieve_label_name_list
from tdc.single_pred import Tox

label_list = retrieve_label_name_list('herg_central')
data = Tox(name='herg_central', label_name='hERG_inhib').get_data()
```

jona42-ui commented 3 months ago

I am not sure if I answered your question right, @GemmaTuron; I feel it may not be sufficient.

GemmaTuron commented 3 months ago

Hi @jona42-ui

Yes thanks, I am well familiar with the TDC package so all is clear :)

DhanshreeA commented 3 months ago

Looks good @jona42-ui, the model indeed doesn't do well on external data. Good to know. Please work on your final application now! :)

jona42-ui commented 3 months ago

Thanks so much @DhanshreeA @GemmaTuron for the insightful guidance all the way.