Closed jona42-ui closed 3 months ago
Motivation statement
I am Thembo Jonathan, a final-year (4th-year) Software Engineering student at Makerere University, Kampala, Uganda. I am a learner at heart and passionate about contributing to solutions for global good. The health sector is at the core of my career because I want to write code to save lives. That's me.
Ersilia is, without a shadow of a doubt, the best fit to hone my skills, especially in this revolutionary age of AI/ML. AI/ML found wide adoption at the right time, when young minds like mine have nothing to think about but how to use this knowledge to change the story.
As a software engineering student, I have traditionally gone through all the usual technology stacks, but picking up AI/ML for industrial work is now my goal and my aim for this amazing project, Ersilia.
I am not planning to leave this project after the internship, because it will be part of me.
I can't wait to expand the Model Hub for more use cases.
I write code to save lives
This is Task 1, Week 2: https://github.com/jona42-ui/eos4tcc-model-validation
@DhanshreeA
Hi @jona42-ui, good work so far; however, I am having trouble understanding your code. I see that you are using the model's predicted probabilities and a measure of aleatoric uncertainty, but I do not see the code where you actually generate the predictions and save them to your module_predictions.csv file. Could you please update the code so that it is easier to follow?
Thanks for the timely feedback.
Do you mean literally including how I ran the ersilia run API after serving the model?
Thanks @DhanshreeA, updated as requested here: https://github.com/jona42-ui/eos4tcc-model-validation/blob/main/notebooks/00_model_bias.ipynb
Running the models in the notebook is too cumbersome, so I have referenced how I ran the predictions from my host system in the terminal. I am not sure if this suffices.
Thanks for the mentorship and time
Hello @DhanshreeA, I have started on Task 2, Week 2: https://github.com/jona42-ui/eos4tcc-model-validation/blob/main/notebooks/01_model_reproducibility.ipynb
However, I have a blocker: the BayeshERG model uses PyTorch with the NVIDIA GPU accelerator, CUDA. Checking my system, I seem to have an AMD GPU, which does not work with CUDA.
So, on running predictions, I am facing an error that the libcuda.so.1 library is not installed, even though all dependencies were seemingly installed, which confirms my concern above.
OSError: libcuda.so.1: cannot open shared object file: No such file or directory
Is there a workaround you would suggest, or do I need to change the host system? cc: @GemmaTuron
@jona42-ui, how about trying a CPU workaround: let PyTorch make use of the CPU instead of attempting to use the GPU.
Actually, CPU is the default without any flag, so the problem still persists for whatever reason.
Hi @jona42-ui, all Ersilia models work with CPU by default; if you check the model dependencies for each model, they do not use GPU versions of ML libraries, e.g. PyTorch.
Can you explain to me what you're running, how you are running it, and what errors you are getting?
Thanks @DhanshreeA. I am using this: https://github.com/GIST-CSBL/BayeshERG
and running this:
python main.py -i data/External/EX1.csv -o EX1_pred -c cpu -t 30
after installing all dependencies and activating the virtual env.
The error is here: https://pastebin.com/bfrgVJuQ
full terminal output
(BayeshERG) thembo@workspace:~/BayeshERG$ python main.py -i data/External/EX1.csv -o EX1_pred -c cpu -t 30
Using backend: pytorch
Traceback (most recent call last):
File "main.py", line 7, in
I am doing all this in the terminal
Hi @jona42-ui is this issue still persisting? If yes, could you please share with me the dependencies in your environment?
Hello @DhanshreeA, thanks for reaching out. Just a few minutes ago I was able to get past this issue, and I was about to post an update here.
It turns out that the original repository shared in the paper (for eos4tcc) has not had an update for about 2 years, and it is the one I cloned to experiment with the datasets that were used to train and make predictions.
Many of its classes were deprecated and out of date, yet I was meant to follow its README to make the predictions. I realised the model source code within the Ersilia eos4tcc model has up-to-date files, so when I updated the source files the error was resolved.
I am not sure if I was the one missing the workflow, because my aim was to reproduce the predictions using the original datasets and model that is not hosted on the EMH (Ersilia Model Hub).
For example, compare the main.py files: one contains dgl (which actually appears in the error above), which is no longer supported and was moved to dgllife.
Contains up-to-date files: https://github.com/ersilia-os/eos4tcc/blob/main/model/framework/code/main.py#L16
Contains deprecated files: https://github.com/GIST-CSBL/BayeshERG/blob/78f7654e480009df48a89fc78f6a2ef81d519e71/main.py#L12
Not sure if this makes sense
Update on Week 2, Task 2:
Update here: https://github.com/jona42-ui/eos4tcc-model-validation/blob/main/notebooks/01_model_reproducibility.ipynb
Basically, the authors collected hERG-related data from various sources and built two datasets for different tasks: pretraining and fine-tuning. The pretraining set is a regression task that predicts the hERG channel inhibition percentage at 10 μM, and the fine-tuning set is a classification task that predicts IC50-derived hERG channel blockers at a threshold of 10 μM.
But for generalisation of model performance, they prepared two additional external test sets. The first external test set was from Ryu et al. [8], which contained 30 hERG positives and 14 hERG negatives based on an IC50 threshold of 10 μM. This is the dataset I have used throughout this task.
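As a toy illustration of the 10 μM cut-off described above, binarizing IC50 values into blocker labels might look like this (the values below are invented for illustration, not drawn from the actual dataset):

```python
# Hypothetical sketch: deriving binary hERG-blocker labels from IC50 values (uM)
# using the paper's 10 uM cut-off. Values are made up for illustration.
ic50_um = [0.5, 12.0, 9.9, 30.0]
labels = [1 if v <= 10.0 else 0 for v in ic50_um]  # 1 = blocker, 0 = non-blocker
print(labels)  # [1, 0, 1, 0]
```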
For the predictions and output as described here, using (with CPU):
$ python main.py -i data/External/EX1.csv -o EX1_pred -c cpu -t 30
The output is obtained with the metric measurements as described here.
@DhanshreeA, there is a piece missing in the predictions obtained from the Ersilia Model Hub: the label. For example, labels are binary (1 for active, 0 for inactive) and are used to assess the performance and accuracy of the model in distinguishing between active and inactive compounds. Tracing back, this should have originated from the reference dataset used.
Looking at the external dataset from the BayeshERG model, the label is included; it serves as ground-truth data for training and evaluating predictions.
Examining this reproducibility notebook: while BayeshERG showed competitive performance metrics, the absence of labeled data in the Ersilia Model Hub hindered a direct performance comparison.
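To make the comparison concrete, pairing EMH probability outputs with ground-truth labels from the external set could be sketched like this (the SMILES strings and values here are toy placeholders, not the real predictions):

```python
# Hypothetical sketch: joining EMH probability outputs with external ground-truth
# labels by SMILES so performance metrics can be computed. All values are made up.
emh_probs = {"CCO": 0.8, "c1ccccc1": 0.3}      # SMILES -> predicted probability
external_labels = {"CCO": 1, "c1ccccc1": 0}    # SMILES -> ground-truth label
paired = [(prob, external_labels[smi])
          for smi, prob in emh_probs.items() if smi in external_labels]
print(paired)  # [(0.8, 1), (0.3, 0)]
```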
No, you were following the workflow correctly. Part of implementing models within the EMH is to update dependencies and make the models more in tune with current state of the relevant Python versions and libraries. In this case, I think it was okay to cross reference with the implementation in the EMH. We mainly want to see effort!
@jona42-ui the label is not 'missing' from the EMH. It is intentional. If you look at the interpretation in the README, it says "Interpretation: Probability of hERG channel blockade. The cut-off used in the training set to define hERG blockade was IC50 <= 10 μM". We want to provide only the probability because different users of the model might want to use custom probability thresholds for binarizing the outcome.
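A minimal sketch of what such custom binarization might look like on the user's side (the threshold and probabilities here are illustrative):

```python
# Sketch: binarizing EMH probability outputs with a user-chosen threshold.
# The 0.5 cut-off is arbitrary; users may pick stricter or looser values.
probs = [0.12, 0.55, 0.91]
threshold = 0.5
preds = [int(p >= threshold) for p in probs]
print(preds)  # [0, 1, 1]
```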
@jona42-ui I don't understand in your plots from task 2, why is the bar width for the BayeshERG model so different from the EMH implementation? As your final task, can you explain (and/or fix this)?
I will be reviewing finally on Monday. If by then you cannot pick up task 3, that's okay. We will move on to the final application. Thank you for the work so far.
My bad! Thanks for the catch, @DhanshreeA. It turns out that I needed a more intuitive visualization. The main problem was how plain histogram plots represent data; the data distribution is key here for a more comparable task. Looking at seaborn's KDE (kernel density estimate): https://seaborn.pydata.org/generated/seaborn.kdeplot.html, it really addresses the challenge of estimating the underlying distribution of a dataset without making any assumptions about its functional form.
Technically, KDE works by placing a kernel (often a Gaussian) at each data point and summing up these kernels to create a smooth estimate of the underlying distribution. This smooth estimate provides a continuous probability density function, which can be visualized as a smooth curve.
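The idea can be sketched without seaborn: a tiny Gaussian KDE (illustrative only; seaborn's implementation additionally handles bandwidth selection and plotting):

```python
import math

# Minimal Gaussian KDE sketch: place a Gaussian kernel at each data point and
# sum them to get a smooth density estimate. Bandwidth is fixed for illustration.
def kde(x, data, bandwidth=0.1):
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

data = [0.1, 0.2, 0.8, 0.85, 0.9]
# Density is higher near the cluster around 0.85 than in the gap at 0.5
print(kde(0.85, data) > kde(0.5, data))  # True
```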
We can now take a look at the side-by-side fix in the reproducibility notebook.
To increase its (BayeshERG's) predictive power, the authors pretrained a Bayesian graph neural network on 300,000 molecules as a transfer-learning exercise.
As I searched for a dataset to validate eos4tcc, the hERG Central dataset, as described here, contains 306,893 drugs.
The human ether-à-go-go related gene (hERG) is crucial for the coordination of the heart's beating. Thus, if a drug blocks hERG, it could lead to severe adverse effects. Therefore, reliable prediction of hERG liability in the early stages of drug design is quite important to reduce the risk of cardiotoxicity-related attrition in later development stages. There are three targets: hERG_at_1uM, hERG_at_10uM, and hERG_inhib.
I mainly used hERG_inhib: binary classification. Given a drug SMILES string, predict whether it blocks (1) or does not block (0) the hERG channel. This is equivalent to whether hERG_at_10uM < -50, i.e. whether the compound has an IC50 of less than 10 µM.
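As a toy check of that rule, deriving hERG_inhib from hERG_at_10uM values might look like this (the activity values are invented for illustration):

```python
# Sketch: deriving the binary hERG_inhib label from hERG_at_10uM activity values
# using the "< -50" rule quoted above. Values are made up for illustration.
herg_at_10um = [-80.0, -20.0, -55.0]
herg_inhib = [1 if v < -50 else 0 for v in herg_at_10um]  # 1 = blocker
print(herg_inhib)  # [1, 0, 1]
```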
After the experimental analysis of the 306,893 molecules, I checked for data leakage and found no overlap with the training set used on the EMH.
I split the large dataset into 70% training, 15% validation, and 15% test data.
On the 15% validation set, prediction ran endlessly, so I subsampled 1,000 molecules to obtain a fair validation set.
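The split described above could be sketched like this (random row indices stand in for the actual TDC dataframe rows):

```python
import random

# Sketch of a 70/15/15 train/validation/test split over row indices.
# A fixed seed keeps the split reproducible across runs.
random.seed(0)
indices = list(range(1000))  # stand-in for the 306,893 actual rows
random.shuffle(indices)
n = len(indices)
train = indices[:int(0.70 * n)]
valid = indices[int(0.70 * n):int(0.85 * n)]
test = indices[int(0.85 * n):]
print(len(train), len(valid), len(test))  # 700 150 150
```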
I mainly looked at the Receiver Operating Characteristic (ROC) curve of the EMH model on the external dataset, and at the PCA of molecular fingerprints on the training set and the external set, to evaluate the overlap in chemical space:
AUC = 0.491 ± 0.043. While the model shows some capability in predicting hERG channel blockade, its performance, as indicated by the ROC-AUC score, suggests there is room for improvement before it is reliable and useful for practical applications.
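For intuition, ROC-AUC can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative (the Mann-Whitney formulation), so an AUC near 0.49 means the model ranks positives above negatives about as often as a coin flip. A toy computation (scores invented, not the notebook's data):

```python
# Sketch: ROC-AUC via the rank (Mann-Whitney) formulation over all
# positive/negative score pairs. Scores below are made up for illustration.
pos_scores = [0.9, 0.6, 0.4]   # scores assigned to true positives
neg_scores = [0.5, 0.3]        # scores assigned to true negatives
pairs = [(p, q) for p in pos_scores for q in neg_scores]
auc = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs) / len(pairs)
print(round(auc, 3))  # 0.833
```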
From the visualization, there is an overlap between the training set and the external dataset, which suggests that the two sets have similar chemical properties or are located in similar regions of chemical space.
The notebook with updated visualizations can be seen here: https://github.com/jona42-ui/eos4tcc-model-validation/blob/main/notebooks/02_external_validation.ipynb
Hello @jona42-ui
Where did you get the testing data from? It is not clear to me from your comment. Please comment on that and then move on to preparing the final application.
Thanks so much @GemmaTuron. It was basically from the ML task called Toxicity, under the hERG Central dataset index of the Therapeutics Data Commons (TDC).
It's a single-instance prediction task. The documentation link I shared above provides a means of importing the dataset that I showed in the notebook.
Putting it here for posterity:
from tdc.utils import retrieve_label_name_list
from tdc.single_pred import Tox
data = Tox(name='herg_central', label_name='hERG_inhib').get_data()
I am not sure if I answered your question correctly, @GemmaTuron; I feel it may not be sufficient.
Hi @jona42-ui
Yes thanks, I am well familiar with the TDC package so all is clear :)
Looks good @jona42-ui, the model indeed doesn't do well on external data. Good to know. Please work on your final application now! :)
Thanks so much @DhanshreeA @GemmaTuron for the insightful guidance all the way.
Week 1 - Get to know the community
Week 2 - Get Familiar with Machine Learning for Chemistry
Week 3 - Validate a Model in the Wild
Week 4 - Prepare your final application