✍️ Contribution period: Sharon_Atieno

ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.

https://ersilia.io

GNU General Public License v3.0


✍️ Contribution period: Sharon_Atieno #1003

Closed atienosonia closed 4 months ago

atienosonia commented 5 months ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[X] Install the Ersilia Model Hub and test the simplest model
[x] Install Docker if needed, and test another model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Get Familiar with Machine Learning for Chemistry

[x] Select a model from the list suggested in GitBook
[x] Download and serve the model via the Ersilia Model Hub to ensure it works
[x] Open a repository on your GitHub user with all the necessary files
[x] Select and clean a dataset of 1000 molecules (example notebook 1)
[x] Run predictions for the molecules on the selected model and evaluate the results

Week 3 - Validate a Model in the Wild

[ ] Find a suitable dataset with sufficient experimental results
[ ] Clean and standardize the dataset
[ ] Run predictions and calculate metrics.

Week 4 - Prepare your final application

[ ] Submit the final application in the Outreachy website

atienosonia commented 5 months ago

Successfully installed Ersilia. I tested that Ersilia works using ersilia --help and ersilia catalog, running the commands in my Anaconda terminal as administrator. I am now trying to test a simple model, but it gives me the error below:

🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

Error message:

module 'os' has no attribute 'copy'

I tried running the command in verbose mode as suggested, but I still get the same error.

Malikbadmus commented 5 months ago

@atienosonia, can you save the bug output in a file and share it?

It will make it easier to debug.

atienosonia commented 5 months ago

ersilia_test_model_bug.txt @Malikbadmus can you access the file ?

Malikbadmus commented 5 months ago

@atienosonia, Yes I can.

Did you run this command pip install -e . after cloning the ersilia repo and navigating to the directory?

Ajoke23 commented 5 months ago

@atienosonia Try the following:

Reinstall the isaura package, version 0.1.
Which version of Python are you using? Make sure it is between 3.7 and 3.11 (inclusive).
The error that Ersilia isn't working might be caused by either a network error or an incomplete installation. During your installation process, did you run:
```
pip install -e .
```

Ensure you add the dot after -e

Do this and let me know if it works
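
The Python version constraint mentioned above can be checked with a short script. This is only a sketch: the bounds 3.7 and 3.11 are taken from this comment, and the helper name python_version_supported is made up for illustration.

```python
import sys

# Bounds taken from the comment above (inclusive); adjust if the Ersilia docs change.
MIN_VERSION = (3, 7)
MAX_VERSION = (3, 11)


def python_version_supported(version=None):
    """Return True if (major, minor) lies within the suggested range."""
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_version_supported() else "outside the suggested range"
    print(f"Python {current} is {status}")
```

Running it inside the conda environment used for Ersilia shows whether the interpreter in that environment, rather than the system one, falls in the suggested range.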

atienosonia commented 5 months ago

I was able to solve the bug. I could not fetch the model because I was running on Windows. I switched to WSL with Ubuntu 18.04, redid the prerequisite installations one by one, and was then able to test the model successfully.

Ajoke23 commented 5 months ago

Glad you were able to resolve the bug.

atienosonia commented 5 months ago

Motivation Letter

There is a statement that goes “No one is free until we are all free.” This statement recognizes our humanity, experiences, struggles, and need for a community. It also highlights the fact that we must work together to bring about change.

Open-source software is most often developed by communities, an approach that ensures the flexibility and longevity of the software. This collaboration allows each user's experience to be included in building it. This would be my first time contributing to such a movement.

I have one year of experience in data science using Python. I have done data science projects using scikit-learn, a machine learning framework for Python. Through these projects, I have built research and analytical skills that have fostered my growth in quantitative research. In addition, I rely on Git for version control and GitHub for team collaboration.

The Ersilia project realises there is a disconnect between individuals who build these machine-learning tools and the professionals who work in the industry. I would like to work with Ersilia to bring this adoption of technology into day-to-day research in laboratories and hospitals. Where I come from, technology adoption mostly looks at file management and medical imaging. I not only want to build these machine learning models that help in drug discovery, but I also want to work with scientists and researchers to understand their needs and pain points.

Contributing to the Ersilia project will allow me to learn how these machine-learning models are built, advance my skills in writing research reports, and come up with relevant research questions. In addition, the guidance and mentorship would support my professional growth and development.

After the internship, I would like to immerse myself in the research and documentation of drug discovery, and in testing machine learning models on diseases that are neglected in my country.

DhanshreeA commented 5 months ago

Thanks for the updates @atienosonia, yes for Windows users, WSL is recommended.

atienosonia commented 5 months ago

Question

I have been reading the paper for the model I have chosen, eos9tyg, which predicts membrane permeability. In the paper, the dataset used to make predictions was PAMPA pH 5. You can find the data on PubChem (AID: 1645871), or go to the site hosting the dataset, choose PAMPA pH 5, and download the data file. Looking at the CSV file, I realized it has only five columns: PUBCHEM_SID, PUBCHEM_CID, PUBCHEM_ACTIVITY_OUTCOME, Phenotype (0-10: Low Permeability; 10-100: Moderate Permeability; >100: High Permeability), and Permeability. According to the Ersilia Book, input molecules are supposed to be SMILES strings, and that column is missing. Can the PUBCHEM_CID column be used in place of the missing SMILES column? I understand that, according to PubChem, a CID is a non-zero integer identifying a unique chemical structure. If using the PUBCHEM_CID column is not the right way to make predictions, how should I go about getting the right dataset?

DhanshreeA commented 5 months ago

Hi @atienosonia, thank you for the question. I saw the dataset you referred to. Unfortunately, to my knowledge, a PubChem SID is not an accepted input for drug discovery models. I would recommend using a PubChem API that, given a PubChem SID, returns its corresponding SMILES string. Hope this helps.
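
One way to follow this suggestion is PubChem's PUG REST interface. The sketch below is illustrative only and rests on a few assumptions: it looks up the PUBCHEM_CID column rather than the SID, because PUG REST compound properties are keyed by CID; it assumes the JSON response has a PropertyTable/Properties layout; and the function names are made up. The SMILES property name is a parameter because PubChem's property names have changed over time, so check the PUG REST documentation for the current one.

```python
import json
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_smiles_url(cids, smiles_property="CanonicalSMILES"):
    """Build a PUG REST URL requesting a SMILES property for the given CIDs."""
    cid_list = ",".join(str(int(cid)) for cid in cids)
    return f"{PUG_REST}/compound/cid/{cid_list}/property/{smiles_property}/JSON"


def parse_smiles(payload):
    """Map CID -> SMILES from a property response (assumed response shape)."""
    result = {}
    for entry in payload.get("PropertyTable", {}).get("Properties", []):
        # Take whichever key carries the SMILES string, since the key name
        # depends on the property that was requested.
        smiles = next((v for k, v in entry.items() if "SMILES" in k.upper()), None)
        if smiles is not None:
            result[entry["CID"]] = smiles
    return result


def fetch_smiles(cids, timeout=30):
    """Query PubChem over the network; call sparingly and in small batches."""
    with urllib.request.urlopen(build_smiles_url(cids), timeout=timeout) as response:
        return parse_smiles(json.load(response))
```

Rows with a missing CID would need to be dropped before calling fetch_smiles, since int() fails on empty values. The returned dictionary can then be used to add a SMILES column to the dataset.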

atienosonia commented 5 months ago

thank you

atienosonia commented 5 months ago

Review and Feedback

Hey @DhanshreeA, I understand you were reviewing issues today. I'm a little behind, and I didn't want the day to end without you reviewing my Task 1 for this week. I used Ersilia's Python package to interact with the model of my choice (eos2ta5) and generate predictions, and I opted to use the dataset you sent.

I am able to fetch and serve the model and generate predictions. However, if you look at my index.ipynb file, you will notice this warning after serving the model: sudo: unknown user: udockerusername, followed by sudo: unable to initialize policy plugin. I understand it's a configuration issue; I have been trying to fix it but haven't managed to. I would appreciate any help with it.

I have drawn only one plot from my predictions, but I feel it doesn't capture enough information about the predictions and the model. What other plots would you suggest, or what else could be visualized for the model I have used?

Please review my README and let me know what information to add. I'm not done with it yet because I wanted your feedback on running the 1000 molecules first, but feel free to comment on it. I would appreciate any other feedback you would like to add. You can find the link to my GitHub account here!

Ajoke23 commented 5 months ago

Hi @atienosonia. Since you are working on hERG blockade, I suggest using a scatter plot, which is suited to numerical variables. First, you can plot the predicted value (i.e. the probability column) against the dataframe index, e.g. plt.scatter(df.index, df['probability']).

Given the objective of the study, you can set a threshold probability to classify each compound as a hERG blocker or a hERG non-blocker.

For example, if the threshold probability is 0.5, the rule is: if the predicted value is >= 0.5, consider the compound a hERG blocker; otherwise, consider it a hERG non-blocker. You can then use .value_counts() to count the compounds in each class, and use a bar chart to show the distribution of the hERG classification.
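
The thresholding rule described above can be sketched in plain Python. The 0.5 cut-off is the example value from this comment, not a validated threshold; the probabilities are made up purely for illustration; and the function names are hypothetical.

```python
from collections import Counter

THRESHOLD = 0.5  # example cut-off from the comment above, not a validated value


def classify_herg(probability, threshold=THRESHOLD):
    """Label a predicted probability as 'Blocker' or 'Non-blocker'."""
    return "Blocker" if probability >= threshold else "Non-blocker"


def class_counts(probabilities, threshold=THRESHOLD):
    """Count how many compounds fall into each class."""
    return Counter(classify_herg(p, threshold) for p in probabilities)


# Made-up probabilities purely for illustration, not real model output.
example = [0.91, 0.12, 0.50, 0.47, 0.76]
counts = class_counts(example)
print(counts)
```

With a pandas DataFrame, the equivalent would be a new label column built from the probability column, followed by .value_counts() as suggested above. The resulting counts can be passed straight to a bar chart.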

SCATTER PLOT WITH TWO NUMERICAL VALUES

To achieve this, since your predictions give only one numerical column, I recommend featurizing the SMILES column using Morgan fingerprints and then reducing the dimensionality of those features to two components for the X and Y axes. Your scatter plot will then have numerical values on both axes.

To reduce dimensionality, I recommend PCA (Principal Component Analysis), UMAP (Uniform Manifold Approximation and Projection), or both, depending on your preference. You can then draw a scatter plot of the PCA or UMAP projection.

atienosonia commented 5 months ago

@DhanshreeA, please disregard the first GitHub link and use this link instead to review my week 2 task 1.

atienosonia commented 5 months ago

@Ajoke23 thank you for your suggestions, I will look into them.

Ajoke23 commented 5 months ago

You're welcome. Let us know how it goes.

DhanshreeA commented 5 months ago

@atienosonia could you please adhere to the structure of the repository as explained in the gitbook? I don't see a data folder, or a figures folder. I don't understand the way you have organized the notebooks by tasks either. It's quite hard to review this work. Please update and let me know, thank you.

atienosonia commented 5 months ago

@DhanshreeA I have added both a data folder and a figures folder. I have also organized my notebooks according to the tasks given for week 2. I would appreciate it if you could review my work.

atienosonia commented 5 months ago

cardiotoxlog.txt

Week 2 task 2 update

Hello @DhanshreeA, I'm having trouble with the model reproducibility task. Here is the publication for the model I chose; the model they used is called CardioTox net, and this is the source code with the installation steps. I have tried running the installation, but when I run the last step, python test.py, I get the error ModuleNotFoundError: No module named 'numpy.random.bit_generator'. I have also attached the log file showing the error. The source code was last updated 3-4 years ago. The model is also not available as a Python package, so I can't install it with pip. I could take up a different model and try to complete the task with it. Let me know if this is possible, given that the contribution period is almost over and I might not get any guidance on my work.

Ajoke23 commented 5 months ago

From the error log file, I presume you ran the test on Ubuntu. You can use pip to install the packages on Ubuntu. I also worked on the CardioTox model, and I was able to run the authors' code and got exactly the output reported in the publication.

A few things to note: specify the exact versions of tensorflow, keras, pybel, mordred and scikit-learn that you install. Check the exact version numbers in the requirements.txt file on the CardioTox GitHub page.

NOTE: you need to install those versions first, before running python test.py.
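
Before running python test.py, it may help to confirm that the installed versions match the pins. The following is a minimal sketch using only the standard library. The function names are hypothetical, it only understands plain name==version lines (no environment markers or extras), and the real pins must be read from the project's own requirements.txt.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Compare installed versions with the pins.

    Returns {package: (pinned, installed)} for each mismatch, where installed
    is None if the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Reading requirements.txt into a string, passing it through parse_pins and then check_pins gives a dictionary of mismatches; an empty dictionary means every pinned package is installed at the pinned version.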

atienosonia commented 5 months ago

@Ajoke23 your support is truly admirable, thank you. I managed to get past the error, although I then got another error saying my TensorFlow was incompatible with my NumPy version. Nevertheless, I was able to generate the evaluation metrics, and I'm attaching my output from running python test.py. I also have a question about this. Am I only required to generate the evaluation metrics from the command line, note them down and compare them to the publication, and then move to my notebook, generate evaluation metrics using the Ersilia model and compare the results again? Is this how it's supposed to be done? Initially I thought I had to run both CardioTox and eos2ta5 in my notebook and record the evaluation metrics of both.

cardiotoxeval

Ajoke23 commented 5 months ago

Thank you for the nice comment. Yes, you are right, it is exactly as you have stated. I ran CardioTox on Ubuntu, and ran the dataset the authors used through the Ersilia model in a Jupyter notebook. If you get the same results with the Ersilia model, then it's reproducible.
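
The comparison described above can be made explicit with a small helper. This is only a sketch: the function names and the tolerance of 0.01 are assumptions chosen for illustration, and the metric dictionaries have to be filled in with the values from the publication and from your own run.

```python
import math


def compare_metrics(reported, reproduced, abs_tol=0.01):
    """Return {metric: (reported, reproduced, matches)} for shared metrics."""
    comparison = {}
    for name in sorted(set(reported) & set(reproduced)):
        a, b = reported[name], reproduced[name]
        comparison[name] = (a, b, math.isclose(a, b, abs_tol=abs_tol))
    return comparison


def is_reproducible(reported, reproduced, abs_tol=0.01):
    """True if every shared metric agrees within the tolerance."""
    comparison = compare_metrics(reported, reproduced, abs_tol)
    return bool(comparison) and all(ok for _, _, ok in comparison.values())
```

The same helper can be used twice: once to compare the publication with the output of python test.py, and once to compare the publication with the metrics computed from the Ersilia model's predictions.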

DhanshreeA commented 5 months ago

Hi @atienosonia good work so far! Thank you for restructuring the repository nicely. You can go ahead and work on the final application!

atienosonia commented 5 months ago

thank you @DhanshreeA

atienosonia commented 5 months ago

thank you @Ajoke23