✍️ Contribution period: Ajoke Yusuf

Ajoke23 commented 4 months ago

Week 1 - Get to know the community

[x] Join the communication channels
[x] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Install Docker if needed, and test another model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Get Familiar with Machine Learning for Chemistry

[x] Select a model from the list suggested in GitBook
[x] Download and serve the model via the Ersilia Model Hub to ensure it works
[x] Open a repository on your GitHub user with all the necessary files
[x] Select and clean a dataset of 1000 molecules (example notebook 1)
[x] Run predictions for the molecules on the selected model and evaluate the results

Week 3 - Validate a Model in the Wild

[x] Find a suitable dataset with sufficient experimental results
[x] Clean and standardize the dataset
[x] Run predictions and calculate metrics.

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

Ajoke23 commented 4 months ago

Week 1 DAY 1 (5th March, 2024)

I joined the Slack communication channel on 5th March 2024 to express my interest in contributing to the success of Ersilia's project.
I introduced myself in the #general channel, which can be found here.
I have been following the Ersilia GitHub page, and I have forked and starred the repository. So I went ahead to check out the Ersilia issues and what Ersilia has been working on.
I came across Ersilia's Code of Conduct, which I read.

DAY 2 (6th March, 2024)

I went through this documentation on the Ersilia Model Hub installation.
I started the installation process and encountered an error at prerequisite 5, the Isaura data lake installation. The error can be viewed here: isaura.txt
While debugging, I decided to install Ersilia first, using this code:
```
# create a conda environment
conda create -n ersilia python=3.7
# activate the environment
conda activate ersilia
```
Afterwards, I ran the Isaura data lake installation again and it was successful. I then realized that the earlier error occurred because building the h5py wheel requires a Python installation, which is why the Ersilia environment had to be set up before installing Isaura.
I proceeded with the other steps outlined, which were successful. To confirm that Ersilia was installed and the CLI was working, I ran the following commands:
```
# see ersilia CLI options
ersilia --help
# see ersilia's model catalog
ersilia catalog
```
Output: output1.log, catalog_output.log. This output shows that Ersilia was installed successfully and the CLI is working.
Now that Ersilia is recognized in Ubuntu, I tested the eos3b5e model by fetching it, serving it, and calculating the molecular weight as required in the task. I got the following outputs: fetch.log, serve.log, model_output.log

DAY 3 (7th March, 2024)

I started with the installation of Docker. Before proceeding, I took my time to read the Docker documentation thoroughly.
To install Docker on Ubuntu, I took the following steps:
1. Updated the existing list of packages using `sudo apt update`
2. Installed the required dependencies using `sudo apt install apt-transport-https ca-certificates curl software-properties-common`
5. Made sure Docker would be installed from the Docker repository using `apt-cache policy docker-ce`
6. Afterwards, I started and enabled Docker using the following commands:
```
sudo systemctl start docker
sudo systemctl enable docker
```
7. To confirm that Docker was installed, I ran `docker --version` to check the installed version and got the output below:
```
(base) ajoke@DESKTOP-KTJU3QV:~$ docker --version
Docker version 25.0.4, build 1a576c5
```
  TESTING OF MODEL eos4wt0

I pulled the eos4wt0 model from the Ersilia Model Hub and got this output:


(base) ajoke@DESKTOP-KTJU3QV:~$ sudo docker pull ersiliaos/eos4wt0:latest
latest: Pulling from ersiliaos/eos4wt0
8b91b88d5577: Already exists
824416e23423: Already exists
bbe2c2981082: Already exists
7b6b68d15a5c: Already exists
71f8f4db541d: Already exists
4f4fb700ef54: Pull complete
b29b0c06109d: Already exists
ddc20b6d4ab1: Pull complete
bb4587482098: Pull complete
28489519aef7: Pull complete
35554e140baa: Pull complete
Digest: sha256:9738b7353c56e9d26373edd73e6ff299166322b9cbd1513ff3ed85133d038e90
Status: Downloaded newer image for ersiliaos/eos4wt0:latest
docker.io/ersiliaos/eos4wt0:latest

- Then I ran `sudo docker ps`.
**Output:**

| CONTAINER ID | IMAGE | COMMAND | CREATED | STATUS | PORTS | NAMES |
| -- | -- | -- | -- | -- | -- | -- |
| b458fce09d65 | ersiliaos/eos4wt0:latest | "sh /root/docker-ent…" | 3 hours ago | Up 3 hours | 0.0.0.0:37183->80/tcp | eos4wt0_d75e |

- I tested the fetched model eos4wt0 by running `ersilia serve eos4wt0` followed by `ersilia -v api run -i "CCCC"`.

**Output:** [model.log](https://github.com/ersilia-os/ersilia/files/14551529/model.log), which matches the output shown in the GitBook.

**DAY 4 (8th of March 2024)** 
**MOTIVATION STATEMENT**
I'm Ajoke Yusuf, a Data Scientist, Machine Learning enthusiast, and SDG 3 advocate. I'm a hardworking, resourceful, goal-oriented individual with strong analytical and problem-solving skills and an unending quest for knowledge. I pride myself on being a fast learner, and I have honed strong skills in problem-solving and research.
Because of my experience contributing to Ersilia during the contribution stage in October 2023, I decided to apply again, hoping the Ersilia project would be there: I had a wonderful learning experience, coupled with an amazing community. When I received the Outreachy email and checked the list of projects, I scrolled down to the letter "E" to look for Ersilia, and I was so excited to see the Ersilia project.

I chose the Ersilia project mainly because its aim and mission align with my goals and career objectives as an impact maker and an SDG 3 advocate. My interest in drug discovery was ignited by my personal experience with cerebral malaria, which almost took my life, and by the death of a friend who lost her life to sickle cell disease.

As an Engineering graduate living in Nigeria, I find the increasing mortality rate from infectious diseases in Nigeria and sub-Saharan Africa alarming, which is why I developed an interest in the biomedical field. According to UNICEF (United Nations International Children's Emergency Fund), **_infectious disease is the major cause of mortality in children ≤ 5 years_**, as cited from this [article](https://data.unicef.org/topic/child-survival/under-five-mortality/). A report hosted by the NIH (National Institutes of Health) National Library of Medicine & NCBI (National Center for Biotechnology Information) states that _**"The infrastructure and level of support for surveillance, research, and training on emerging infectious diseases in Africa are extremely limited"**_, as cited from this [article](https://www.ncbi.nlm.nih.gov/books/NBK99567/#:~:text=At%20a%20time%20when%20increasing,%2C%20yellow%20fever%2C%20and%20trypanosomiasis).

With my knowledge of Python for Data Science and some knowledge of Machine Learning, together with strong analytical and research skills, I believe that contributing to this project will help me gain knowledge and technical skills for advancing and improving health research in Nigeria, Africa, and eventually the world.

**If accepted for the 3-month internship**, I'll commit myself to bringing suggestions, carrying out research, and collaborating with the Ersilia team while learning and honing my skills in Artificial Intelligence and Machine Learning. The internship will strengthen my research and problem-solving skills, which will be useful in the long run for advancing technology in the health sector and for making a sustainable impact on health research in Nigeria. I firmly believe that the availability and accessibility of scientific tools and data-driven insights are necessary for solving prevalent health challenges. As a young lady living in Nigeria, an underdeveloped, low-income country, I have experienced the challenges of accessing tools for prevalent infectious diseases in my community and in the country at large.
**After the internship**, I plan to use the skills gained to improve and sustain health research tools, help solve prevalent health problems in Nigeria, and reduce the mortality caused by infectious diseases, thereby building sustainable research skills that will leave a long-lasting impact on the health sector in my community, Nigeria, sub-Saharan Africa, and eventually the world.

Ajoke23 commented 3 months ago

WEEK 2

TASK 1

Model selected - eos2ta5

Repository created - here

@DhanshreeA @GemmaTuron I would appreciate any feedback.

TASK 2: MODEL REPRODUCIBILITY

A detailed explanation can be found in the repository above.

IMPLEMENTATION OF THE AUTHORS' MODEL

I took the following steps to implement the authors' source code in the Ubuntu terminal:

I already had the conda dependencies installed.

Set up the cardiotox package in a conda environment:

# create a conda environment
conda create -n cardiotox python=3.7.7
# activate the environment
conda activate cardiotox

Installing PyBioMed:

cd cardiotox
cd PyBioMed
python setup.py install

Return to the parent directory with `cd ..`

Installing the package versions the authors used:

pip install tensorflow==2.3.1
pip install sklearn==0.0
pip install mordred==1.2.0
pip install pybel==0.14.10
pip install keras==2.4.3

Testing the model with `python test.py`

OUTPUT: Authors' source code output

RESULT COMPARISON OF THE CARDIOTOX & eos2ta5 MODEL

Test set-I result:

| MODEL | MCC | NPV | ACC | PPV | SPE | SEN | B-ACC |
| -- | -- | -- | -- | -- | -- | -- | -- |
| eos2ta5 | 0.599 | 0.688 | 0.818 | 0.893 | 0.786 | 0.833 | 0.810 |
| Cardiotox | 0.599 | 0.688 | 0.810 | 0.893 | 0.786 | 0.833 | 0.810 |

Test set-II result:

| MODEL | MCC | NPV | ACC | PPV | SPE | SEN | B-ACC |
| -- | -- | -- | -- | -- | -- | -- | -- |
| eos2ta5 | 0.452 | 0.688 | 0.683 | 0.455 | 0.6 | 0.909 | 0.754 |
| Cardiotox | 0.452 | 0.688 | 0.755 | 0.455 | 0.6 | 0.909 | 0.755 |
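
For readers unfamiliar with the column names in the tables above, here is a minimal sketch of how these metrics are usually computed from confusion-matrix counts. It is not the code from the repository; the function name `classification_metrics` and the example counts are hypothetical and are not taken from either test set.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return the tables' metrics from confusion-matrix counts.

    Assumes each class and each predicted class has at least one
    example, so none of the ratio denominators is zero.
    """
    sen = tp / (tp + fn)                    # SEN: sensitivity
    spe = tn / (tn + fp)                    # SPE: specificity
    ppv = tp / (tp + fp)                    # PPV: positive predictive value
    npv = tn / (tn + fn)                    # NPV: negative predictive value
    acc = (tp + tn) / (tp + tn + fp + fn)   # ACC: accuracy
    b_acc = (sen + spe) / 2                 # B-ACC: balanced accuracy
    # MCC: Matthews correlation coefficient (0.0 if the denominator is 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"MCC": mcc, "NPV": npv, "ACC": acc, "PPV": ppv,
            "SPE": spe, "SEN": sen, "B-ACC": b_acc}


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=40, tn=30, fp=10, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Computing every metric from the same four counts makes it easy to cross-check a table row: ACC and B-ACC, for instance, must both be consistent with the reported SEN and SPE.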

Ajoke23 commented 3 months ago

Hi @DhanshreeA, I can see that you reacted to my comment here. Any feedback so far? My Week 2 Task 2 (model reproducibility) is still in progress, but I would appreciate any feedback on Week 2 Task 1.

Ajoke23 commented 3 months ago

Hi @DhanshreeA, I have completed the Week 2 Task 2 (model reproducibility) task. I look forward to your feedback and advice, so that I can pass the same knowledge on to other contributors when we help each other.

A detailed README covering Week 2 (both Tasks 1 & 2) can be found here.

Ajoke23 commented 3 months ago

Hi @DhanshreeA, I am yet to get feedback on Task 2. I would appreciate any feedback from you, since my Week 3 task depends on it. Thank you for your time. I look forward to hearing from you.

DhanshreeA commented 3 months ago

Hi @Ajoke23, thank you for your patience, and great progress so far! A few comments:

I see that you are using the RDKit module to standardize SMILES, but this only creates canonical SMILES. To standardize, you have to use the standardizer module, as shown in the example code in the model validation template we provided.
I don't see an implementation of the model from the paper, only the use of the Ersilia implementation. Is there a reason for that? Please point me to it if I have missed it.

Ajoke23 commented 3 months ago

Hi @DhanshreeA. Thank you for the feedback.

I have corrected that. I had to import the standardise module, so I added the line `from standardiser import standardise`. I then reran all my code and uploaded the new ipynb file to the repository.
It was an oversight; I thought I had added it. I have updated it here and also in the README of my repository.

Ajoke23 commented 3 months ago

Hi @DhanshreeA, can I move on to Week 3 now that I have made the necessary changes based on your feedback? I look forward to receiving feedback from you. Thank you for your time.

GemmaTuron commented 3 months ago

Thanks @Ajoke23, we will provide feedback today!

Ajoke23 commented 3 months ago

Thank you so much @GemmaTuron. I look forward to getting the feedback.

DhanshreeA commented 3 months ago

Hi @Ajoke23, this looks good to me. Please proceed to task 3, and I will review it on Monday. Thereafter you can work on your final application.

Ajoke23 commented 3 months ago

Alright @DhanshreeA. Thanks a lot for the feedback. I will start immediately.

Ajoke23 commented 3 months ago

WEEK 3: VALIDATE A MODEL IN THE WILD

Link to the code

Dataset used: obtained from the GitHub page of a publication, which can be found here.

Data Leakage

In this task, I ensured that no InChIKey present in the experimental dataset (used for performance evaluation) is included in the training dataset used to build the predictive model. The check found 7740 molecules from the external dataset that were also present in the training dataset, and I dropped them. Leaked molecules should always be removed, because evaluating on molecules the model saw during training biases the measured performance; removing them keeps the evaluation dataset independent of the training dataset. The leakage came from the public repositories where the data were obtained: both the experimental dataset and the training dataset used by the authors were drawn from ChEMBL and PubChem.
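
The leakage check described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the notebook's actual code: the function name `remove_leaked`, the `inchikey` field name, and the placeholder keys are all hypothetical.

```python
def remove_leaked(evaluation_rows, training_inchikeys):
    """Split evaluation rows into (kept, leaked) by InChIKey membership.

    evaluation_rows: list of dicts, each with an "inchikey" field
                     (hypothetical field name).
    training_inchikeys: iterable of InChIKeys seen during training.
    """
    train_keys = set(training_inchikeys)  # set gives O(1) membership tests
    kept, leaked = [], []
    for row in evaluation_rows:
        if row["inchikey"] in train_keys:
            leaked.append(row)   # molecule also present in the training data
        else:
            kept.append(row)     # independent molecule, safe to evaluate on
    return kept, leaked


# Placeholder keys for illustration; real InChIKeys are 27-character strings.
evaluation = [
    {"inchikey": "KEY-A", "activity": 1},
    {"inchikey": "KEY-B", "activity": 0},
    {"inchikey": "KEY-C", "activity": 1},
]
training = ["KEY-B", "KEY-Z"]

kept, leaked = remove_leaked(evaluation, training)
print(f"Molecules from the external dataset present in training: {len(leaked)}")
print(f"Molecules kept for evaluation: {len(kept)}")
```

Matching on InChIKey rather than on the raw SMILES string avoids missing duplicates that are merely written differently, since one molecule can have many valid SMILES.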

Summary Statistics

| Dataset | Molecules |
| -- | -- |
| Training Data | 12620 |
| Validation Data | 870 |

EVALUATION METRICS

The model is a classification model, so I used several evaluation metrics for classification: MCC, NPV, PPV, ACC, SEN, SPE, B-ACC and the AUROC curve.

| Data | Model | MCC | NPV | ACC | PPV | SPE | SEN | B-ACC | AUC SCORE |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Validation Dataset | eos2ta5 | 0.326 | 0.573 | 0.661 | 0.748 | 0.693 | 0.639 | 0.666 | 0.7 |

The PPV (0.748) indicates better performance on positive predictions, while the NPV (0.573) and the remaining evaluation metrics suggest moderate performance. The AUC score of 0.70 also shows that the model has some ability to distinguish between drugs that are hERG blockers and hERG non-blockers. Overall, these evaluation metrics show that the model performed moderately well and is able to identify hERG blockers.
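
To make the AUC interpretation above concrete: the AUROC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it directly from that definition. It is an illustration only, not the notebook's code; the function name `auroc` and the example labels and scores are hypothetical.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    y_true: 0/1 labels (1 = positive class).
    y_score: predicted scores, higher meaning more likely positive.
    Pairwise comparison is O(n_pos * n_neg), fine for small datasets.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.1]
print(f"AUROC: {auroc(labels, scores):.3f}")
```

Under this reading, a score of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.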

@DhanshreeA I have updated the Week 3 task here. Could you please have a look at it?

DhanshreeA commented 3 months ago

Hi @Ajoke23 any updates to report?

Ajoke23 commented 3 months ago

Yes, there is. It will be updated shortly. My system crashed during the weekend. I have fixed my laptop and I am in the process of creating the AUROC curve and evaluating the experimental dataset on various evaluation metrics.

Would you be able to give me feedback before the end of today? I would really appreciate it if this is possible. Thank you.

Ajoke23 commented 3 months ago

Hi @DhanshreeA @GemmaTuron, I have gone ahead and started preparing my final application. My project template is in progress and will be completed before the end of today. I have a question: my Week 3 task was updated today. Can I still go ahead and create a final application without feedback? Will updating the Week 3 task today affect my chances of getting accepted?
