ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Ajoke Yusuf #984

Closed Ajoke23 closed 3 months ago

Ajoke23 commented 4 months ago

Week 1 - Get to know the community

Week 2 - Get Familiar with Machine Learning for Chemistry

Week 3 - Validate a Model in the Wild

Week 4 - Prepare your final application

Ajoke23 commented 4 months ago

Week 1 DAY 1 (5th March, 2024)

DAY 2 (6th March, 2024)

DAY 3 (7th March, 2024)

- Then I ran `sudo docker ps`.
**Output:**

    CONTAINER ID   IMAGE                      COMMAND                  CREATED       STATUS       PORTS                   NAMES
    b458fce09d65   ersiliaos/eos4wt0:latest   "sh /root/docker-ent…"   3 hours ago   Up 3 hours   0.0.0.0:37183->80/tcp   eos4wt0_d75e

- I tested the fetched model, eos4wt0, by running the commands below:

    ersilia serve eos4wt0
    ersilia -v api run -i "CCCC"

**Output:** [model.log](https://github.com/ersilia-os/ersilia/files/14551529/model.log), which matches the output shown in the GitBook.

**DAY 4 (8th of March 2024)** 
**MOTIVATION STATEMENT**
I'm Ajoke Yusuf, a Data Scientist, Machine Learning enthusiast, and SDG 3 advocate. I'm a hardworking, resourceful, goal-oriented individual with strong analytical and problem-solving skills and an unending quest for knowledge. I pride myself on being a fast learner and have honed strong skills in problem-solving and research.
After contributing to Ersilia during the contribution stage in October 2023, I decided to apply again, hoping the Ersilia project would be listed, because I had a wonderful learning experience with an amazing community. When I received the Outreachy email and checked the project list, I scrolled down to the letter "E" to look for Ersilia and was excited to see the project there.

I chose the Ersilia project mainly because its aim and mission align with my goals and career objective as an impact maker and SDG 3 advocate. My interest in drug discovery was ignited by my personal experience with cerebral malaria, which almost took my life, and by the death of a friend from sickle cell disease.

As an Engineering graduate living in Nigeria, I find the increasing mortality from infectious diseases in Nigeria and sub-Saharan Africa alarming, which is why I developed an interest in the biomedical field. According to UNICEF (United Nations International Children's Emergency Fund), **_infectious disease is the major cause of mortality in children ≤ 5 years_**, as cited from this [article](https://data.unicef.org/topic/child-survival/under-five-mortality/). A report hosted by the NIH (National Institutes of Health) & NCBI (National Center for Biotechnology Information) states that _**"The infrastructure and level of support for surveillance, research, and training on emerging infectious diseases in Africa are extremely limited"**_, as cited from this [article](https://www.ncbi.nlm.nih.gov/books/NBK99567/#:~:text=At%20a%20time%20when%20increasing,%2C%20yellow%20fever%2C%20and%20trypanosomiasis).

With my knowledge of Python for Data Science, some knowledge of Machine Learning, and strong analytical and research skills, I believe that contributing to this project will help me gain knowledge and technical skills that will help advance and improve health research in Nigeria, Africa, and eventually the world.

**If accepted for the 3-month internship**, I'll commit myself to bringing suggestions, carrying out research, and collaborating with the Ersilia team while learning and honing my skills in Artificial Intelligence and Machine Learning. The internship will strengthen my research and problem-solving skills, which will be useful in the long run for advancing technology in the health sector and making a sustainable impact on health research in Nigeria. I firmly believe that the availability and accessibility of scientific tools and data-driven insights are necessary for solving prevalent health challenges. As a young lady living in Nigeria, an underdeveloped and low-income country, I have experienced the challenges of accessing tools for prevalent infectious diseases in my community and in the country at large.
**After the internship**, I plan to use the skills gained to improve and sustain health research tools, help solve prevalent health issues in Nigeria, and reduce the mortality caused by infectious diseases, thereby building sustainable research skills that will leave a long-lasting impact on the health sector in my community, Nigeria, sub-Saharan Africa, and eventually the world.

Ajoke23 commented 3 months ago

WEEK 2

TASK 1

Model selected - eos2ta5

Repository created - here

@DhanshreeA @GemmaTuron Please, I would appreciate any feedback.

TASK 2: MODEL REPRODUCIBILITY

A detailed explanation can be found in the repository above.

IMPLEMENTATION OF THE AUTHORS' MODEL

I took the following steps to implement the authors' source code in an Ubuntu terminal:

  1. I already had conda and its dependencies installed.

  2. Set up the cardiotox conda environment:

    # create a conda environment
    conda create -n cardiotox python=3.7.7
    # activate the environment
    conda activate cardiotox

  3. Install PyBioMed:

    cd cardiotox
    cd PyBioMed
    python setup.py install

  4. Return to the parent directory (cardiotox):

    cd ..

  5. Install the package versions the authors used:

    pip install tensorflow==2.3.1
    pip install sklearn==0.0
    pip install mordred==1.2.0
    pip install pybel==0.14.10
    pip install keras==2.4.3

  6. Test the model:

    python test.py

OUTPUT: Author's Source Code Output

RESULT COMPARISON OF THE CARDIOTOX & eos2ta5 MODEL

MODEL | MCC | NPV | ACC | PPV | SPE | SEN | B-ACC
-- | -- | -- | -- | -- | -- | -- | --
eos2ta5 | 0.599 | 0.688 | 0.818 | 0.893 | 0.786 | 0.833 | 0.810
Cardiotox | 0.599 | 0.688 | 0.810 | 0.893 | 0.786 | 0.833 | 0.810

MODEL | MCC | NPV | ACC | PPV | SPE | SEN | B-ACC
-- | -- | -- | -- | -- | -- | -- | --
eos2ta5 | 0.452 | 0.688 | 0.683 | 0.455 | 0.6 | 0.909 | 0.754
Cardiotox | 0.452 | 0.688 | 0.755 | 0.455 | 0.6 | 0.909 | 0.755
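To see at a glance where the two implementations disagree, the reported values can be compared metric by metric. The sketch below is only an illustration: the values are copied from the two tables above, while the helper name `differing_metrics` and the tolerance of 0.002 (chosen to absorb rounding in the third decimal) are assumptions, not part of the original analysis.

```python
# Metric values copied from the two comparison tables above.
TABLES = {
    "first table": {
        "eos2ta5": {"MCC": 0.599, "NPV": 0.688, "ACC": 0.818, "PPV": 0.893,
                    "SPE": 0.786, "SEN": 0.833, "B-ACC": 0.810},
        "Cardiotox": {"MCC": 0.599, "NPV": 0.688, "ACC": 0.810, "PPV": 0.893,
                      "SPE": 0.786, "SEN": 0.833, "B-ACC": 0.810},
    },
    "second table": {
        "eos2ta5": {"MCC": 0.452, "NPV": 0.688, "ACC": 0.683, "PPV": 0.455,
                    "SPE": 0.6, "SEN": 0.909, "B-ACC": 0.754},
        "Cardiotox": {"MCC": 0.452, "NPV": 0.688, "ACC": 0.755, "PPV": 0.455,
                      "SPE": 0.6, "SEN": 0.909, "B-ACC": 0.755},
    },
}


def differing_metrics(a, b, tol=0.002):
    """Return the names of metrics whose values differ by more than tol.

    tol is a hypothetical rounding allowance, not a value from the source.
    """
    return [name for name in a if abs(a[name] - b[name]) > tol]


if __name__ == "__main__":
    for label, rows in TABLES.items():
        diff = differing_metrics(rows["eos2ta5"], rows["Cardiotox"])
        print(label, "differs on:", diff)
```

With this tolerance, only ACC differs between the two implementations in either table; every other reported metric agrees to within rounding.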

Ajoke23 commented 3 months ago

Hi @DhanshreeA, I can see that you reacted to my comment here. Any feedback so far? My Week 2 Task 2 (model reproducibility) is still in progress, but I would appreciate any feedback on Week 2 Task 1.

Ajoke23 commented 3 months ago

Hi @DhanshreeA, I have completed Week 2 Task 2 (model reproducibility). I look forward to your feedback and advice, so that I can pass the same knowledge on to other contributors when we help each other.

A detailed README covering Week 2 (both Tasks 1 & 2) can be found here.

Ajoke23 commented 3 months ago

Hi @DhanshreeA, I am yet to get feedback on Task 2. Please, I would appreciate any feedback from you, since my Week 3 task depends on it. Thank you for your time.

DhanshreeA commented 3 months ago

Hi @Ajoke23, thank you for your patience, and great progress so far! A few comments:

  1. I see that you are using the RDKit module to standardize SMILES, but this only creates canonical SMILES. To standardize, you have to use the standardizer module, as shown in the example code in the model validation template we provided.
  2. I don't see an implementation of the model from the paper, only the use of the Ersilia implementation. Any reason why that's the case? Please point me to it if I have missed it.
Ajoke23 commented 3 months ago

Hi @DhanshreeA. Thank you for the feedback.

  1. I have corrected that. I imported the standardise module by adding the line `from standardiser import standardise`. I then reran all my code and uploaded the new ipynb file to the repository.

  2. It was an oversight; I thought I had added it. I have now added it here and also in the README of my repository.

Ajoke23 commented 3 months ago

Hi @DhanshreeA Can I move to week 3 now since I have made the necessary changes based on your feedback? I look forward to receiving feedback from you. Thank you for your time.

GemmaTuron commented 3 months ago

Thanks @Ajoke23, we will provide feedback within the day!

Ajoke23 commented 3 months ago

Thank you so much @GemmaTuron. I look forward to getting the feedback.

DhanshreeA commented 3 months ago

Hi @Ajoke23, this looks good to me. Please proceed to task 3, and I will review it on Monday. Thereafter you can work on your final application.

Ajoke23 commented 3 months ago

Alright @DhanshreeA. Thanks a lot for the feedback. I will start immediately.

Ajoke23 commented 3 months ago

WEEK 3: VALIDATE A MODEL IN THE WILD

Link to the code

Dataset used: it was obtained from a publication's GitHub page, which can be found here.

Data Leakage

In this task, I ensured that no InChIKey present in the experimental dataset (used for performance evaluation) is also present in the training dataset used to build the predictive model. While checking for data leakage, I found that the number of molecules from the external dataset also present in the training dataset was 7740, and I dropped them. Leaked data should always be removed so that the evaluation dataset is independent of the training dataset; otherwise the measured performance is biased. The leakage arose because both the experimental dataset and the authors' training dataset were obtained from the same public repositories, ChEMBL and PubChem.
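The leakage check described above amounts to a set-membership filter on InChIKeys. The following is a minimal sketch of that idea, not the code from the linked notebook: the function name `remove_leaked`, the `inchikey` field name, and the placeholder keys are all hypothetical and used for illustration only.

```python
def remove_leaked(evaluation_rows, training_rows, key="inchikey"):
    """Drop evaluation rows whose key also appears in the training rows.

    Returns the filtered evaluation rows and the number of rows dropped.
    The field name "inchikey" is an assumption for this sketch.
    """
    training_keys = {row[key] for row in training_rows}
    kept = [row for row in evaluation_rows if row[key] not in training_keys]
    return kept, len(evaluation_rows) - len(kept)


if __name__ == "__main__":
    # Placeholder keys, not real InChIKeys and not taken from the datasets.
    training = [{"inchikey": "KEY-A"}, {"inchikey": "KEY-B"}]
    evaluation = [
        {"inchikey": "KEY-B", "label": 1},  # also in training: leaked
        {"inchikey": "KEY-C", "label": 0},
        {"inchikey": "KEY-D", "label": 1},
    ]
    clean, n_leaked = remove_leaked(evaluation, training)
    print("leaked molecules dropped:", n_leaked)
    print("independent evaluation set size:", len(clean))
```

Building the training keys as a set keeps each lookup constant-time, which matters when the training data holds thousands of molecules.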

Summary Statistics

Dataset | Molecules
-- | --
Training Data | 12620
Validation Data | 870

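EVALUATION METRICS

The model is a classifier, so I used several evaluation metrics for classification models: MCC (Matthews correlation coefficient), NPV (negative predictive value), PPV (positive predictive value), ACC (accuracy), SEN (sensitivity), SPE (specificity), B-ACC (balanced accuracy), and the AUROC curve (area under the receiver operating characteristic curve).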

Data | Model | MCC | NPV | ACC | PPV | SPE | SEN | B-ACC | AUC SCORE
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
Validation Dataset | eos2ta5 | 0.326 | 0.573 | 0.661 | 0.748 | 0.693 | 0.639 | 0.666 | 0.7
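For reference, all of the threshold-based metrics in the table can be derived from the four confusion-matrix counts. The sketch below uses the standard definitions; the function names and the toy counts are illustrative assumptions and do not reproduce the confusion matrix behind the reported numbers.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = blocker, 0 = non-blocker)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes each of the four denominators below is non-zero.
    """
    sen = tp / (tp + fn)                      # sensitivity (recall)
    spe = tn / (tn + fp)                      # specificity
    ppv = tp / (tp + fp)                      # positive predictive value
    npv = tn / (tn + fn)                      # negative predictive value
    acc = (tp + tn) / (tp + fp + tn + fn)     # accuracy
    b_acc = (sen + spe) / 2                   # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"MCC": mcc, "NPV": npv, "ACC": acc, "PPV": ppv,
            "SPE": spe, "SEN": sen, "B-ACC": b_acc}


if __name__ == "__main__":
    # Toy counts for illustration only, not the real validation results.
    for name, value in classification_metrics(tp=40, fp=10, tn=35, fn=15).items():
        print(f"{name}: {value:.3f}")
```

As a quick consistency check on the table, the reported B-ACC of 0.666 is the mean of the reported SPE (0.693) and SEN (0.639), as the definition requires.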

The PPV & NPV metrics indicate better performance in certain aspects of classification, while the remaining evaluation metrics suggest moderate performance. The AUC score of 0.70 also shows that the model has some ability to distinguish between drugs that are hERG blockers and hERG non-blockers. Overall, these evaluation metrics show that the model performed moderately well and is able to identify hERG blockers.

@DhanshreeA I have updated the Week 3 task here. Please, can you have a look at it?
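An AUC score can be read as the probability that a randomly chosen blocker receives a higher predicted score than a randomly chosen non-blocker, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below computes it directly from that pairwise definition; the function name `auroc` and the toy scores are assumptions for illustration, and the quadratic loop is only suitable for small datasets.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and scores, not model outputs from the validation dataset.
    labels = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
    print(f"AUROC: {auroc(labels, scores):.3f}")
```

In the toy data, 7 of the 9 positive-negative pairs are ordered correctly, so the result is 7/9.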

DhanshreeA commented 3 months ago

Hi @Ajoke23 any updates to report?

Ajoke23 commented 3 months ago

Yes, there is. It will be updated shortly. My system crashed over the weekend. I have fixed my laptop and am in the process of creating the AUROC curve and evaluating the experimental dataset with various evaluation metrics.

Would you be able to give me feedback before the end of today? Please, I would really appreciate it if this is possible. Thank you.

Ajoke23 commented 3 months ago

Hi @DhanshreeA @GemmaTuron, I have gone ahead and started preparing my final application. My project template is in progress and will be completed before the end of today. I have a question: my Week 3 task was only updated today. Can I still go ahead and create a final application without feedback? Will updating the Week 3 task today affect my chances of getting accepted?