ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

Contribution period: ADEYEMI Jumoke Olantun #982

Closed: dzumii closed this issue 3 months ago

dzumii commented 4 months ago

Week 1 - Get to know the community

Week 2 - Get Familiar with Machine Learning for Chemistry

Week 3 - Validate a Model in the Wild

Week 4 - Prepare your final application

dzumii commented 4 months ago

While trying to fetch eos3b5e, I encountered a "No module named 'rdkit'" error, but I was able to resolve it with pip install rdkit. Eventually, I was able to install all the dependencies, install Ersilia, and test the simple models. [Screenshot: model fetched via Docker, 2024-03-05]

Ajoke23 commented 4 months ago

Well done, @dzumii.

dzumii commented 4 months ago

Thank you @Ajoke23

dzumii commented 4 months ago

After I completed my bachelor's degree in Microbiology, I developed an interest in the technology industry, which pushed me to start learning. I began with user interface design but soon discovered I was not very good at combining colors. That drove me to research other fields in technology, and I discovered data science. I was fascinated by how data can be used to generate insights that drive growth. I acquired relevant skills and certifications in data science and started looking for a job opportunity. Unfortunately, because the tech field is dominated by men and organizations in my home country prefer candidates with a background in computer science-related courses, I could not secure a job quickly. Persistence, they say, wears out resistance. Because I persisted, I was able to get an internship role as a data analyst. The systemic, cultural, and gender bias I faced during my job search, and even while at work, was one of the reasons I applied to Outreachy.

Within two years in this job, I acquired many skills in data cleaning, analysis, visualization, and reporting. I was soon converted to a permanent staff member and eventually made team leader. Along the way, I received a scholarship from the World Bank for postgraduate study in Bioinformatics. I thought it would be good to use my analytical skills to analyze biological data, so I accepted the offer. As a postgraduate student in bioinformatics, I learned to use computational tools to analyze DNA, RNA, and protein sequences. Because I encountered a lot of CLI (command line interface) tools, some of which work best on Linux, I learned to use WSL (as a Windows user), and I also worked with Docker images a lot. I learned about the machine learning and AI approaches used in diagnosis, modeling, and drug discovery, and I have worked on several machine learning projects, including regression, classification, clustering, and even deep learning cases.

After I was selected for the contribution phase, I went through the project list and could not find one that matched my interests until I got to Ersilia. Ersilia's quest to make AI/ML models accessible, especially to low- and middle-income countries, thereby lowering the barrier to drug discovery and supporting researchers working on these neglected diseases, is really commendable, and I would love to be a part of it. With a solid foundation in data analytics and machine learning, coupled with my background in microbiology and bioinformatics, I believe I can contribute my skills to Ersilia's quest.

My plans during the internship are to comb through Ersilia's Model Hub, understand the methodology, and help incorporate more models. I am aware I will encounter errors and roadblocks; I am ready for the challenge and believe it will help sharpen my machine learning skills. After the internship, I hope to keep contributing to the Ersilia Model Hub and eventually transition to a Machine Learning/Artificial Intelligence Engineer career path.

dzumii commented 4 months ago

My submission for Week 2 Task 1 and a detailed explanation can be found here. I selected one of the models, downloaded 1000 molecules represented in SMILES format, and made predictions with the model. I await feedback and corrections so I can proceed to Task 2. @DhanshreeA @GemmaTuron

DhanshreeA commented 4 months ago

Hi @dzumii, good work so far! A few comments:

  1. Instead of just keeping the top 1,000 molecules from the ChEMBL dataset you have sourced, try sampling it randomly, just to make sure we are not accidentally introducing any bias into the dataset (a short sketch of one way to do this follows the list below).
  2. You have made two scatter plots (index vs. solubility, and SMILES vs. solubility). This is not how scatter plots work: both the x and y components should be numerical quantities that actually carry meaning, so plotting against SMILES strings is not right.
  3. Can you comment on the results you have obtained with respect to the distribution of logS values from the model and the dataset you have found?
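A short sketch of one way to do the random sampling, assuming the ChEMBL export is a CSV file; the file and column names here are placeholders, not the actual dataset:

    # Draw a reproducible random sample of 1000 molecules instead of keeping the top rows.
    # "chembl_solubility.csv" and "canonical_smiles" are illustrative names.
    import pandas as pd

    df = pd.read_csv("chembl_solubility.csv")
    sample = df.sample(n=1000, random_state=42)   # random, reproducible sample
    sample[["canonical_smiles"]].to_csv("chembl_sample_1000.csv", index=False)
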
dzumii commented 4 months ago

@DhanshreeA Thank you so much for taking your time to review what I have done. I will get started on the corrections.

dzumii commented 3 months ago

@DhanshreeA I have made a few changes:

  1. I looked for another dataset that has experimentally determined solubility values so I could compare it with the predictions from the model in scatter and residual plots (a rough sketch of this comparison follows the list).
  2. I also went further and reproduced outputs and figures from the publication to check its reproducibility, and commented on that too.
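A rough sketch of how such a comparison could be plotted, assuming paired experimental and predicted logS values; the file and column names are illustrative:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("experimental_vs_predicted.csv")   # illustrative file name
    residuals = df["logS_exp"] - df["logS_pred"]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # predicted vs experimental solubility
    ax1.scatter(df["logS_exp"], df["logS_pred"], s=10, alpha=0.6)
    ax1.set_xlabel("Experimental logS")
    ax1.set_ylabel("Predicted logS")

    # residuals against the experimental values
    ax2.scatter(df["logS_exp"], residuals, s=10, alpha=0.6)
    ax2.axhline(0, color="grey", linewidth=1)
    ax2.set_xlabel("Experimental logS")
    ax2.set_ylabel("Residual (experimental - predicted)")

    plt.tight_layout()
    plt.show()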

If there's still something I haven't done well, I would appreciate your feedback. I look forward to your response, so I can confidently help others with their tasks. Thank you

DhanshreeA commented 3 months ago

Hi @dzumii, great work so far, thank you for the updates! Some comments:

  1. Can you report here, in addition to the MAE, the MSE/RMSE values you get for the dataset with experimental values?
  2. Can you explain the function:

    def calc_stats(pred_array, true_array, insol_thresh=-6, sol_thresh=-4):
        '''
        This function will calculate the following on the predicted array:
            Hit% = #correct(lower_sol_thresh,upper_sol_thresh) / #(lower_sol_thresh,upper_sol_thresh)
            Fail% = #true(insol_thresh)pred(lower_sol_thresh,upper_sol_thresh) / #pred(lower_sol_thresh,upper_sol_thresh)

        Assumptions: pred_array, true_array are paired numpy arrays.
        '''

        # first we need to access the examples which have true in (lower_sol_thresh, upper_sol_thresh)
        true_mask = (true_array > sol_thresh)

        # calculating the Hit%
        num_true = len(true_array[true_mask])
        poss_hits = pred_array[true_mask]
        num_hits = np.sum((poss_hits > sol_thresh))
        hit = num_hits / float(num_true)

        # calculating the Fail%
        pred_mask = (pred_array > sol_thresh)
        insol_mask = true_array <= insol_thresh
        fail = np.sum(insol_mask & pred_mask) / float(np.sum(pred_mask))

        return hit, fail, np.sum(true_mask), np.sum(pred_mask)

    E.g., I don't see lower_sol_thresh and upper_sol_thresh defined anywhere?

  3. What is the difference between the Week2_Task2_2 and Week 2_Task2_3 notebooks? I see the plots are the same, so I am not sure I understood.
dzumii commented 3 months ago

Thank you @DhanshreeA

  1. MSE: 1.28 and RMSE: 1.13 are the values I got comparing the experimental values to the predicted values (a small sketch of the calculation is included after this list). I will upload an updated version of the notebook to my repository if that is fine.
  2. I referenced the authors' figure-generation code as found in their GitHub repository. The code was used to generate the distribution of predictions as a histogram, plus sensitivity and false discovery rate charts. The calc_stats function takes four parameters: 'pred_array' (array of predicted values), 'true_array' (array of experimental values), 'insol_thresh' (threshold value for insolubility, -6 by default), and 'sol_thresh' (threshold value for solubility, -4 by default). The function calculates the hit rate (percentage of correct predictions) and the fail rate (percentage of incorrect predictions) based on the set thresholds.
  3. In Week2_Task2_2, I used SolTranNet, the authors' model, to run predictions on the SC2 dataset they used and reproduced their results, which turned out the same as what I saw in the paper, indicating the model is reproducible. In Week 2_Task2_3, I used eos6oli from the Ersilia Model Hub to reproduce the same output using the same dataset, and the output was also the same. Do you want me to merge the two notebooks? I separated them by task.
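A small sketch of how these metrics can be computed with scikit-learn; the arrays below are illustrative, not the actual data:

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    # illustrative paired arrays of experimental and predicted logS values
    y_true = np.array([-2.1, -4.5, -3.3, -6.0])
    y_pred = np.array([-2.4, -4.1, -3.0, -5.2])

    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
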
dzumii commented 3 months ago

Hello @DhanshreeA, concerning the third point you raised: I have renamed my notebooks, merging Week2_Task2_2.ipynb and Week 2_Task2_3 into Model_Reproducibility.ipynb, and Week2_Task1_3to4 has been renamed to Model_Bias, so it is easier to understand and navigate the repository.

dzumii commented 3 months ago

@DhanshreeA concerning the first point you raised, the notebook (now renamed to Model_Bias) has been updated to include the MSE and RMSE values for the dataset. Find the repository here.

DhanshreeA commented 3 months ago

Hi @dzumii, great progress so far. Have you managed to find an external dataset for model validation? If you have not, that is fine. I would suggest spending time on this only until Monday, when we will do a final review of the issue for updates. At that point you can create your final application.

dzumii commented 3 months ago

@DhanshreeA Yes, I have found an external dataset with experimental values for the model validation. I was waiting for you to give me the go-ahead. I will proceed and validate the model. Thank you for your time and effort so far.

GemmaTuron commented 3 months ago

Hi @dzumii

Great, as we are approaching the end of the application period, please focus on this last task:

And then start working on your final application; do not open new tasks, so that you have time to make a strong application. Thanks!

dzumii commented 3 months ago

Well noted. Thank you @GemmaTuron

dzumii commented 3 months ago

@DhanshreeA @GemmaTuron I have validated the model against an external dataset I found in the literature, and the coefficient of determination (R2) is 0.88. I ran a check to remove the molecules present in the training dataset from the external dataset, to be sure there was no data leakage. I then ran predictions with the eos6oli model and calculated the R2 score (because the target output is a continuous numerical value). I have uploaded the notebooks and the dataset used, and explained the whole process in the README file in the repository.
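For reference, a rough sketch of the validation step described above; the file and column names are assumptions, not the actual repository files:

    import pandas as pd
    from rdkit import Chem
    from sklearn.metrics import r2_score

    train = pd.read_csv("training_set.csv")      # eos6oli training data (assumed file name)
    external = pd.read_csv("external_set.csv")   # external experimental dataset (assumed file name)

    def canonical(smiles):
        """Canonical SMILES, or None if the string cannot be parsed."""
        mol = Chem.MolFromSmiles(smiles)
        return Chem.MolToSmiles(mol) if mol is not None else None

    # drop molecules that also appear in the training set to avoid data leakage
    train_keys = set(train["smiles"].map(canonical))
    external["canonical"] = external["smiles"].map(canonical)
    external = external[~external["canonical"].isin(train_keys)]

    # "logS_pred" is assumed to hold the eos6oli predictions for the remaining molecules
    r2 = r2_score(external["logS_exp"], external["logS_pred"])
    print(f"R2 = {r2:.2f}")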

GemmaTuron commented 3 months ago

Hi @dzumii

Good job and nicely explained, thanks. As an extension task, you might want to plot the two sets of molecules (training set and external validation set) in a PCA to see if they overlap. You can simply use Morgan fingerprints and apply a PCA to 2 dimensions with the sklearn package.
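A minimal sketch of this check; the two SMILES lists are assumed to be available, and all variable names are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.decomposition import PCA

    def morgan_matrix(smiles_list, radius=2, n_bits=2048):
        """Morgan fingerprints as a (n_molecules, n_bits) numpy array."""
        rows = []
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                continue
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
            arr = np.zeros((n_bits,), dtype=np.int8)
            DataStructs.ConvertToNumpyArray(fp, arr)
            rows.append(arr)
        return np.array(rows)

    train_fps = morgan_matrix(train_smiles)   # training-set SMILES (assumed list)
    valid_fps = morgan_matrix(valid_smiles)   # validation-set SMILES (assumed list)

    # project both sets into the same 2-dimensional PCA space
    pca = PCA(n_components=2)
    coords = pca.fit_transform(np.vstack([train_fps, valid_fps]))

    n_train = len(train_fps)
    plt.scatter(coords[:n_train, 0], coords[:n_train, 1], s=10, alpha=0.5, label="training")
    plt.scatter(coords[n_train:, 0], coords[n_train:, 1], s=10, alpha=0.5, label="validation")
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.legend()
    plt.show()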

Also start working on your final application

dzumii commented 3 months ago

@GemmaTuron Thank you for your time and feedback. I will try my hand at your suggestion while I also start working on my final application.

DhanshreeA commented 3 months ago

Hi @dzumii, good job throughout. A few comments:

  1. From your PCA plot it seems like there is quite some overlap between the validation dataset and the training dataset, i.e. they are quite similar. Do you think this is a good-quality validation?
  2. When checking for overlapping molecules, the InChIKey is used instead of the SMILES string itself.
  3. Just a piece of advice: it is very useful to provide summary statistics of the datasets when presenting results, e.g. how many molecules are in the training data, how many are in the validation data, and how many overlap.

This is just advice for the future, no need to work on it right now. Please prioritize submitting your final application. Hope you had fun learning! 🥳

dzumii commented 3 months ago

Thank you for taking the time to review my submission again, @DhanshreeA . I was still trying to read more and investigate so I could understand and interpret the PCA result well before reporting back to you on the PCA task. As advised, I will focus on my final application now, and as soon as that is sorted out, I will make adjustments and respond to your feedback.

dzumii commented 3 months ago

Hello @DhanshreeA I have taken time to revisit your feedback on my last submission. To address points 1 and 2, I redid the PCA plot using InChIKeys instead of SMILES, and there was still an overlap. I read more about Morgan fingerprints and PCA plots, and came across a publication where the authors went further and calculated the Tanimoto similarity after their PCA plots overlapped, so I calculated the Tanimoto similarity as well (a rough sketch of the calculation is below). Even though the validation and training sets overlapped in the PCA, the average Tanimoto similarity between them was 0.085, with most pairwise similarities distributed between 0 and 0.4, which indicates that the molecules used in validation are structurally diverse from those in the training set. This suggests the dataset is indeed suitable for validation, so I think it is a good validation. Please correct me if I am wrong or missing anything.
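A rough sketch of the Tanimoto comparison, assuming the two SMILES lists are available; variable names are illustrative:

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def morgan_fps(smiles_list, radius=2, n_bits=2048):
        """Morgan fingerprint bit vectors for all parseable SMILES."""
        mols = (Chem.MolFromSmiles(s) for s in smiles_list)
        return [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits)
                for m in mols if m is not None]

    train_fps = morgan_fps(train_smiles)   # training-set SMILES (assumed list)
    valid_fps = morgan_fps(valid_smiles)   # validation-set SMILES (assumed list)

    # for each validation molecule, Tanimoto similarity against every training fingerprint
    similarities = []
    for fp in valid_fps:
        similarities.extend(DataStructs.BulkTanimotoSimilarity(fp, train_fps))

    print(f"Average Tanimoto similarity: {np.mean(similarities):.3f}")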

I also addressed point number 3 and added summary statistics to my notebook.

Thank you again, it has been a nice time learning with Ersilia.