Closed: @dzumii closed this issue 3 months ago.
I encountered a "No module found: rdkit" error while trying to fetch eos3b5e, but I was able to resolve it with `pip install rdkit`. Eventually, I was able to install all the dependencies and Ersilia, and test the simple models.
Well-done @dzumii.
Thank you @Ajoke23
After I completed my bachelor's degree in Microbiology, I developed an interest in the technology industry. This pushed me to start learning. I started with user interface design but soon discovered I was not really good at combining colors and all that. I was driven to go back and research other fields in technology, and I discovered data science. I was fascinated by how data can be used to generate insights that can drive growth. I acquired relevant skills and certifications in data science and started looking for a job opportunity. Unfortunately, because the tech field is dominated by men and organizations in my home country would also prefer someone with a background in computer science-related courses, I couldn't secure a job quickly. Persistence, they say, wears out resistance. Because I persisted, I was able to get an internship role as a data analyst. The systemic, cultural, and gender bias I faced during my job search and even while at work was one of the reasons I applied to Outreachy.
Within two years on this job, I acquired many skills in data cleaning, analysis, visualization, and reporting. I was soon converted to a permanent staff member and eventually made the team leader. Along the way, I received a scholarship from the World Bank for a postgraduate study in Bioinformatics. I thought it would be good to use my analytical skills to analyze biological data. So I accepted the offer. As a postgraduate student in bioinformatics, I learned to use computational tools to analyze DNA, RNA, and protein sequences. Because I encountered a lot of CLI (command line interface) tools and some work best on Linux, I learned to use WSL (as a Windows user). I also had to use Docker images a lot. I also learned about machine learning and AI approaches used in diagnosis, modeling, and drug discovery. I have worked on some machine learning projects, including regression, classification, clustering, and even deep learning cases.
After I was selected for the contribution phase, I went through the project list and couldn't find one that matched my interest until I got to Ersilia. Ersilia’s quest to make AI/ML models accessible, especially to low- and middle-income countries, thereby lowering the barrier to drug discovery and supporting researchers working on neglected diseases, is really commendable, and I would love to be a part of that. With a solid foundation in data analytics and machine learning, coupled with my background in microbiology and bioinformatics, I believe I could contribute my skills to Ersilia’s quest.
My plans during the internship are to comb through Ersilia’s Model Hub, understand the methodology, and help incorporate more models. I am aware I will encounter errors and roadblocks. I am ready for the challenge and also believe this will help to sharpen my machine learning skills. After the internship, I hope to keep contributing to the Ersilia Model Hub and eventually transition to the Machine Learning/Artificial Intelligence Engineer career path.
My submission for Week 2 Task 1 and a detailed explanation can be found here. I selected one of the models, downloaded 1000 molecules represented in SMILES format, and made predictions with the model. I await feedback and corrections so I can proceed to Task 2. @DhanshreeA @GemmaTuron
Hi @dzumii, good work so far! A few comments:
@DhanshreeA Thank you so much for taking your time to review what I have done. I will get started on the corrections.
@DhanshreeA I have made a few changes. If there's still something I haven't done well, I would appreciate your feedback. I look forward to your response, so I can confidently help others with their tasks. Thank you
Hi @dzumii great work so far, thank you for the updates! Some comments:
Can you explain the function:
```python
import numpy as np

def calc_stats(pred_array, true_array, insol_thresh=-6, sol_thresh=-4):
    '''
    This function will calculate the following on the predicted array:
    Hit% = #correct(lower_sol_thresh,upper_sol_thresh) / #(lower_sol_thresh,upper_sol_thresh)
    Fail% = #true(insol_thresh)pred(lower_sol_thresh,upper_sol_thresh) / #pred(lower_sol_thresh,upper_sol_thresh)
    Assumptions: pred_array, true_array are paired numpy arrays.
    '''
    # first we need to access the examples which have true in (lower_sol_thresh, upper_sol_thresh)
    true_mask = (true_array > sol_thresh)
    # calculating the Hit%
    num_true = len(true_array[true_mask])
    poss_hits = pred_array[true_mask]
    num_hits = np.sum(poss_hits > sol_thresh)
    hit = num_hits / float(num_true)
    # calculating the Fail%
    pred_mask = (pred_array > sol_thresh)
    insol_mask = true_array <= insol_thresh
    fail = np.sum(insol_mask & pred_mask) / float(np.sum(pred_mask))
    return hit, fail, np.sum(true_mask), np.sum(pred_mask)
```
Eg, I don't see the lower_sol_thresh and upper_sol_thresh anywhere?
Thank you @DhanshreeA
Hello @DhanshreeA, concerning the third point you raised: I have renamed my notebooks, merged Week2_Task2_2.ipynb and Week 2_Task2_3 into Model_Reproducibility.ipynb, and renamed Week2_Task1_3to4 to Model_Bias, so the repository is easier to understand and navigate.
@DhanshreeA Concerning the first point you raised, the notebook (now renamed to Model_Bias) has been updated to include MSE and RMSE values for the dataset. Find the repository here
Hi @dzumii great progress so far. Have you managed to find an external dataset for model validation? If you have not, that is fine. I would suggest only spending time on this till Monday when we will review the issue finally for updates. At that point you can create your final application.
@DhanshreeA Yes, I have found an external dataset with experimental values for the model validation. I was waiting for you to give me the go-ahead. I will proceed and validate the model. Thank you for your time and efforts so far.
Hi @dzumii
Great, as we are approaching the end of the application period, please focus on this last task:
And then start working on your final application; do not open new tasks, in order to have time to make a strong application. Thanks!
Well noted. Thank you @GemmaTuron
@DhanshreeA @GemmaTuron I have validated the model on an external dataset I found in the literature, and the coefficient of determination (R2) is 0.88. I ran a check to remove the molecules present in the training dataset from the external dataset, to be sure there was no data leakage. I then ran predictions with the eos6oli model and calculated the R2 score (because the target output is a continuous numerical value). I have uploaded the notebooks and the dataset used, and explained the whole process in the README file in the repository
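The two steps described above (dropping molecules shared with the training set, then scoring the remainder) could be sketched roughly as follows. This is only an illustration: the identifiers and values are made up, and in practice `sklearn.metrics.r2_score` (or a similar library function) would usually do the scoring.

```python
import numpy as np

def drop_overlap(external, training_keys):
    """Remove external molecules whose identifier also appears in training."""
    training_keys = set(training_keys)
    return {k: v for k, v in external.items() if k not in training_keys}

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical identifiers -> experimental values; "BBB" also appears in the
# training set, so it is dropped to avoid data leakage before scoring.
external = {"AAA": -3.1, "BBB": -5.2, "CCC": -1.0}
clean = drop_overlap(external, training_keys=["BBB"])
```

After the overlap check, predictions on the remaining molecules are compared against the experimental values with `r2_score`.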
Hi @dzumii
Good job and nicely explained, thanks. As an extension task, you might want to plot the two sets of molecules (training set and external validation set) in a PCA to see if they overlap. You can simply use Morgan fingerprints and apply a PCA to 2 dimensions with the sklearn package.
Also, start working on your final application.
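The PCA suggestion above could be sketched like this. For simplicity the example uses numpy's SVD instead of `sklearn.decomposition.PCA`, and random bit matrices stand in for real Morgan fingerprints (which would come from RDKit, e.g. its Morgan fingerprint generator); the shapes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for Morgan fingerprints: rows are molecules, columns are bits.
train_fp = rng.integers(0, 2, size=(50, 64)).astype(float)
valid_fp = rng.integers(0, 2, size=(20, 64)).astype(float)

# Stack both sets, centre, and project onto the top 2 principal components.
X = np.vstack([train_fp, valid_fp])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T          # shape: (n_molecules, 2)

train_xy, valid_xy = coords[:50], coords[50:]
# train_xy and valid_xy can now be scatter-plotted to check for overlap.
```

Plotting `train_xy` and `valid_xy` in different colors shows whether the validation set occupies the same region of chemical space as the training set.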
@GemmaTuron Thank you for your time and feedback. I will try my hand at your suggestion while I also start working on my final application.
Hi @dzumii, good job throughout. A few comments:
This is just advice for the future, no need to work on this right now. Please prioritize submitting your final application. Hope you had fun learning! 🥳
Thank you for taking the time to review my submission again, @DhanshreeA . I was still trying to read more and investigate so I could understand and interpret the PCA result well before reporting back to you on the PCA task. As advised, I will focus on my final application now, and as soon as that is sorted out, I will make adjustments and respond to your feedback.
Hello @DhanshreeA, I have taken time to revisit your feedback on my last submission. To address points 1 and 2, I redid the PCA plot using InChIKeys instead of SMILES, and there was still an overlap. I read more about Morgan fingerprints and PCA plots, and I came across a publication where the authors went further and calculated the Tanimoto similarity after their PCA plots overlapped, so I calculated the Tanimoto similarity as well. Even though the validation and training sets overlapped in the PCA, the average Tanimoto similarity between them was 0.085, with most pairwise similarities distributed between 0 and 0.4. This indicates that the molecules used in validation are structurally distinct from those in the training set, and that the external dataset is therefore suitable for validation. So I think it is a good validation. Please correct me if I am wrong or missing anything.
I also addressed point number 3 and added summary statistics to my notebook.
Thank you again, it has been a nice time learning with Ersilia.
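The Tanimoto similarity analysis described above could be sketched as below. This is a minimal illustration on plain bit vectors; for real fingerprints RDKit provides ready-made similarity functions, and `mean_cross_tanimoto` is a hypothetical helper for averaging over all validation/training pairs.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints: |a AND b| / |a OR b|."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    # Convention: two all-zero fingerprints are treated as identical.
    return np.logical_and(a, b).sum() / union if union else 1.0

def mean_cross_tanimoto(valid_fps, train_fps):
    """Average similarity over every validation/training pair (hypothetical helper)."""
    sims = [tanimoto(v, t) for v in valid_fps for t in train_fps]
    return sum(sims) / len(sims)
```

A low average cross-set similarity (such as the 0.085 reported above) supports the claim that the validation molecules are structurally distinct from the training set.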
Week 1 - Get to know the community
Week 2 - Get Familiar with Machine Learning for Chemistry
Week 3 - Validate a Model in the Wild
Week 4 - Prepare your final application