ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Shreya #642

Closed alt-shreya closed 1 year ago

alt-shreya commented 1 year ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

GemmaTuron commented 1 year ago

Hi @alt-shreya

Welcome to Ersilia! Please make sure to complete the week 1 tasks before moving on to week 2. There is still time to catch up!

alt-shreya commented 1 year ago

@GemmaTuron thank you! Out of curiosity, does this page of the docs explain tasks we need to perform in Week 2? I wanted to complete the third task for Week 1 (test the simplest model) and that's when I got a little confused.

alt-shreya commented 1 year ago

@GemmaTuron I ran into an error and need your help.

I'm using Fedora 37 and trying to test the simplest model (Week 1 Task 4).

In order to fetch the model, I executed this command:

ersilia fetch retrosynthetic-accessibility 

and got this error:

/bin/sh: line 1: shasum: command not found
Command '<<address of the file>> shasum -a 256 data.h5;' returned non-zero exit status 127. 

I also checked the log files in order to understand where the error may have been, but the log file is empty.
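The failing step is a SHA-256 checksum of data.h5, and exit status 127 means the shell could not find the shasum executable. As a rough cross-check that does not depend on that tool, the same kind of digest can be computed with Python's standard library. This is only a sketch: the helper name sha256_of_file is hypothetical, and it is not how Ersilia itself performs the check.

```python
import hashlib
import shutil


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# shutil.which returns None when an executable is not on PATH,
# which is the situation behind "command not found" (exit status 127).
print("shasum available:", shutil.which("shasum") is not None)
```

Calling sha256_of_file("data.h5") on the downloaded file would give a value to compare against whatever checksum the model expects.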

EDIT: I realised that I was fetching a different model. The official contribution guide for this year clearly details the steps to run and test the models.

Completed instructions and calculated the molecular weight of C4 as 58.124

Screenshot from 2023-03-15 18-53-21
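The reported value can be checked by hand, assuming that "C4" refers to butane (SMILES CCCC, formula C4H10). That reading is an assumption, since the thread does not spell out the molecule. With standard atomic weights of 12.011 for carbon and 1.008 for hydrogen, the arithmetic is:

```python
# Standard atomic weights (g/mol) for the two elements involved
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008}


def molecular_weight(element_counts):
    """Sum atomic weights over a {element: count} mapping."""
    return sum(ATOMIC_WEIGHT[element] * count
               for element, count in element_counts.items())


# Assumed molecule: butane, C4H10
butane = {"C": 4, "H": 10}
print(round(molecular_weight(butane), 3))  # 4*12.011 + 10*1.008 = 58.124
```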

GemmaTuron commented 1 year ago

Hi @alt-shreya

Do you have git-lfs installed? It seems you were not able to download the .h5 file initially.
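One way to tell whether a file was really downloaded is to look at its first bytes: when Git LFS is not installed, the checkout contains a small text pointer that starts with a fixed "version" line instead of the binary content. The sketch below checks for that prefix. The function name is hypothetical and the path is whatever file is in doubt.

```python
# First line of every Git LFS pointer file, per the Git LFS pointer format
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

If is_lfs_pointer("data.h5") returns True, the real file was never fetched, and installing git-lfs before fetching again is the likely fix.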

alt-shreya commented 1 year ago

My Motivation to Work at Ersilia

Dear Gemma Turon,

In order to explain my motivation, I’d like to start this letter with an anecdote. When I was in kindergarten, my school conducted a free health checkup for the students. This was the first time I had been to a clinic. And I was terrified. To me, the unknown doctors behind their masks seemed like the minions of an unknown mastermind. Eighteen years later, I am glad to report the fear I once held has given way to a sense of deep respect for the healthcare industry. That is my primary motivation behind applying for this internship.

As an aspiring AI professional, I am eager to gain direct exposure to the ML industry while contributing to a cause that I am passionate about – making medical research more accessible. I resonate with your accessible and open approach to research, as well as your commitment to building sustainable collaborations with leading local entities. I believe that collaboration is essential to achieving breakthroughs in science and technology, and I am highly enthusiastic to join an open-source software community that shares this value.

Throughout my academic and personal journey, I have developed a strong interest in machine learning and artificial intelligence. During my Bachelor’s degree, I was determined to make the most of this interest to make an impact in the healthcare industry. I led a team of four to build a prototype for a contactless authentication device for a post-COVID world. This experience taught me important lessons on making technology accessible, combining technical skills with a passion for solving real-world problems, perseverance, and leadership.

Growing up in India, I have always understood the importance of making science affordable and accessible to everyone. As an intern at the Ersilia Open Source Initiative, I believe I would have the opportunity to contribute to this cause while honing my skills in machine learning under the guidance of experienced mentors. I am detail-oriented and have honed my research skills through hours of academic work. I also volunteered extensively during the COVID-19 pandemic to help raise funds for those in need. Having graduated into the workforce under extraordinary circumstances, I quickly learnt to thrive in remote work environments and enjoy collaborating with a diverse group of people.

In conclusion, I am honoured to have the opportunity to apply for the internship position at Ersilia Open Source Initiative, and I believe that my experience, skills, and passion align well with the values and goals of your non-profit organisation. Thank you for considering my application. I would be happy to discuss my candidature in detail via the following platforms:

Email: shreyakumar31@gmail.com
LinkedIn: https://www.linkedin.com/in/alt-shreya
GitHub: https://www.github.com/alt-shreya

alt-shreya commented 1 year ago

Week 2 Log

Main Tasks

Things that Motivate me

These tasks provide a hands-on glimpse of my responsibilities during the internship: finding, testing, debugging and implementing open source code within Ersilia.

Task Report

Task 1: Selecting a Model

Installing

conda create --name STOUT python=3.8 
conda activate STOUT
conda install -c decimer stout-pypi

The last step resulted in an error because some packages were incompatible, so I decided to try another approach, installing the repo from Git like so:

pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git

This successfully built STOUT.

However, when I tried to import it in Python, I got a module-not-found error. I realised I had to install the dependencies separately, a step I had inadvertently missed. I installed the requirements using

pip install -r Python_requirements.txt

This approach also led to some errors. From the error descriptions, I understood that the required version of TensorFlow was not installed. To fix this, I installed TensorFlow outside of my conda environment.
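Before reinstalling anything, it can help to confirm which versions the active environment actually sees. The sketch below uses only the standard library. The distribution names passed in are assumptions (in particular, "STOUT-pypi" is a guess at the pip name) and should be replaced with whatever Python_requirements.txt lists.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical list of packages to inspect; adjust to the requirements file
for name in ("tensorflow", "STOUT-pypi"):
    print(name, "->", installed_version(name) or "not installed")
```

Running this inside and outside the conda environment shows whether an import resolves to the environment's copy of a package or to a system-wide one.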

As of writing this, TensorFlow is still installing on my system. I will update this space here with further inputs.

Update 1: It failed to install due to compatibility errors

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.36=0
  - feature:|@/linux-64::__glibc==2.36=0

Your installed version is: 2.36

Possible Solution

I think this error arises because conda is bad at reporting conflicts. It is possible that some package required by STOUT pins a conflicting version, which leads to these errors. I will get to the bottom of this.

Next Steps

My next steps would be to

GemmaTuron commented 1 year ago

Hi @alt-shreya

Good start, let's complete these tasks before moving to week 3!

alt-shreya commented 1 year ago

@GemmaTuron thank you! Although my progress is slow and steady, I'm enthusiastic about the way forward!

Here are some updates:

Update 2: Resolving BadZipError

I added my errors to a log file:
bad_zip_error.txt

Update 3: Resolving Java Exception Error

On running the following command:

IUPAC_name = translate_forward(SMILES)

I ran into the following Java exception: Java_exception_error.txt

I think this has something to do with the Smiles Generator, but I'm not sure how to solve it.

This is the version of Java I'm running on my Fedora:

openjdk 17.0.4.1 2022-08-12
OpenJDK Runtime Environment (Red_Hat-17.0.4.1.1-3.fc37) (build 17.0.4.1+1)
OpenJDK 64-Bit Server VM (Red_Hat-17.0.4.1.1-3.fc37) (build 17.0.4.1+1, mixed mode, sharing)

@ZakiaYahya @masroor07 did you face similar errors while running the commands?

masroor07 commented 1 year ago

Hey @alt-shreya, are you providing the right input to the function when you call it?

alt-shreya commented 1 year ago

@masroor07 yes

GemmaTuron commented 1 year ago

Hi @alt-shreya

This line in the error log seems to indicate you are not passing the right input SMILES: An InChI could not be generated and used to canonise SMILES: null

alt-shreya commented 1 year ago

@GemmaTuron I used this command to write my input:


SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)

masroor07 commented 1 year ago


Make sure it is a valid SMILES string.


Could you try running it for a different SMILES string from the EML file and check whether you get the same error?

alt-shreya commented 1 year ago

Thank you @masroor07 and @GemmaTuron, I tried the same function with some other SMILES formulae, and they worked. I will be updating this space with my progress now.
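A quick syntactic check can catch mistyped input before it reaches the model. The sketch below only verifies that parentheses and square brackets are balanced and that every ring-closure label is opened and closed. It is not a chemical validator and is not part of STOUT, so a string that passes can still be rejected by the model. The function name is hypothetical.

```python
DIGITS = "0123456789"


def smiles_looks_well_formed(smiles):
    """Cheap syntax check: balanced (), [] and paired ring-closure labels."""
    if not smiles or smiles != smiles.strip():
        return False
    depth = 0            # current nesting depth of branch parentheses
    in_bracket = False   # True while inside a [...] bracket atom
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # digits inside brackets are isotopes or charges, not ring labels
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                return False
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            return False
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # two-digit ring-closure label such as %12
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or any(c not in DIGITS for c in label):
                return False
            open_rings ^= {int(label)}
            i += 2
        elif ch in DIGITS:
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not in_bracket and not open_rings


print(smiles_looks_well_formed("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))  # True
```

The caffeine string from this thread passes the check, which is consistent with the problem lying elsewhere than in the syntax of the input.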

alt-shreya commented 1 year ago

Update 4: Predicting EML file

After a little help from my friends who also happened to be working on this model, I finally made my first translation. Once again, it was the compound vanillin. Following is the code I entered:

from STOUT import translate_forward, translate_reverse
import pandas as pd
import numpy as np

# SMILES to IUPAC name translation
SMILES = "c1(C=O)cc(OC)c(O)cc1"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)

# IUPAC name to SMILES translation
IUPAC = "4-hydroxy-3-methoxybenzaldehyde"
SMILES = translate_reverse(IUPAC)
print("SMILES of "+IUPAC+" is: "+SMILES)

Here is the file with the outputs Output.txt

Then, I had only to perform the same tasks on all 422 rows of the eml_canonical.csv file. In order to do so, here are the steps I undertook:

Running Predictions for the EML

# load the EML file and collect the SMILES column as a list of strings
df = pd.read_csv('/filepath/eml_canonical.csv')
eml_smiles_list = df["smiles"].tolist()
IUPAC_list = []

# translate every SMILES entry in the CSV file to an IUPAC name
for smile in eml_smiles_list:
  IUPAC_name = translate_forward(smile)
  IUPAC_list.append(IUPAC_name)

# write the SMILES and their IUPAC names to a CSV
new_dict = {"smiles": eml_smiles_list, "IUPAC": IUPAC_list}
output = pd.DataFrame(new_dict)
output.to_csv('STOUT_OUT.csv')

As of now, the code is still running, perhaps due to the sheer number of entries in the file. More updates to follow:

My system ran out of memory. Will update with fewer entries.

Next Steps

If this doesn't get resolved in a few hours, I will create a separate file with fewer entries and run that through my for loop. This ought to give me a sample of outputs, enough to build an understanding of the STOUT model.
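A minimal sketch of that plan follows. It takes the translation function as a parameter, so in practice translate_forward would be passed in, and it writes each row to disk as soon as it is produced so that partial results survive a crash. The function name, the batch size, and the output file name are hypothetical. Writing incrementally does not by itself reduce the memory the model needs, so running on a small slice first is still the safer test.

```python
import csv


def translate_to_csv(smiles_list, translate, out_path, batch_size=25):
    """Translate SMILES batch by batch, writing rows to out_path as it goes.

    Returns the number of rows written (excluding the header).
    """
    written = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "IUPAC"])
        for start in range(0, len(smiles_list), batch_size):
            for smiles in smiles_list[start:start + batch_size]:
                try:
                    name = translate(smiles)
                except Exception as error:
                    # record the failure and keep going with the next molecule
                    name = f"ERROR: {error}"
                writer.writerow([smiles, name])
                written += 1
            handle.flush()  # rows from finished batches are now on disk
    return written


# Example call on the first 20 entries only (names from the code above):
# translate_to_csv(eml_smiles_list[:20], translate_forward, "STOUT_OUT_sample.csv")
```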

masroor07 commented 1 year ago

Hi @GemmaTuron, I tried looking for the SMILES "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" in the EML, but could not find it there.

alt-shreya commented 1 year ago

@masroor07 I think it's some form of caffeine, but yeah, it doesn't exist in the EML file. Probably because it is not an essential medicine?

The SMILES for caffeine is this: CN1C=NC2=C1C(=O)N(C(=O)N2C)C

alt-shreya commented 1 year ago

@GemmaTuron I am happy to report new updates after running the file through Google Colab:

Update 4.1: Running Predictions on EML File

I ran the same code on Colab and here are the results: STOUT_OUT_fwd.csv

The output file contains a Pandas dataframe with the regular SMILES form of each compound and its IUPAC name. I performed a similar experiment for the reverse translation as well. While I was waiting for the results to be processed, I was curious to know why this is considered a translation process and not a conversion process. I read up a bit and discovered that, from the point of view of formal languages, a SMILES string is treated as a word. This makes it natural to apply language-processing techniques to it.
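To make the "SMILES as language" idea concrete, here is a small sketch that splits a SMILES string into tokens, the way a sentence is split into words before translation. The regular expression is a simplified assumption written for illustration (bracket atoms, the two-letter halogens Cl and Br, two-digit ring labels, then single characters). It is not STOUT's actual tokenizer.

```python
import re

# Order matters: try multi-character tokens before falling back to one character
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


print(tokenize_smiles("c1(C=O)cc(OC)c(O)cc1"))
```

A sequence-to-sequence model can then map such a token sequence to the tokens of an IUPAC name, which is why the task is framed as translation.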

Next Steps:

alt-shreya commented 1 year ago

Update 5: Comparing the files

Presenting my findings in this table

| SMILES | Ersilia Model | STOUT Model |
| --- | --- | --- |
| CC(=O)N[C@@H](CS)C(O)=O | (2R)-2-acetamido-3-sulfanylpropanoicacid;;; | (2R)-2-acetamido-3-sulfanylpropanoicacid;;; |
| CC(=O)Oc1ccccc1C(O)=O | 2-acetyloxybenzoicacid;;; | 2-acetyloxybenzoicacid;;; |
| NC1=NC(=O)c2ncn(COCCO)c2N1 | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;; | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;; |
| NC(N)=NC(=O)c1nc(Cl)c(N)nc1N | ;3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide;;; | 3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide;;; |
| CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3 | 2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one;;; | (2-butyl-1-benzofuran-3-yl)-[4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl]methanone;; |
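
A comparison like this can be automated once the ";" padding in the raw outputs is stripped. The sketch below uses two of the rows reported above, copied verbatim, and lists the SMILES for which the two models disagree. The helper name is hypothetical, and the normalisation rule (strip surrounding whitespace and semicolons) is an assumption about what counts as the same name.

```python
def normalize_name(raw):
    """Strip surrounding whitespace and ';' padding from a model output."""
    return raw.strip().strip(";").strip()


# (SMILES, Ersilia Model output, STOUT Model output), copied from the table
rows = [
    ("CC(=O)Oc1ccccc1C(O)=O",
     "2-acetyloxybenzoicacid;;;",
     "2-acetyloxybenzoicacid;;;"),
    ("NC(N)=NC(=O)c1nc(Cl)c(N)nc1N",
     ";3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide;;;",
     "3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide;;;"),
]

# Keep only the rows where the two normalised names differ
mismatches = [smiles for smiles, ersilia, stout in rows
              if normalize_name(ersilia) != normalize_name(stout)]
print(f"{len(rows) - len(mismatches)} of {len(rows)} names agree")
print("disagreements:", mismatches)
```
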
GemmaTuron commented 1 year ago

Hi @alt-shreya Good job, let's move to week 3 tasks!

alt-shreya commented 1 year ago

Proposed Model 1: DTI2VEC

Model Name:

DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning

Model Summary

DTi2Vec identifies DTIs using network representation learning and ensemble learning techniques. It constructs the heterogeneous network, and then it automatically generates features for each drug and target using the nodes embedding technique. DTi2Vec is a simple yet effective method that provides high DTI prediction performance while being scalable and efficient in computation, translating into a powerful drug repositioning tool.

Datasets

Slug: dti2vec

Tag: machine-learning, cheminformatics, drug-repurposing, node2vec-embeddings, boosting-ensemble

Task: Classification

Package Dependencies:

Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459562/

Reference:

Thafar MA, Olayan RS, Albaradei S, Bajic VB, Gojobori T, Essack M, Gao X. DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning. J Cheminform. 2021 Sep 22;13(1):71. doi: 10.1186/s13321-021-00552-w. PMID: 34551818; PMCID: PMC8459562.

Source Code: GitHub Repository

License: Creative Commons

alt-shreya commented 1 year ago

Proposed Model 2: DeepCC

Model Name: DeepCC

Model Summary: a novel deep learning-based framework for cancer molecular subtype classification

Model Description:

A novel supervised cancer classification framework, deep cancer subtype classification (DeepCC), based on deep learning of functional spectra quantifying activities of biological pathways. DeepCC provides a novel cancer classification framework that is platform independent, robust to missing data, and can be used for single sample prediction facilitating clinical implementation of cancer molecular subtyping.

Datasets

Slug: deepcc

Tag: molecular subtyping, individualized therapy

Task: Classification

Language: R, Python

Package Dependencies

Publication

Source Code

GitHub Repository

Reference

Gao, F., Wang, W., Tan, M. et al. DeepCC: a novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 8, 44 (2019). https://doi.org/10.1038/s41389-019-0157-8

License

MIT License

alt-shreya commented 1 year ago

Proposed Model 3: LOTUS

Model Name: LOTUS

Model Summary

LOTUS is a new computational approach to identify genes with high oncogenic potential. It implements a machine learning approach to learn an oncogenic potential score from known driver genes, and brings two novelties compared to existing methods.

Model Description

LOTUS is a machine-learning based approach that can integrate various types of data in a versatile manner, including information about gene mutations and protein-protein interactions. In addition, LOTUS can predict cancer driver genes in a pan-cancer setting as well as for specific cancer types, using a multitask learning strategy to share information across cancer types.

Datasets

Slug: lotus-driver-genes

Tag: Mutation databases, Gene prediction, Machine learning

Task: Classification

Language: R

Package Dependencies: none listed

Publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007381

Source Code: GitHub Repository

Reference

O. Collier, V. Stoven and J.-P. Vert, LOTUS: a Single- and Multitask Machine Learning Algorithm for the Prediction of Cancer Driver Genes, Preprint. doi: https://doi.org/10.1101/398537

License: Creative Commons Attribution License

GemmaTuron commented 1 year ago

Hi @alt-shreya

Thanks for these suggestions. These are good models, but they are mostly focused on cancer, which is out of scope for us at the moment. Let's wrap up by preparing your final application.