alt-shreya commented 1 year ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Write a motivation statement to work at Ersilia
[X] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[x] Compare results with the Ersilia Model Hub implementation!

Week 3 - Propose new models

[x] Suggest a new model and document it (1)
[x] Suggest a new model and document it (2)
[x] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

GemmaTuron commented 1 year ago

Hi @alt-shreya

Welcome to Ersilia! Please, make sure to complete week 1 tasks before moving on to week 2, there is still time to catch up!

alt-shreya commented 1 year ago

@GemmaTuron thank you! Out of curiosity, does this page of the docs explain tasks we need to perform in Week 2? I wanted to complete the third task for Week 1 (test the simplest model) and that's when I got a little confused.

alt-shreya commented 1 year ago

@GemmaTuron I ran into an error and need your help.

I'm using Fedora 37 and trying to test the simplest model (Week 1, Task 3).

In order to fetch the model, I executed this command:

ersilia fetch retrosynthetic-accessibility

and got this error:

/bin/sh: line 1: shasum: command not found
Command '<<address of the file>> shasum -a 256 data.h5;' returned non-zero exit status 127.

I also checked the log files in order to understand where the error may have been, but the log file is empty.
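For anyone hitting the same message: exit status 127 from /bin/sh means the shell could not find the command, so the checksum step never ran. As a rough sketch (not part of Ersilia, and the function name is my own), the same SHA-256 digest that `shasum -a 256 data.h5` would print can be computed with Python's standard library, which is handy for checking a downloaded file by hand:

```python
import hashlib
import shutil


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large model files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Reports whether the shasum executable is available on PATH.
print("shasum on PATH:", shutil.which("shasum") is not None)
```

Calling `sha256_of_file("data.h5")` on the downloaded file gives a value that can be compared against the expected checksum, if one is published.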

EDIT: I realised that I was fetching a different model. The official contribution guide for this year clearly details the steps to run and test the models.

Completed instructions and calculated the molecular weight of C4 as 58.124
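As a sanity check on that number, here is a small sketch of the arithmetic. It assumes "C4" refers to butane (C4H10, SMILES CCCC), which is my reading rather than something stated in the guide, and it uses standard atomic weights rounded to three decimals:

```python
# Standard atomic weights in g/mol, rounded to three decimals.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008}


def alkane_molecular_weight(n_carbons):
    """Molecular weight of a linear alkane with formula CnH(2n+2)."""
    n_hydrogens = 2 * n_carbons + 2
    return n_carbons * ATOMIC_WEIGHT["C"] + n_hydrogens * ATOMIC_WEIGHT["H"]


# Butane: 4 * 12.011 + 10 * 1.008
print(round(alkane_molecular_weight(4), 3))
```

With these weights the sum is 48.044 + 10.080 = 58.124, which matches the value the model returned.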

Screenshot from 2023-03-15 18-53-21

GemmaTuron commented 1 year ago

Hi @alt-shreya

Do you have git-lfs installed? It seems you were not able to download the .h5 file initially.

alt-shreya commented 1 year ago

My Motivation to Work at Ersilia

Dear Gemma Turon,

In order to explain my motivation, I’d like to start this letter with an anecdote. When I was in kindergarten, my school conducted a free health checkup for the students. This was the first time I had been to a clinic, and I was terrified. To me, the unknown doctors behind their masks seemed like the minions of an unknown mastermind. Eighteen years later, I am glad to report that the fear I once held has given way to a deep respect for the healthcare industry. That is my primary motivation for applying for this internship.

As an aspiring AI professional, I am eager to gain direct exposure to the ML industry while contributing to a cause that I am passionate about – making medical research more accessible. I resonate with your accessible and open approach to research, as well as your commitment to building sustainable collaborations with leading local entities. I believe that collaboration is essential to achieving breakthroughs in science and technology, and I am highly enthusiastic to join an open-source software community that shares this value.

Throughout my academic and personal journey, I have developed a strong interest in machine learning and artificial intelligence. During my Bachelor’s degree, I was determined to make the most of this interest to make an impact in the healthcare industry. I led a team of four to build a prototype for a contactless authentication device for a post-COVID world. This experience taught me important lessons on making technology accessible, combining technical skills with a passion for solving real-world problems, perseverance, and leadership.

Growing up in India, I have always understood the importance of making science affordable and accessible to everyone. As an intern at the Ersilia Open Source Initiative, I believe I would have the opportunity to contribute to this cause while honing my skills in machine learning under the guidance of experienced mentors. I am detail-oriented and have honed my research skills through hours of academic work. Additionally, I also volunteered extensively during the COVID-19 pandemic to help raise funds for those in need. Having graduated into the workforce under extraordinary circumstances, I quickly learnt to thrive in remote work environments and enjoy collaborating with a diverse group of people.

In conclusion, I am honoured to have the opportunity to apply for the internship position at Ersilia Open Source Initiative, and I believe that my experience, skills, and passion align well with the values and goals of your non-profit organisation. Thank you for considering my application. I would be happy to discuss my candidature in detail via the following platforms:

Email: shreyakumar31@gmail.com LinkedIn: https://www.linkedin.com/in/alt-shreya GitHub: https://www.github.com/alt-shreya

alt-shreya commented 1 year ago

Week 2 Log

Main Tasks

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[ ] Compare results with the Ersilia Model Hub implementation!

Things that Motivate me

These tasks provide a hands-on glimpse of my responsibilities during the internship; finding, testing, debugging and implementing open source code within Ersilia.

Task Report

Task 1: Selecting a Model

At first glance, STOUT seems really interesting. When I was in high school trying to understand IUPAC naming conventions, I wondered what it would be like to have an automatic name generator which could make life easy for me.
For context, IUPAC names are a standard set of names for compounds decided by the International Union of Pure and Applied Chemistry. Under the IUPAC convention, vanillin, the compound that lends vanilla essence its flavour, is named 4-hydroxy-3-methoxybenzaldehyde.
SMILES strings are a line notation that represents the structure of a compound. The SMILES for vanillin would be c1(C=O)cc(OC)c(O)cc1.
Aside from personal life, I think this could be significant in minimising errors in drug discovery.
I am curious and highly motivated to implement this.
Taking a deeper dive into the research paper, I am intrigued by their approach towards representing chemical entities and string conversion.
I'd like to see this in action.

Installing

Step 1: I created a separate conda environment by running the following commands:

conda create --name STOUT python=3.8 
conda activate STOUT
conda install -c decimer stout-pypi

The last command resulted in an error because some packages were incompatible. So I decided to try another approach, installing the repo from Git, like so:

pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git

This successfully built STOUT.

However, when I tried to import STOUT in Python, it gave me a module not found error. I realised I had to install the dependencies separately, something I had inadvertently missed. I installed the requirements using

pip install -r Python_requirements.txt

This approach also led to some errors. On reading the error descriptions, I understood that they were caused by the proper version of TensorFlow not being installed. To fix this, I installed TensorFlow out of my conda environment.

As of writing this, TensorFlow is still installing on my system. I will update this space here with further inputs.

Update1: It failed to install due to Compatibility Errors

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.36=0
  - feature:|@/linux-64::__glibc==2.36=0

Your installed version is: 2.36

Possible Solution

I suspect the message is misleading, since conda often reports dependency conflicts poorly. The glibc lines probably mask a version conflict in one of STOUT's dependencies. I will get to the bottom of this.
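One way to narrow this down is to list which versions are actually installed in the active environment. The following is a minimal sketch using only the standard library. The package names in the loop are my assumptions about what is relevant here (for example, that STOUT is distributed as `STOUT-pypi`), so adjust them as needed:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed package names; change these to the ones in Python_requirements.txt.
for name in ("tensorflow", "STOUT-pypi", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Comparing this output with the versions pinned in the requirements file should show which package is out of step.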

Next Steps

My next steps would be to

run the sample code provided on the STOUT documentation
run predictions for the EML
ask fellow applicants who selected this model about errors they faced.
ask how to log errors into a file from inside a python IDE inside terminal (Fedora 37)
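On the last point, one option I could try is Python's built-in `logging` module, which can write full tracebacks to a file from any interpreter session. This is only a sketch: the logger name and the helper functions are hypothetical names of my own, not something from STOUT or Ersilia.

```python
import logging


def make_file_logger(log_path):
    """Create a logger that writes messages and tracebacks to log_path."""
    logger = logging.getLogger("stout_debug")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def run_and_log(logger, func, *args):
    """Call func(*args); on failure, record the traceback and return None."""
    try:
        return func(*args)
    except Exception:
        # logger.exception appends the current traceback to the message.
        logger.exception("call to %s failed", func.__name__)
        return None
```

For example, `run_and_log(make_file_logger("errors.log"), translate_forward, SMILES)` would leave any traceback in errors.log instead of only on screen.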

GemmaTuron commented 1 year ago

Hi @alt-shreya

Good start, let's complete these tasks before moving to week 3!

alt-shreya commented 1 year ago

@GemmaTuron thank you! Although my progress is slow and steady, I'm enthusiastic about the way forward!

Here are some updates:

Update 2: Resolving BadZipError

I added my errors to a log file:
bad_zip_error.txt

There seems to be a Bad_Zip_Error and something wrong with TensorFlow. Apparently the models get downloaded, but not extracted.
I deleted the model files and reinstalled them through the import command. That populated the models folder in my home directory. Seems to be okay
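For future reference, a downloaded archive can be checked before extraction with the standard `zipfile` module. This is a sketch under the assumption that the model download is an ordinary zip archive. The function name is my own.

```python
import zipfile


def archive_is_intact(path):
    """Return True if path is a readable zip archive whose members pass CRC checks."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        return archive.testzip() is None
```

If this returns False for the downloaded file, the download was probably interrupted or incomplete, and deleting it and fetching it again is a reasonable next step.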

Update 3: Resolving Java Exception Error

On running the following command, IUPAC_name = translate_forward(SMILES)

I run into the following Java exception: Java_exception_error.txt

I think this has something to do with the Smiles Generator, but I'm not sure how to solve it.

This is the version of Java I'm running on my Fedora:

openjdk 17.0.4.1 2022-08-12
OpenJDK Runtime Environment (Red_Hat-17.0.4.1.1-3.fc37) (build 17.0.4.1+1)
OpenJDK 64-Bit Server VM (Red_Hat-17.0.4.1.1-3.fc37) (build 17.0.4.1+1, mixed mode, sharing)

@ZakiaYahya @masroor07 did you face similar errors while running the commands?

masroor07 commented 1 year ago

Hey @alt-shreya, are you providing the right input to the function when you call it?

alt-shreya commented 1 year ago

@masroor07 yes

GemmaTuron commented 1 year ago

Hi @alt-shreya

This line in the error log seems to indicate you are not passing the right input SMILES: An InChI could not be generated and used to canonise SMILES: null

alt-shreya commented 1 year ago

@GemmaTuron I used this command to write my input:


SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)

masroor07 commented 1 year ago

@GemmaTuron I used this command to write my input:
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)

Make sure it is a valid SMILES string.

@GemmaTuron I used this command to write my input:
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)

Could you try running it for a different SMILES string from the EML file and check whether you get the same error?
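A quick structural check before calling the model can rule out empty or mangled input. The sketch below is only a basic sanity check written for illustration (the function name is made up). It does not validate the chemistry, which would need a cheminformatics toolkit.

```python
def looks_like_smiles(text):
    """Basic sanity check: non-empty, no whitespace, balanced () and []."""
    if not isinstance(text, str) or not text:
        return False
    closing = {")": "(", "]": "["}
    stack = []
    for char in text:
        if char.isspace():
            return False
        if char in "([":
            stack.append(char)
        elif char in closing:
            # A closing bracket must match the most recent opening one.
            if not stack or stack.pop() != closing[char]:
                return False
    return not stack


print(looks_like_smiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))
```

A string that passes this check can still fail inside the model, so trying a second known-good SMILES remains the more informative test.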

alt-shreya commented 1 year ago

Thank you @masroor07 and @GemmaTuron, I tried the same function with some other SMILES formulae, and they worked. I will be updating this space with my progress now.

alt-shreya commented 1 year ago

Update 4: Predicting EML file

After a little help from my friends who also happened to be working on this model, I finally made my first translation. Once again, it was the compound vanillin. Following is the code I entered:

from STOUT import translate_forward, translate_reverse
import pandas as pd
import numpy as np

# SMILES to IUPAC name translation

SMILES = "c1(C=O)cc(OC)c(O)cc1"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)

# IUPAC name to SMILES translation

IUPAC = "4-hydroxy-3-methoxybenzaldehyde"
SMILES = translate_reverse(IUPAC)
print("SMILES of "+IUPAC+" is: "+SMILES)

Here is the file with the outputs Output.txt

Then, I had only to perform the same tasks on all 422 rows of the eml_canonical.csv file. In order to do so, here are the steps I undertook:

Running Predictions for the EML

import pandas as pd
from STOUT import translate_forward

# read the EML file and convert its "smiles" column to a list of strings
df = pd.read_csv('/filepath/eml_canonical.csv')
eml_smiles = df["smiles"]
eml_smiles_list = eml_smiles.tolist()
IUPAC_list = []

# translate every SMILES entry in the CSV file to an IUPAC name
for smile in eml_smiles_list:
    IUPAC_name = translate_forward(smile)
    IUPAC_list.append(IUPAC_name)

# write the SMILES and their IUPAC names to a CSV file
new_dict = {"smiles": eml_smiles_list, "IUPAC": IUPAC_list}
output = pd.DataFrame(new_dict)
output.to_csv('STOUT_OUT.csv')

As of now, the code is still running, perhaps due to the sheer number of entries in the file. More updates to follow:

My system ran out of memory. Will update with fewer entries.

Next Steps

If this doesn't get resolved in a few hours, I will create a separate file with fewer entries and run that through my for loop. This ought to give me a sample of outputs, enough to build an understanding of the STOUT model.
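One way to do this without losing work is to translate in small batches and write each result to disk as it is produced, so a crash partway through still leaves a usable partial file. The sketch below assumes the translation function is passed in as a parameter. The helper name and the batch size are my own choices, and this would not help if the model itself is what exhausts memory.

```python
import csv


def translate_in_batches(smiles_list, translate, out_path, batch_size=25):
    """Translate SMILES in batches, writing each row to out_path immediately."""
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "IUPAC"])
        for start in range(0, len(smiles_list), batch_size):
            batch = smiles_list[start:start + batch_size]
            for smiles in batch:
                writer.writerow([smiles, translate(smiles)])
                written += 1
            # Flush after every batch so partial results survive a crash.
            handle.flush()
    return written
```

With STOUT installed, this would be called as `translate_in_batches(eml_smiles_list, translate_forward, "STOUT_OUT.csv")`.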

masroor07 commented 1 year ago

Thank you @masroor07 and @GemmaTuron, I tried the same formula with some other SMILES formulae, and they worked. I will be updating this space with my progress now.

Hi @GemmaTuron , I tried looking for the SMILE "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" in the EML, could not find it there.

alt-shreya commented 1 year ago

Hi @GemmaTuron , I tried looking for the SMILE "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" in the EML, could not find it there.

@masroor07 I think it's some form of caffeine, but yeah, it doesn't exist in the EML file. Probably because it is not an essential medicine?

The SMILES for caffeine is this: CN1C=NC2=C1C(=O)N(C(=O)N2C)C

alt-shreya commented 1 year ago

@GemmaTuron I am happy to report new updates after running the file through Google Colab:

Update 4.1: Running Predictions on EML File

I performed the same code on Colab and here are the results: STOUT_OUT_fwd.csv

The output file contains a Pandas dataframe with the SMILES form of each compound and its IUPAC name. I performed a similar experiment for the reverse translation as well. While I was waiting for the results to be processed, I was curious to know why this is considered a translation process and not a conversion process. I read up a bit and discovered that, from the point of view of formal language theory, a SMILES string is treated as a word in a language. This makes it natural to apply language-processing techniques to it, much like translating between two written languages.
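To make the "SMILES as language" idea concrete, here is a small sketch that splits a SMILES string into tokens, the way a sentence is split into words before translation. The regular expression is a simplified pattern I put together for illustration. It is not STOUT's actual tokenizer and does not cover every SMILES feature.

```python
import re

# Bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, then bond and branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#\-+/\\.:@*~$]"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is skipped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


print(tokenize_smiles("c1(C=O)cc(OC)c(O)cc1"))
```

Each token plays the role of a word, and the model learns to map a sequence of such tokens to the sequence of characters in the IUPAC name.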

Next Steps:

Running the same steps through Ersilia's implementation of STOUT
Comparing the difference between the two models

alt-shreya commented 1 year ago

Update 5: Comparing the files

Presenting my findings in this table

| SMILES | Ersilia Model | STOUT Model |
| --- | --- | --- |
| `CC(=O)N[C@@H](CS)C(O)=O` | (2R)-2-acetamido-3-sulfanylpropanoicacid;;; | (2R)-2-acetamido-3-sulfanylpropanoicacid;;; |
| `CC(=O)Oc1ccccc1C(O)=O` | 2-acetyloxybenzoicacid;;; | 2-acetyloxybenzoicacid;;; |
| `NC1=NC(=O)c2ncn(COCCO)c2N1` | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;; | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;; |
| `NC(N)=NC(=O)c1nc(Cl)c(N)nc1N` | ;3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide;;; | 3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide;;; |
| `CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3` | 2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one;;; | (2-butyl-1-benzofuran-3-yl)-[4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl]methanone;;; |
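The comparison itself can be automated. The sketch below uses the five rows from this table as (Ersilia, STOUT) pairs, strips the stray semicolons that appear around the names, and counts exact matches. The function names are my own, and exact string equality is a strict criterion, since two different names can describe the same compound.

```python
def normalise(name):
    """Remove surrounding whitespace and stray semicolons from a name."""
    return name.strip().strip(";").strip()


def compare_names(pairs):
    """Split (ersilia, stout) name pairs into matching and differing lists."""
    matches, mismatches = [], []
    for ersilia_name, stout_name in pairs:
        pair = (normalise(ersilia_name), normalise(stout_name))
        (matches if pair[0] == pair[1] else mismatches).append(pair)
    return matches, mismatches


rows = [
    ("(2R)-2-acetamido-3-sulfanylpropanoicacid;;;",
     "(2R)-2-acetamido-3-sulfanylpropanoicacid;;;"),
    ("2-acetyloxybenzoicacid;;;", "2-acetyloxybenzoicacid;;;"),
    ("2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;;",
     "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;;"),
    (";3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide;;;",
     "3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide;;;"),
    ("2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one;;;",
     "(2-butyl-1-benzofuran-3-yl)-[4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl]methanone;;"),
]

matches, mismatches = compare_names(rows)
print(f"{len(matches)} matches, {len(mismatches)} mismatches")
```

On these five rows the first three names agree and the last two differ, so the two implementations do not always return the same prediction.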

GemmaTuron commented 1 year ago

Hi @alt-shreya Good job, let's move to week 3 tasks!

alt-shreya commented 1 year ago

Proposed Model 1: DTI2VEC

Model Name:

DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning

Model Summary

DTi2Vec identifies DTIs using network representation learning and ensemble learning techniques. It constructs the heterogeneous network, and then it automatically generates features for each drug and target using the nodes embedding technique. DTi2Vec is a simple yet effective method that provides high DTI prediction performance while being scalable and efficient in computation, translating into a powerful drug repositioning tool.

Datasets

Nuclear Receptor dataset (nr),
G-protein-coupled receptor (gpcr),
Ion Channel (ic),
Enzyme (e)
FDA_DrugBank (DrugBank)

Each of these datasets includes all the required drug-target interaction data (in adjacency matrix and edge list formats), along with drug-drug similarity and target-target similarity data (in square matrix format).

Slug: dti2vec

Tag: machine-learning, cheminformatics, drug-repurposing, node2vec-embeddings, boosting-ensemble

Task: Classification

Package Dependencies:

gensim (for node2vec code)
numpy
Scikit-learn
imblearn
pandas
xgboost

Publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459562/

Reference:

Thafar MA, Olayan RS, Albaradei S, Bajic VB, Gojobori T, Essack M, Gao X. DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning. J Cheminform. 2021 Sep 22;13(1):71. doi: 10.1186/s13321-021-00552-w. PMID: 34551818; PMCID: PMC8459562.

Source Code: GitHub Repository

License: Creative Commons

alt-shreya commented 1 year ago

Proposed Model 2: DeepCC

Model Name: DeepCC

Model Summary: a novel deep learning-based framework for cancer molecular subtype classification

Model Description:

A novel supervised cancer classification framework, deep cancer subtype classification (DeepCC), based on deep learning of functional spectra quantifying activities of biological pathways. DeepCC provides a novel cancer classification framework that is platform independent, robust to missing data, and can be used for single sample prediction facilitating clinical implementation of cancer molecular subtyping.

Datasets

TRANSBIG, UNT, UPP, and NKI
TCGA CRC set - level 3 RNASeq data
MXNet

Slug: deepcc

Tag: molecular subtyping, individualized therapy

Task: Classification

Language: R, Python

Package Dependencies

keras
R (version 3.3, or higher)
C
MXNet (version 0.10, or higher)

Publication

Source Code

GitHub Repository

Reference

Gao, F., Wang, W., Tan, M. et al. DeepCC: a novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 8, 44 (2019). https://doi.org/10.1038/s41389-019-0157-8

License

MIT License

alt-shreya commented 1 year ago

Proposed Model 3: LOTUS

Model Name: LOTUS

Model Summary

LOTUS is a new computational approach to identify genes with high oncogenic potential. It implements a machine learning approach to learn an oncogenic potential score from known driver genes, and brings two novelties compared to existing methods.

Model Description

LOTUS is a machine-learning-based approach that can integrate various types of data in a versatile manner, including information about gene mutations and protein-protein interactions. In addition, LOTUS can predict cancer driver genes in a pan-cancer setting as well as for specific cancer types, using a multitask learning strategy to share information across cancer types.

Datasets

MutSigCV
TUSON
20/20+
Cancer Census Gene (v86)

Slug: lotus-driver-genes

Tag: Mutation databases, Gene prediction, Machine learning

Task: Classification

Language: R

Package Dependencies: none listed

Publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007381

Source Code: GitHub Repository

Reference

O. Collier, V. Stoven and J.-P. Vert, LOTUS: a Single- and Multitask Machine Learning Algorithm for the Prediction of Cancer Driver Genes, Preprint. doi: https://doi.org/10.1101/398537

License: Creative Commons Attribution License

GemmaTuron commented 1 year ago

Hi @alt-shreya

Thanks for these suggestions. These are good models, but they are mostly focused on cancer, which is out of scope for us at the moment. Let's wrap up by preparing your final application.

ersilia-os / ersilia

✍️ Contribution period: Shreya #642
