Closed alt-shreya closed 1 year ago
Hi @alt-shreya
Welcome to Ersilia! Please make sure to complete the week 1 tasks before moving on to week 2. There is still time to catch up!
@GemmaTuron thank you! Out of curiosity, does this page of the docs explain tasks we need to perform in Week 2? I wanted to complete the third task for Week 1 (test the simplest model) and that's when I got a little confused.
@GemmaTuron I ran into an error and need your help.
I'm using Fedora 37 and trying to test the simplest model (Week 1 Task 4).
In order to fetch the model, I executed this command:
ersilia fetch retrosynthetic-accessibility
and got this error:
/bin/sh: line 1: shasum: command not found
Command '<<address of the file>> shasum -a 256 data.h5;' returned non-zero exit status 127.
I also checked the log files in order to understand where the error may have been, but the log file is empty.
EDIT: I realised that I was fetching a different model. The official contribution guide for this year clearly details the steps to run and test the models.
Completed the instructions and calculated the molecular weight of C4 as 58.124.
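As a quick sanity check on that number, the arithmetic can be reproduced by hand: 4 × 12.011 + 10 × 1.008 = 58.124. This is only a sketch. It assumes that "C4" refers to butane (SMILES CCCC, formula C4H10) and uses standard atomic weights for carbon and hydrogen; the helper name `molecular_weight` is made up for illustration and is not part of Ersilia.

```python
# Standard atomic weights (g/mol); only the two elements needed here.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008}


def molecular_weight(composition):
    """Sum atomic weights for a {element: count} composition."""
    return sum(ATOMIC_WEIGHTS[element] * count
               for element, count in composition.items())


# Assumption: "C4" is butane, C4H10.
butane = {"C": 4, "H": 10}
print(round(molecular_weight(butane), 3))
```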
Hi @alt-shreya
Do you have git-lfs installed? It seems you were not able to download the .h5 file initially.
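For anyone hitting the same `shasum: command not found` message, a small check like the one below may help narrow things down. It is only a sketch: it looks for the `shasum` and `git-lfs` executables on the PATH and computes a SHA-256 digest in pure Python, which is the same kind of digest `shasum -a 256` reports. The function names are hypothetical and not part of the Ersilia CLI.

```python
import hashlib
import shutil


def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # An empty list means both tools were found.
    print("Missing tools:", missing_tools(["shasum", "git-lfs"]))
```

If `shasum` shows up as missing, that would explain exit status 127, which is what the shell returns when a command cannot be found.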
Dear Gemma Turon,
In order to explain my motivation, I’d like to start this letter with an anecdote. When I was in kindergarten, my school conducted a free health checkup for the students. This was the first time I had been to a clinic. And I was terrified. To me, the unknown doctors behind their masks seemed like the minions of an unknown mastermind. Eighteen years later, I am glad to report the fear I once held has given its way to a sense of deep respect for the healthcare industry. That is my primary motivation behind applying for this internship.
As an aspiring AI professional, I am eager to gain direct exposure to the ML industry while contributing to a cause that I am passionate about – making medical research more accessible. I resonate with your accessible and open approach to research, as well as your commitment to building sustainable collaborations with leading local entities. I believe that collaboration is essential to achieving breakthroughs in science and technology, and I am highly enthusiastic to join an open-source software community that shares this value.
Throughout my academic and personal journey, I have developed a strong interest in machine learning and artificial intelligence. During my Bachelor’s degree, I was determined to make the most of this interest to make an impact in the healthcare industry. I led a team of four to build a prototype for a contactless authentication device for a post-COVID world. This experience taught me important lessons on making technology accessible, combining technical skills with a passion for solving real-world problems, perseverance, and leadership.
Growing up in India, I have always understood the importance of making science affordable and accessible to everyone. As an intern at the Ersilia Open Source Initiative, I believe I would have the opportunity to contribute to this cause while honing my skills in machine learning under the guidance of experienced mentors. I am detail-oriented and have honed my research skills through hours of academic work. Additionally, I also volunteered extensively during the COVID-19 pandemic to help raise funds for those in need. Having graduated into the workforce under extraordinary circumstances, I quickly learnt to thrive in remote work environments and enjoy collaborating with a diverse group of people.
In conclusion, I am honoured to have the opportunity to apply for the internship position at Ersilia Open Source Initiative, and I believe that my experience, skills, and passion align well with the values and goals of your non-profit organisation. Thank you for considering my application. I would be happy to discuss my candidature in detail via the following platforms:
Email: shreyakumar31@gmail.com
LinkedIn: https://www.linkedin.com/in/alt-shreya
GitHub: https://www.github.com/alt-shreya
These tasks provide a hands-on glimpse of my responsibilities during the internship: finding, testing, debugging and implementing open source code within Ersilia.
At first glance, STOUT seems really interesting. When I was in high school trying to understand IUPAC naming conventions, I wondered what it would be like to have an automatic name generator which could make life easy for me.
For context, IUPAC names are a standard set of names for compounds decided by the International Union of Pure and Applied Chemistry. Under the IUPAC convention, vanillin, the compound that lends vanilla essence its flavour, is named 4-hydroxy-3-methoxybenzaldehyde.
SMILES strings are a line notation that represents the structure of a compound. The SMILES for vanillin would be c1(C=O)cc(OC)c(O)cc1.
Aside from personal life, I think this could be significant in minimising errors in drug discovery.
I am curious and highly motivated to implement this.
Taking a deeper dive into the research paper, I am intrigued by their approach towards representing chemical entities and string conversion.
I'd like to see this in action.
conda create --name STOUT python=3.8
conda activate STOUT
conda install -c decimer stout-pypi
The last step resulted in an error as there were some incompatible packages. So I decided to try another approach, installing the repository from GitHub, like so:
pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git
This successfully built STOUT.
However, when I tried to import STOUT in Python, I got a module-not-found error. I realised I had to install the dependencies separately, something I had inadvertently missed. I installed the requirements using
pip install -r Python_requirements.txt
This approach also led to some errors. On reading the error descriptions, I understood that they were caused by the proper version of TensorFlow not being installed. To fix this, I installed TensorFlow outside my conda environment.
As of writing this, TensorFlow is still installing on my system. I will update this space here with further inputs.
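A quick way to see which dependencies are still missing from the active environment, before running anything heavy, is to ask Python whether it can locate each module. This is a minimal sketch; the list of module names is my assumption about the relevant top-level imports, and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: these are the top-level import names of interest.
    print("Not importable:", missing_modules(["STOUT", "tensorflow", "pandas"]))
```

This only checks that a module can be located in the current environment. It does not check that the installed version is the one STOUT expects.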
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions
The following specifications were found to be incompatible with your system:
- feature:/linux-64::__glibc==2.36=0
- feature:|@/linux-64::__glibc==2.36=0
Your installed version is: 2.36
I think this message is misleading because conda is poor at reporting the real source of a conflict. It is possible that some dependency of STOUT pins an incompatible version, which leads to these errors. I will get to the bottom of this.
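The `__glibc` entries in the message refer to a virtual package that conda uses to describe the C library detected on the system. To confirm the same version independently of conda, something like the following sketch could be used. It relies on `platform.libc_ver()` from the standard library; `describe_libc` is a hypothetical helper name.

```python
import platform


def describe_libc():
    """Return a short description of the C library Python is linked against."""
    name, version = platform.libc_ver()
    if not name:
        # libc_ver() returns empty strings when it cannot determine the library.
        return "C library could not be determined on this platform"
    return f"{name} {version}"


if __name__ == "__main__":
    print(describe_libc())
```

If the version printed here matches the one in the conda message, the glibc lines are probably just reporting the system state and the real conflict lies in one of the requested packages.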
My next steps would be to
Hi @alt-shreya
Good start, let's complete these tasks before moving to week 3!
@GemmaTuron thank you! Although my progress is slow and steady, I'm enthusiastic about the way forward!
Here are some updates:
I added my errors to a log file:
bad_zip_error.txt
models folder in my home directory. Seems to be okay. On running the following command,
IUPAC_name = translate_forward(SMILES)
I run into the following Java exception: Java_exception_error.txt
I think this has something to do with the Smiles Generator, but I'm not sure how to solve it.
This is the version of Java I'm running on Fedora:
openjdk 17.0.4.1 2022-08-12
OpenJDK Runtime Environment (Red_Hat-17.0.4.1.1-3.fc37) (build 17.0.4.1+1)
OpenJDK 64-Bit Server VM (Red_Hat-17.0.4.1.1-3.fc37) (build 17.0.4.1+1, mixed mode, sharing)
@ZakiaYahya @masroor07 did you face similar errors while running the commands?
Hey @alt-shreya, are you providing the right input to the function when you call it?
@masroor07 yes
Hi @alt-shreya
This line in the error log seems to indicate you are not passing the right input SMILES: An InChI could not be generated and used to canonise SMILES: null
@GemmaTuron I used this command to write my input:
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
Make sure it is a valid smile
Could you try running it for a different SMILES from the EML file and check whether you get the same error?
Thank you @masroor07 and @GemmaTuron, I tried the same function with some other SMILES formulae, and they worked. I will be updating this space with my progress now.
After a little help from my friends who also happened to be working on this model, I finally made my first translation. Once again, it was the compound vanillin. Following is the code I entered:
from STOUT import translate_forward, translate_reverse
import pandas as pd
import numpy as np

# SMILES to IUPAC name translation
SMILES = "c1(C=O)cc(OC)c(O)cc1"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)
# IUPAC name to SMILES translation
IUPAC = "4-hydroxy-3-methoxybenzaldehyde"
SMILES = translate_reverse(IUPAC)
print("SMILES of "+IUPAC+" is: "+SMILES)
Here is the file with the outputs Output.txt
Then I only had to perform the same task on all 422 rows of the eml_canonical.csv file. To do so, here are the steps I undertook:
# reading the CSV and converting the SMILES column to a list of strings
df = pd.read_csv('/filepath/eml_canonical.csv')
eml_smiles = df["smiles"]
eml_smiles_list = eml_smiles.tolist()
IUPAC_list = []
# for loop to convert every entry in the CSV file to an IUPAC name
for smile in eml_smiles_list:
    IUPAC_name = translate_forward(smile)
    IUPAC_list.append(IUPAC_name)
# writing the SMILES and their IUPAC names to a CSV
new_dict = {"smiles": eml_smiles_list, "IUPAC": IUPAC_list}
output = pd.DataFrame(new_dict)
output.to_csv('STOUT_OUT.csv')
As of now, the code is still running, perhaps due to the sheer number of entries in the file. More updates to follow.
My system ran out of memory. Will update with fewer entries.
If this doesn't get resolved in a few hours, I will create a separate file with fewer entries and run that through my for loop. This ought to give me a sample of outputs, enough to build an understanding of the STOUT model.
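One way to get partial results even if the run is interrupted is to translate in small batches and write each batch to disk as it finishes. The following is a rough sketch under a few assumptions: `translate` is any callable that maps a SMILES string to a name (for example `translate_forward`), the function name and `batch_size` default are hypothetical, and the standard `csv` module is used instead of pandas to keep the example self-contained. It will not help if the memory is consumed by loading the model itself.

```python
import csv


def translate_in_batches(smiles_list, translate, out_path, batch_size=25):
    """Translate SMILES in batches, writing each batch to a CSV as it completes.

    Returns the list of SMILES for which `translate` raised an exception.
    """
    failed = []
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "IUPAC"])
        for start in range(0, len(smiles_list), batch_size):
            batch = smiles_list[start:start + batch_size]
            for smiles in batch:
                try:
                    name = translate(smiles)
                except Exception:
                    # Record the failure and keep going with the next molecule.
                    name = ""
                    failed.append(smiles)
                writer.writerow([smiles, name])
            # Flush after every batch so finished rows survive a crash.
            handle.flush()
    return failed
```

With the variables from the code above, a call might look like `translate_in_batches(eml_smiles_list, translate_forward, "STOUT_OUT.csv")`. The returned list shows which inputs need a second look.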
Hi @GemmaTuron, I tried looking for the SMILES "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" in the EML, but could not find it there.
@masroor07 I think it's some form of caffeine, but yeah, it doesn't exist in the EML file. Probably because it is not an essential medicine?
The SMILES for caffeine is this: CN1C=NC2=C1C(=O)N(C(=O)N2C)C
@GemmaTuron I am happy to report new updates after running the file through Google Colab:
I ran the same code on Colab and here are the results: STOUT_OUT_fwd.csv
The output file, written from a pandas DataFrame, contains the regular SMILES form of each compound and its IUPAC name. I performed a similar experiment for the reverse translation as well. While I was waiting for the results to get processed, I was curious to know why this is considered a translation process and not a conversion process. I read up a bit and discovered that, from the point of view of formal language theory, a SMILES string is treated as a word, that is, a sequence of symbols. This makes it natural to apply language-processing techniques to it.
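To make the "SMILES as a word" idea concrete, here is a small sketch that splits a SMILES string into symbols the way a language model might see it. This is my own simplified regular expression, not STOUT's actual tokenizer, and it only covers common SMILES symbols; `tokenize_smiles` is a hypothetical name.

```python
import re

# Bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, then bond and branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:@~*$]")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of symbol tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        # Some character was not recognised by the simplified pattern.
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Vanillin, as used earlier in this thread.
    print(tokenize_smiles("c1(C=O)cc(OC)c(O)cc1"))
```

Once a molecule is a sequence of tokens like this, mapping it to the tokens of an IUPAC name looks much like translating a sentence from one language to another.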
Presenting my findings in this table:

| SMILES | Ersilia Model | STOUT Model |
|---|---|---|
| `CC(=O)N[C@@H](CS)C(O)=O` | (2R)-2-acetamido-3-sulfanylpropanoicacid;;; | (2R)-2-acetamido-3-sulfanylpropanoicacid;;; |
| `CC(=O)Oc1ccccc1C(O)=O` | 2-acetyloxybenzoicacid;;; | 2-acetyloxybenzoicacid;;; |
| `NC1=NC(=O)c2ncn(COCCO)c2N1` | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;; | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;; |
| `NC(N)=NC(=O)c1nc(Cl)c(N)nc1N` | ;3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide;;; | 3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide;;; |
| `CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3` | 2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one;;; | (2-butyl-1-benzofuran-3-yl)-[4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl]methanone;; |
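For a longer file, the two columns could be compared programmatically rather than by eye. Below is a minimal sketch using three rows copied from the table; stripping the stray semicolons before comparing is my own choice, and the helper names are hypothetical.

```python
def normalize(name):
    """Drop leading/trailing semicolons and spaces before comparing names."""
    return name.strip("; ")


def compare_outputs(rows):
    """Return the SMILES whose two predicted names differ after normalising.

    `rows` is a list of (smiles, ersilia_name, stout_name) tuples.
    """
    return [smiles for smiles, ersilia, stout in rows
            if normalize(ersilia) != normalize(stout)]


rows = [
    ("NC1=NC(=O)c2ncn(COCCO)c2N1",
     "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;;",
     "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one;;;"),
    ("NC(N)=NC(=O)c1nc(Cl)c(N)nc1N",
     ";3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide;;;",
     "3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide;;;"),
    ("CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3",
     "2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one;;;",
     "(2-butyl-1-benzofuran-3-yl)-[4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl]methanone;;"),
]

mismatches = compare_outputs(rows)
print(f"{len(rows) - len(mismatches)} of {len(rows)} names agree")
```

A plain string comparison only flags that two names differ. It cannot tell which of the two is chemically correct.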
Hi @alt-shreya Good job, let's move to week 3 tasks!
DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning
DTi2Vec identifies DTIs using network representation learning and ensemble learning techniques. It constructs the heterogeneous network, and then it automatically generates features for each drug and target using the nodes embedding technique. DTi2Vec is a simple yet effective method that provides high DTI prediction performance while being scalable and efficient in computation, translating into a powerful drug repositioning tool.
Thafar MA, Olayan RS, Albaradei S, Bajic VB, Gojobori T, Essack M, Gao X. DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning. J Cheminform. 2021 Sep 22;13(1):71. doi: 10.1186/s13321-021-00552-w. PMID: 34551818; PMCID: PMC8459562.
A novel supervised cancer classification framework, deep cancer subtype classification (DeepCC), based on deep learning of functional spectra quantifying activities of biological pathways. DeepCC provides a novel cancer classification framework that is platform independent, robust to missing data, and can be used for single sample prediction facilitating clinical implementation of cancer molecular subtyping.
Gao, F., Wang, W., Tan, M. et al. DeepCC: a novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 8, 44 (2019). https://doi.org/10.1038/s41389-019-0157-8
MIT License
LOTUS is a new computational approach to identify genes with high oncogenic potential. It implements a machine learning approach to learn an oncogenic potential score from known driver genes, and brings two novelties compared to existing methods.
LOTUS is a machine-learning based approach which allows to integrate various types of data in a versatile manner, including information about gene mutations and protein-protein interactions. In addition, LOTUS can predict cancer driver genes in a pan-cancer setting as well as for specific cancer types, using a multitask learning strategy to share information across cancer types.
Slug: lotus-driver-genes
Tag: Mutation databases, Gene prediction, Machine learning
Task: Classification
Language: R
Package Dependencies: none listed
O. Collier, V. Stoven and J.-P. Vert, LOTUS: a Single- and Multitask Machine Learning Algorithm for the Prediction of Cancer Driver Genes, Preprint. doi: https://doi.org/10.1101/398537
License: Creative Commons Attribution License
Hi @alt-shreya
Thanks for these suggestions. These are good models, but they are mostly focused on cancer, which is out of scope for us at the moment. Let's wrap up by preparing your final application.
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application