Closed by Ajoke23, 8 months ago
WEEK 1 DAY 1 (3rd October 2023)
DAY 2 (4th October, 2023)
wsl --install
sudo apt install build-essential
conda install gh -c conda-forge
and I used GitHub CLI to log in using the command below:
gh auth login
conda install git-lfs -c conda-forge
git-lfs install
conda activate ersilia
python -m pip install isaura==0.1
ersilia --help
and `ersilia catalog`,
which gave the following outputs: output2.txt & output3.txt. These outputs show that Ersilia was successfully installed on Ubuntu.
I then tried to fetch the model eos3b5e using the command `ersilia -v fetch eos3b5e`, but I encountered an error: connection aborted, TimeoutError(110, 'Connection timed out'). I tried debugging the error by checking online, Stack Overflow, and previous issues raised in the Ersilia repository, but none solved my problem.
DAY 3 (5th October 2023)
DAY 4 (6th October 2023)
I retried fetching the model eos3b5e, serving it, and calculating the molecular weight as required in the task, using the following commands:
ersilia -v fetch eos3b5e
ersilia -v serve eos3b5e
ersilia -v api run -i "CCCC"
and the resulting outputs were fetch.txt, serve.txt & molecular_weight.txt
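As a quick sanity check on what eos3b5e returns for a simple input like "CCCC" (butane), the molecular weight can be recomputed by hand. This is a minimal sketch, not part of the Ersilia API, and it assumes the SMILES is an unbranched alkane (a plain run of C atoms):

```python
# Sanity check for a molecular-weight model on simple alkane SMILES.
# For an unbranched alkane CnH(2n+2): MW = n*12.011 + (2n+2)*1.008.
def alkane_mol_weight(smiles: str) -> float:
    if not smiles or set(smiles) != {"C"}:
        raise ValueError("sketch only handles unbranched alkane SMILES like 'CCCC'")
    n = len(smiles)  # number of carbon atoms
    return n * 12.011 + (2 * n + 2) * 1.008

print(round(alkane_mol_weight("CCCC"), 2))  # butane, ~58.12 g/mol
```

A value near 58.12 g/mol is what the model's molecular_weight.txt output for "CCCC" should agree with.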
DAY 5 (7th October 2023)
DAY 6 (8th October 2023)
Hi @Ajoke23 thank you for the updates. I see that some items from the week 1 tasks are still pending. Please tell us if you'd like any support in completing them.
Yes, I need support. I am finding it hard to fetch model eos3b5e: I keep getting connection errors. I asked on the Slack channels and tried all the suggestions people gave, but it isn't working yet. I also found related issues in Ersilia's repository online, but none of the suggestions has worked. I am still trying my best to figure it out and will appreciate any help from you.
Hi, have you been able to fetch it?
MOTIVATION STATEMENT
I'm Ajoke Yusuf, a Data Scientist, Machine Learning enthusiast, and SDG 3 advocate. I'm a resourceful, goal-oriented individual with strong analytical and problem-solving skills and an unending quest for knowledge. I pride myself on being a fast learner who has honed strong skills in problem-solving and research. After receiving the Outreachy email, one of my aims before choosing a project was to find a project whose aim and mission align with my goal and career objective as an impact maker and an SDG 3 advocate.
I went through each of the projects and I came across Ersilia's project whose mission statement is:
"To equip laboratories in Low and Middle Income Countries with state of the art AI/ML tools for infectious and neglected disease research."
As an Engineering graduate living in Nigeria, I developed an interest in the biomedical field due to the increasing mortality rate from infectious diseases in Nigeria and sub-Saharan Africa. According to UNICEF (the United Nations International Children's Emergency Fund):
infectious diseases are the major cause of mortality in children ≤ 5 years
This was cited from here.
Research from NIH (National Library of Medicine) & NCBI (National Centre for Biotechnology Information) confirms that:
"The infrastructure and level of support for surveillance, research, and training on emerging infectious diseases
in Africa are extremely limited".
Link here
As a Data Scientist skilled in Python and machine learning, I possess strong analytical and research skills. I believe that contributing to this project will help me gain knowledge and technical skills that will help in advancing and improving health research in Nigeria.
If accepted for the 3-month internship, I'll commit myself to bringing suggestions, conducting research, and collaborating with the Ersilia team while learning and honing skills in Artificial Intelligence and Machine Learning. This internship will sharpen my research and problem-solving skills, which will be useful in the long run for the advancement of technology in the health sector and for making a sustainable impact in health research in Nigeria.
As a young lady living in Nigeria, an underdeveloped, low-income country, I have experienced the challenges of accessing tools for the infectious diseases prevalent in my community and in Nigeria at large.
After the internship, I plan to use the skills gained to improve and sustain health research tools, solve prevalent health disease issues in Nigeria, and reduce the mortality rate caused by infectious diseases, thus building sustainable research skills that will leave a long-lasting impact on the health sector in my community, Nigeria, sub-Saharan Africa, and eventually globally. I'll also continue to contribute my quota to the further success of Ersilia's project.
Day 7 (9th October, 2023)
I selected the STOUT (SMILES to IUPAC) model. My reasons for choosing this model for implementation include the following:

A. **INTEREST IN THE APPLICATION:** I remember vividly that, as a high-school science student, I always had difficulty with the nomenclature of chemical compounds, so seeing a machine learning model that could do that immediately ignited my interest. In the health sector, IUPAC names are useful in communicating the structure and properties of potential drugs, aid in drug development, and help in understanding the mechanism of action and metabolism of drugs in the body. As an SDG 3 advocate, this made me want to delve deeper into how such a model is built, because as a problem solver I would love to apply the knowledge gained from working with the model to help solve infectious disease and research-sustainability problems in the health sector in Nigeria, eventually contributing to sustainable scientific leadership among researchers.

B. **ML ALGORITHMS USED:** The journal referenced in the repository states that the model uses a deep learning method, specifically NMT (Neural Machine Translation), following the implementation of Google's NMT models for SMILES-to-IUPAC name translation. I want to understand the knowledge and thought process behind this implementation.

C. **GOAL TO BE ACHIEVED:** As a machine learning enthusiast, data scientist, problem solver, and SDG 3 advocate, this will give me a deeper understanding and the technical knowledge to execute tasks and solve problems more easily. This knowledge of NMT will make it easier to collaborate on and build models that serve as tools for researchers working on infectious disease problems in Nigeria and globally, thus helping reduce the mortality rate of infectious diseases in Nigeria (a low-income country).
Day 8 (10th October, 2023)
I followed the installation instructions on the STOUT model GitHub repository.
pip install --upgrade pip
and this output7.log shows that the upgrade was successful. I then created and activated a conda environment:
conda create --name STOUT python=3.8
conda activate STOUT
conda install -c decimer stout-pypi
pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git
and I got the same error.log I was getting with the first two methods of installation.
Day 9 (11th October, 2023) - Day 15 (17th October, 2023)
Day 16 (18th October, 2023)
pip install STOUT-pypi
using Google Colab.
from STOUT import translate_forward, translate_reverse
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)
IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of "+IUPAC_name+" is: "+SMILES)
which gave the following [output.txt](https://github.com/ersilia-os/ersilia/files/13169158/output.txt)
### **TASK 3 - RUN PREDICTION FOR THE EML PROVIDED**
DAY 17 (19 October, 2023)
- I clicked on this [link](https://raw.githubusercontent.com/ersilia-os/ersilia/master/notebooks/eml_canonical.csv), right-clicked, and selected 'Save as', which automatically downloaded the file. The downloaded file is named: [eml_canonical.csv](https://github.com/ersilia-os/ersilia/files/13170404/eml_canonical.csv).
- To run the prediction for the Essential Medicines List, I used my knowledge of Python to achieve it:
from google.colab import files
uploaded = files.upload()

import pandas as pd
import io
from STOUT import translate_forward

df = pd.read_csv(io.BytesIO(uploaded['eml_canonical.csv']))
print(df)

df = df.head(40)

def smiles_to_iupac(smiles):
    iupac = translate_forward(smiles)
    return iupac

def can_smiles_to_iupac(can_smiles):
    iupac = translate_forward(can_smiles)
    return iupac

df.loc[:, 'smiles_iupac'] = df['smiles'].apply(smiles_to_iupac)
df.loc[:, 'can_smiles_iupac'] = df['can_smiles'].apply(can_smiles_to_iupac)

smiles_iupac = df[['drugs', 'smiles', 'smiles_iupac']].copy()
can_smiles_iupac = df[['drugs', 'can_smiles', 'can_smiles_iupac']].copy()

smiles_iupac.to_csv('smiles_iupac.csv', index=False)
can_smiles_iupac.to_csv('can_smiles_iupac.csv', index=False)
Due to the long running time of executing a large volume of data on Google Colab, I decided to limit the prediction to the first 40 molecules of the eml_canonical dataset.
The output of the code is:
[smiles_iupac.csv](https://github.com/ersilia-os/ersilia/files/13170854/smiles_iupac.csv)
[can_smiles_iupac.csv](https://github.com/ersilia-os/ersilia/files/13170855/can_smiles_iupac.csv)
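The batch-translation pattern above (read a CSV, apply a translator to each row, write the results out) can also be sketched with only the standard library. Here a hypothetical stub stands in for STOUT's `translate_forward`, since the point is the row-by-row pattern, not the model itself:

```python
import csv
import io

def translate_forward_stub(smiles):
    # Hypothetical stand-in for STOUT's translate_forward; returns a placeholder.
    return f"iupac({smiles})"

def add_iupac_column(csv_text, limit=40):
    """Translate the 'smiles' column for the first `limit` rows and
    return CSV text with an added 'smiles_iupac' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["smiles_iupac"])
    writer.writeheader()
    for i, row in enumerate(reader):
        if i >= limit:  # cap rows, mirroring df.head(40)
            break
        row["smiles_iupac"] = translate_forward_stub(row["smiles"])
        writer.writerow(row)
    return out.getvalue()

sample = "drugs,smiles\nacetic acid,CC(O)=O\n"
print(add_iupac_column(sample))
```

Swapping the stub for the real `translate_forward` reproduces the pandas workflow without loading the whole file into memory at once.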
### **Task 4- Compare results with the Ersilia Model Hub implementation!**
Day 18 (20th October, 2023)
- I started the process by searching for the STOUT model identifier on [Ersilia Model Hub](https://www.ersilia.io/model-hub)
- On seeing the STOUT: SMILES to IUPAC name translator on [Ersilia Model Hub](https://www.ersilia.io/model-hub), I clicked on the [GitHub](https://github.com/ersilia-os/eos4se9) button
- The STOUT model has an EOS model ID: `eos4se9` and the name of the Slug is: `smiles2iupac`
- I used the following code to fetch, serve and run the model prediction
ersilia -v fetch eos4se9
ersilia -v serve eos4se9
ersilia -v api run -i smiles_iupac.csv -o smilesoutput.csv
I successfully fetched and [served](https://github.com/ersilia-os/ersilia/files/13184630/modelserve.log) the model, but after running the model prediction I noticed that the iupacs_names column was empty, which implied that I got no output, i.e. there was no IUPAC name translation for the SMILES input.
Output file: [smilesoutput.csv](https://github.com/ersilia-os/ersilia/files/13184672/smilesoutput.csv)
- Trying to debug where the problem came from, I decided to run the model with an input string using this command `ersilia -v api run -i "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"`. Output shown below:
{
  "input": {
    "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
    "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
    "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
  },
  "output": {
    "outcome": [
      null
    ]
  }
}
I was expecting to get an iupac_name but I got a null value as outcome.
Day 19 (21st October, 2023)
- In the process of debugging, I noticed that another contributor also faced the same challenges and I saw @HellenNamulinda [suggestion](https://github.com/ersilia-os/ersilia/issues/821#issuecomment-1759694897) regarding the issue so I tried the option of fetching the model from GitHub, by adding the `--from_github` flag in the command.
- Command used: `ersilia -v fetch eos4se9 --from_github > eos4se9_fetch_github.log 2>&1`, and the output can be found in [eos4se9_fetch_github.log](https://github.com/ersilia-os/ersilia/files/13187659/eos4se9_fetch_github.log). This shows I successfully fetched the model.
- I went ahead to run a model prediction again with an input string, using the command I had used previously: `ersilia -v api run -i 'Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1'`. The output is below, and the log file can be found in [input_outcome.log](https://github.com/ersilia-os/ersilia/files/13187950/input_outcome.log)
{
  "input": {
    "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
    "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
    "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
  },
  "output": {
    "outcome": [
      "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
    ]
  }
}
- I decided to try another input command using this code: `ersilia -v api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"`
Output:
{
  "input": {
    "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
    "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
    "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
  },
  "output": {
    "outcome": [
      "5-[(5-nitro-1,3-thiazol-2-yl)sulfanyl]-1,3,4-thiadiazol-2-amine"
    ]
  }
}
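Since the API responses place the result under output → outcome, a small helper makes the null-vs-value check explicit. This is a sketch assuming only the response shape shown above, not an Ersilia utility:

```python
import json

def extract_outcome(response_text):
    """Return the first outcome from an Ersilia API response,
    or None when the model produced no translation (outcome is null)."""
    data = json.loads(response_text)
    outcomes = data.get("output", {}).get("outcome", [])
    return outcomes[0] if outcomes else None

ok = '{"output": {"outcome": ["aceticacid"]}}'
empty = '{"output": {"outcome": [null]}}'
print(extract_outcome(ok))     # aceticacid
print(extract_outcome(empty))  # None
```

Checking for `None` this way is what distinguishes the failed DockerHub fetch (null outcome) from the successful `--from_github` run.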
**Summary**
I used the commands below to run the Ersilia STOUT prediction on the first 40 molecules of the EML dataset:
ersilia -v fetch eos4se9 --from_github > eos4se9_fetch_github.log 2>&1
ersilia -v serve eos4se9 > eos4se9_serve_model.log 2>&1
ersilia -v api run -i smiles_iupac.csv -o smiles_output.csv
[ersilia_smiles_output](https://github.com/ersilia-os/ersilia/files/13189450/smiles_output.csv)- Ersilia STOUT prediction of smiles to iupac
### **COMPARISON OF SMILES OUTPUT PREDICTION USING ERSILIA AND STOUT PREDICTION**
ERSILIA prediction output: [smiles_output.csv](https://github.com/ersilia-os/ersilia/files/13189683/smiles_output.csv)
STOUT prediction output: [smiles_iupac.csv](https://github.com/ersilia-os/ersilia/files/13189687/smiles_iupac.csv)
Since I had two different CSV files containing the Ersilia prediction output and the STOUT prediction output, I decided to merge the two datasets using my knowledge of Python, selecting the necessary columns to show as output,
using the Python code below
import pandas as pd
smiles_iupac = pd.read_csv(r"\\wsl.localhost\Ubuntu\home\ajoke\smiles_iupac.csv")
smiles_output = pd.read_csv(r"\\wsl.localhost\Ubuntu\home\ajoke\smiles_output.csv")
merged_dataset = smiles_iupac.merge(smiles_output, left_on='smiles', right_on='input')
merged_dataset.rename(columns={'input': 'input/smiles', 'iupacs_names': 'Ersilia STOUT prediction', 'smiles_iupac': 'STOUT prediction'}, inplace=True)
merged_dataset = merged_dataset[['Ersilia STOUT prediction', 'STOUT prediction', 'input/smiles', 'drugs']]
merged_dataset
merged_dataset.to_csv('comparison_dataset.csv', index=False)
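The same join can be done without pandas by indexing one file on its key column. This is a standard-library sketch under the same assumption as the pandas code above, namely that 'smiles' in the STOUT file matches 'input' in the Ersilia file:

```python
import csv
import io

def merge_on_smiles(stout_csv, ersilia_csv):
    """Inner-join STOUT rows (keyed on 'smiles') with Ersilia rows
    (keyed on 'input'), mirroring merge(left_on='smiles', right_on='input')."""
    ersilia_by_input = {
        row["input"]: row for row in csv.DictReader(io.StringIO(ersilia_csv))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(stout_csv)):
        match = ersilia_by_input.get(row["smiles"])
        if match is not None:  # keep only rows present in both files
            merged.append({
                "Ersilia STOUT prediction": match["iupacs_names"],
                "STOUT prediction": row["smiles_iupac"],
                "input/smiles": row["smiles"],
                "drugs": row["drugs"],
            })
    return merged

stout = "drugs,smiles,smiles_iupac\nacetic acid,CC(O)=O,aceticacid\n"
ersilia = "input,iupacs_names\nCC(O)=O,aceticacid\n"
print(merge_on_smiles(stout, ersilia))
```

Building a dictionary on the join key gives the same inner-join behaviour as `DataFrame.merge` for this comparison.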
[merged_output](https://github.com/ersilia-os/ersilia/files/13196354/comparison_dataset.csv) - Output of the merged dataset in CSV format. Below is the table format.
Ersilia STOUT prediction | STOUT prediction | input/smiles | drugs
-- | -- | -- | --
[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol | [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol | Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 | abacavir
(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol | (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol | C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | abiraterone
N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide | N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide | CC(=O)Nc1sc(nn1)[S](N)(=O)=O | acetazolamide
aceticacid | aceticacid | CC(O)=O | acetic acid
(2R)-2-acetamido-3-sulfanylpropanoicacid | (2R)-2-acetamido-3-sulfanylpropanoicacid | CC(=O)N[C@@H](CS)C(O)=O | acetylcysteine
2-acetyloxybenzoicacid | 2-acetyloxybenzoicacid | CC(=O)Oc1ccccc1C(O)=O | acetylsalicylic acid
2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one | NC1=NC(=O)c2ncn(COCCO)c2N1 | aciclovir
2-[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]oxy-1,1-dithiophen-2-ylethanol | [(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]2-hydroxy-2,2-dithiophen-2-ylacetate | OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1 | aclidinium
(E)-N-[6-[[(3-chloro-4-fluorocyclohexa-1,4-dien-1-yl)amino]methylidene]-3-[(3S)-oxolan-3-yl]oxycyclopenta[d]pyrimidin-2-yl]-4-(dimethylamino)but-2-enamide | (E)-N-[4-(3-chloro-4-fluoroanilino)-7-[(3S)-oxolan-3-yl]oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide | CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1 | afatinib
methylN-(6-propylsulfanyl-1H-benzimidazol-2-yl)carbamate | methylN-(6-propylsulfanyl-1H-benzimidazol-2-yl)carbamate | CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 | albendazole
1,2-dihydropyrazolo[3,4-d]pyrimidin-4-one | 1,2-dihydropyrazolo[3,4-d]pyrimidin-4-one | O=C1N=CN=C2NNC=C12 | allopurinol
5-acetamido-2,4,6-triiodo-3-(1-oxoethylamino)cyclohexa-4,6-diene-1-carboxylicacid | 3,5-diacetamido-2,4,6-triiodobenzoicacid | CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(O)=O)c1I | amidotrizoate
(2S)-N-[(1R,2R,3R,5S,6R)-5-amino-2-[(2R,3R,4R,5R,6R)-3-amino-4,5,6-trihydroxyoxan-2-yl]oxy-3-[(2R,3S,4R,5R)-5-amino-1,3,4,6-tetrahydroxyhexan-2-yl]oxy-1-hydroxyoxetan-6-yl]-2-hydroxy-4-(methylamino)butanamide | (2S)-4-amino-N-[(1R,2S,3S,4R,5S)-5-amino-2-[(2S,3R,4S,5S,6R)-4-amino-3,5-dihydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-[(2R,3R,4S,5S,6R)-6-(aminomethyl)-3,4,5-trihydroxyoxan-2-yl]oxy-3-hydroxycyclohexyl]-2-hydroxybutanamide | NCC[C@H](O)C(=O)N[C@@H]1C[C@H](N)[C@@H](O[C@H]2O[C@H](CN)[C@@H](O)[C@H](O)[C@H]2O)[C@H](O)[C@H]1O[C@H]3O[C@H](CO)[C@@H](O)[C@H](N)[C@H]3O | amikacin
3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide | 3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide | NC(N)=NC(=O)c1nc(Cl)c(N)nc1N | amiloride
2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one | (2-butyl-1-benzofuran-3-yl)-[4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl]methanone | CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3 | amiodarone
N,N-dimethyl-3-(2-tricyclo[9.4.0.03,8]pentadeca-1(15),3,5,7,11,13-hexaenylidene)propan-1-amine | N,N-dimethyl-3-(2-tricyclo[9.4.0.03,8]pentadeca-1(15),3,5,7,11,13-hexaenylidene)propan-1-amine | CN(C)CCC=C1c2ccccc2CCc3ccccc13 | amitriptyline
ethyl2-(2-aminoethoxymethyl)-4-[[3-(2-chlorophenyl)-4-methoxy-4-oxobut-2-en-2-yl]amino]cyclopenta-1,3-diene-1-carboxylate | 3-O-ethyl5-O-methyl2-(2-aminoethoxymethyl)-4-(2-chlorophenyl)-6-methyl-1,4-dihydropyridine-3,5-dicarboxylate | CCOC(=O)C1=C(COCCN)NC(=C(C1c2ccccc2Cl)C(=O)OC)C | amlodipine
12-chloro-7-(diethylaminomethyl)-2,9-diazatricyclo[8.4.0.03,8]tetradeca-1(14),4,6,9,10,13-hexaen-6-ol | 4-[(7-chloroquinolin-4-yl)amino]-2-(diethylaminomethyl)phenol | CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O | amodiaquine
(2S,5R,6R)-5-[[(2R)-2-amino-2-(4-hydroxycyclohexa-1,3,5-trien-1-yl)acetyl]amino]-3,3-dimethyl-8-oxo-4-thia-1,7-diazabicyclo[4.3.0]nonane-2-carboxylicacid;tetrahydrate | (2S,5R,6R)-6-[[(2R)-2-amino-2-(4-hydroxyphenyl)acetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylicacid;trihydrate | O.O.O.CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccc(O)cc3)C(=O)N2[C@H]1C(O)=O | amoxicillin
(1S,3S,5S,7S,9R,10R,13R,18S,19R,20R,21S,22Z,24Z,26Z,28Z,30Z,32Z,34Z,36Z,38Z,40S,41R)-1-[(2S,3S,4R,5S,6R)-4-amino-3,5-dihydroxy-6-[(2R,3S,4R,5S,6R)-5-amino-3,4-dihydroxyoxan-2-yl]oxan-2-yl]oxy-3,5,7,9,10,13,18,41-octahydroxy-19,20,21-trimethyl-15-oxo-4,16,42-trioxatricyclo[37.2.1.03,5]dotetraconta-22,24,26,28,30,32,34,36,38-nonaene-40-carboxylicacid | (1R,3S,5R,6R,9R,11R,15S,16R,17R,18S,19Z,21Z,23Z,25Z,27Z,29Z,31Z,33R,35S,36R,37S)-33-[(2R,3S,4S,5S,6R)-4-amino-3,5-dihydroxy-6-methyloxan-2-yl]oxy-1,3,5,6,9,11,17,37-octahydroxy-15,16,18-trimethyl-13-oxo-14,39-dioxabicyclo[33.3.1]nonatriaconta-19,21,23,25,27,29,31-heptaene-36-carboxylicacid | C[C@H]1O[C@@H](O[C@@H]\2C[C@@H]3O[C@](O)(C[C@@H](O)C[C@@H](O)[C@H](O)CC[C@@H](O)C[C@@H](O)CC(=O)O[C@@H](C)[C@H](C)[C@H](O)[C@@H](C)\C=C/C=C\C=C/C=C\C=C/C=C\C=C2)C[C@H](O)[C@H]3C(O)=O)[C@@H](O)[C@@H](N)[C@@H]1O | amphotericin B
(2S,5R,6R)-7-[[(2R)-2-amino-2-phenylacetyl]amino]-3,3-dimethyl-8-oxo-4-thia-1,7-diazabicyclo[3.3.0]octane-2-carboxylicacid | (2S,5R,6R)-6-[[(2R)-2-amino-2-phenylacetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylicacid | CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccccc3)C(=O)N2[C@H]1C(O)=O | ampicillin
5-[3-(2-cyanopropan-2-yl)-6-(1,2,4-triazol-1-ylmethyl)cyclohexa-2,4-dien-1-yl]-2,2-dimethylbutanenitrile | 2-[3-(2-cyanopropan-2-yl)-5-(1,2,4-triazol-1-ylmethyl)phenyl]-2-methylpropanenitrile | CC(C)(C#N)c1cc(Cn2cncn2)cc(c1)C(C)(C)C#N | anastrozole
(4S,6R,7S,10S,11S,14S,15S,16S,20S,23R,26S)-16,17,23,26-tetrahydroxy-7-(4-hydroxycyclohexa-1,3,5-trien-1-yl)-11-(4-pentoxycyclohexa-2,4,6-trien-1-ylidene)-2-[[(2S,3S,4S)-3,4-dihydroxy-4-(4-hydroxycyclohexa-1,3,5-trien-1-yl)-2-[[(3S,4S,6R)-4-hydroxy-1-[(2S,3S)-3-hydroxybutan-2-yl]-2,6-dioxopiperazine-3-carbonyl]amino]butanoyl]amino]-14-methyl-2,5,12,17,24-hexazapentacyclo[24.2.2.218,21.04,10.06,14]dotriaconta-1(29),18(30),19,21(31),27,32-hexaene-3,11,13-trione | N-[(3S,6S,9S,11R,15S,18S,20R,21R,24S,25S,26S)-6-[(1S,2S)-1,2-dihydroxy-2-(4-hydroxyphenyl)ethyl]-11,20,21,25-tetrahydroxy-3,15-bis[(1S)-1-hydroxyethyl]-26-methyl-2,5,8,14,17,23-hexaoxo-1,4,7,13,16,22-hexazatricyclo[22.3.0.09,13]heptacosan-18-yl]-4-[4-(4-pentoxyphenyl)phenyl]benzamide | CCCCCOc1ccc(cc1)c2ccc(cc2)c3ccc(cc3)C(=O)N[C@H]4C[C@@H](O)[C@@H](O)NC(=O)[C@@H]5[C@@H](O)[C@@H](C)CN5C(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]6C[C@@H](O)CN6C(=O)[C@@H](NC4=O)[C@H](C)O)[C@H](O)[C@@H](O)c7ccc(O)cc7)[C@H](C)O | anidulafungin
14-[amino(oxo)methyl]-12-(4-methoxycyclohepta-2,4,6-trien-1-ylidene)-5-(2-oxopiperidin-1-yl)-4,11,12-triazatricyclo[7.3.2.14,8]pentadeca-1(13),6,8(15),10-tetraen-15-one | 1-(4-methoxyphenyl)-7-oxo-6-[4-(2-oxopiperidin-1-yl)phenyl]-4,5-dihydropyrazolo[3,4-c]pyridine-3-carboxamide | COc1ccc(cc1)n2nc(C(N)=O)c3CCN(C(=O)c23)c4ccc(cc4)N5CCCCC5=O | apixaban
(5R,6S)-5-(4-fluorocyclohepta-1,3,6-trien-1-yl)-6-[(1R)-1-[5,5,5-trifluoro-4-(trifluoromethyl)penta-1,3-dienyl]ethoxy]-1,2,5,6-tetrahydro-1,4,7-oxadiazocin-3-one | 5-[[(2S,3R)-2-[(1R)-1-[3,5-bis(trifluoromethyl)phenyl]ethoxy]-3-(4-fluorophenyl)morpholin-4-yl]methyl]-1,2-dihydro-1,2,4-triazol-3-one | C[C@@H](O[C@@H]1OCCN(CC2=NC(=O)NN2)[C@@H]1c3ccc(F)cc3)c4cc(cc(c4)C(F)(F)F)C(F)(F)F | aprepitant
arsorosooxy(oxo)arsane | oxoarsanyloxyarsenic | O=[As]O[As]=O | arsenic trioxide
(1R,4S,5R,8S,9R,10S,12S,13S)-10-methoxy-5,9-dimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecane | (1R,4S,5R,8S,9R,10S,12R,13R)-10-methoxy-1,5,9-trimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecane | CO[C@H]1O[C@@H]2O[C@@]3(C)CC[C@H]4[C@H](C)CC[C@@H]([C@H]1C)[C@@]24OO3 | artemether
4-oxo-4-[(1S,4R,5S,8S,9R,10S,15S)-4,9,12-trimethyl-11,16,17,18-tetraoxatetracyclo[10.3.2.05,15.08,15]heptadecan-10-yl]butanoicacid | 4-oxo-4-[[(4S,5R,8S,9R,10R,12R,13R)-1,5,9-trimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecan-10-yl]oxy]butanoicacid | C[C@@H]1CC[C@H]2[C@@H](C)[C@H](O[C@@H]3OC4(C)CC[C@@H]1[C@@]23OO4)OC(=O)CCC(O)=O | artesunate
5-(1,2-dihydroxyethyl)-4-methylidenefuran-2,3-diol | 2-(1,2-dihydroxyethyl)-4,5-dihydroxyfuran-3-one | OCC(O)C1OC(=C(O)C1=O)O | ascorbic acid
methylN-[(2S)-1-[2-[(2S,3S)-2-hydroxy-3-[[(2S)-2-(methoxycarbonylamino)-3,3-dimethylbutanoyl]amino]-4-phenylbutyl]-2-[(4-pyridin-2-ylcyclohexa-2,5-dien-1-yl)methyl]hydrazinyl]-3,3-dimethyl-1-oxobutan-2-yl]carbamate | methylN-[(2S)-1-[2-[(2S,3S)-2-hydroxy-3-[[(2S)-2-(methoxycarbonylamino)-3,3-dimethylbutanoyl]amino]-4-phenylbutyl]-2-[(4-pyridin-2-ylphenyl)methyl]hydrazinyl]-3,3-dimethyl-1-oxobutan-2-yl]carbamate | COC(=O)N[C@H](C(=O)N[C@@H](Cc1ccccc1)[C@@H](O)CN(Cc2ccc(cc2)c3ccccn3)NC(=O)[C@@H](NC(=O)OC)C(C)(C)C)C(C)(C)C | atazanavir
(3R,5R)-7-[2-(4-fluorocyclohepta-2,4,6-trien-1-ylidene)-3-phenyl-4-(phenylcarbamoyl)-5-propan-2-yl-3H-pyrrol-1-yl]-3,5-dihydroxyheptanoicacid | (3R,5R)-7-[2-(4-fluorophenyl)-3-phenyl-4-(phenylcarbamoyl)-5-propan-2-ylpyrrol-1-yl]-3,5-dihydroxyheptanoicacid | CC(C)c1n(CC[C@@H](O)C[C@@H](O)CC(O)=O)c(c2ccc(F)cc2)c(c3ccccc3)c1C(=O)Nc4ccccc4 | atorvastatin
5-[3-[1-[(4,5-dimethoxycyclohexa-1,3,5-trien-1-yl)methyl]-7,8-dimethoxy-2-methyl-1,3,4,6-tetrahydroisoquinolin-2-ium-2-yl]propanoyloxy]pentyl13-[4-[2-[4,5,6-trimethoxy-10-(4,5-dimethoxycyclohexa-2,4-dien-1-ylidene)cyclobut-2-en-1-yl]ethyl]-4-methyl-7-oxo-1-oxa-4-azoniacyclononan-1-yl]propanoate | 5-[3-[1-[(3,4-dimethoxyphenyl)methyl]-6,7-dimethoxy-2-methyl-3,4-dihydro-1H-isoquinolin-2-ium-2-yl]propanoyloxy]pentyl3-[1-[(3,4-dimethoxyphenyl)methyl]-6,7-dimethoxy-2-methyl-3,4-dihydro-1H-isoquinolin-2-ium-2-yl]propanoate | COc1ccc(CC2c3cc(OC)c(OC)cc3CC[N+]2(C)CCC(=O)OCCCCCOC(=O)CC[N+]4(C)CCc5cc(OC)c(OC)cc5C4Cc6ccc(OC)c(OC)c6)cc1OC | atracurium
(9-methyl-4-oxa-9-azabicyclo[4.2.1]nonan-5-yl)3-hydroxy-2-phenylpropanoate | (8-methyl-8-azabicyclo[3.2.1]octan-3-yl)3-hydroxy-2-phenylpropanoate | CN1C2CCC1CC(C2)OC(=O)C(CO)c3ccccc3 | atropine
[(2S,5R)-2-(carbamoyl)-7-oxo-1,6-diazabicyclo[3.2.1]octan-6-yl]hydrogensulfate | [(2S,5R)-2-carbamoyl-7-oxo-1,6-diazabicyclo[3.2.1]octan-6-yl]hydrogensulfate | NC(=O)[C@@H]1CC[C@@H]2CN1C(=O)N2OS(O)(=O)=O | avibactam
1-methyl-4-nitro-5-(7H-purin-6-ylsulfanyl)-4H-pyrimidine | 6-(3-methyl-5-nitroimidazol-4-yl)sulfanyl-7H-purine | Cn1cnc(c1Sc2ncnc3nc[nH]c23)[N+]([O-])=O | azathioprine
(2R,3S,5R,6S,7R,9S)-7-[(2R,4R)-5-[[(2R,3R,4R,5R)-4,5-dihydroxy-3-methoxy-5-methyloxan-2-yl]-methylamino]-2-hydroxy-4-methylpentan-2-yl]-9-[(2R,4S,5S,6S)-4-(dimethylamino)-5-hydroxypentan-2-yl]oxy-3-ethyl-6-hydroxy-2,6-dimethyl-4-[(2R,4R,5S,6S)-5-hydroxy-4-methoxy-4-methyloxan-2-yl]oxyoxonan-1-one | (2R,3S,4R,5R,8R,10R,11R,13S,14R)-11-[(2S,3R,4S,6R)-4-(dimethylamino)-3-hydroxy-6-methyloxan-2-yl]oxy-2-ethyl-3,4,10-trihydroxy-13-[(2R,4R,5S,6S)-5-hydroxy-4-methoxy-4,6-dimethyloxan-2-yl]oxy-3,5,6,8,10,12,14-heptamethyl-1-oxa-6-azacyclopentadecan-15-one | CC[C@H]1OC(=O)[C@H](C)[C@@H](O[C@H]2C[C@@](C)(OC)[C@@H](O)[C@H](C)O2)C(C)[C@@H](O[C@@H]3O[C@H](C)C[C@@H]([C@H]3O)N(C)C)[C@](C)(O)C[C@@H](C)CN(C)[C@H](C)[C@@H](O)[C@]1(C)O | azithromycin
barium(2+);sulfate | barium(2+);sulfate | [Ba++].[O-][S]([O-])(=O)=O | barium sulfate
(1S,10S,11S,13S,14S,15S,17S)-18-chloro-14,17-dihydroxy-14-(2-hydroxyacetyl)-13,15,18-trimethyltetracyclo[8.7.1.01,6.011,15]octadeca-2,5-dien-4-one | (8S,9R,10S,11S,13S,14S,16S,17R)-9-chloro-11,17-dihydroxy-17-(2-hydroxyacetyl)-10,13,16-trimethyl-6,7,8,11,12,14,15,16-octahydrocyclopenta[a]phenanthren-3-one | C[C@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(Cl)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO | beclometasone
(1R,2S)-1-(6-bromo-2-methoxyquinolin-3-yl)-4-(dimethylamino)-2-naphthalen-1-yl-1-phenylbutan-2-ol | (1R,2S)-1-(6-bromo-2-methoxyquinolin-3-yl)-4-(dimethylamino)-2-naphthalen-1-yl-1-phenylbutan-2-ol | COC1=NC2=C(C=C(Br)C=C2)C=C1[C@@H](C1=CC=CC=C1)[C@@](O)(CCN(C)C)C1=CC=CC2=C1C=CC=C2 | bedaquiline
11-[bis(2-chloroethyl)amino]-4-methyl-2,4-diazabicyclo[7.3.1]trideca-1(12),2,9-triene-3-carboxylicacid | 4-[5-[bis(2-chloroethyl)amino]-1-methylbenzimidazol-2-yl]butanoicacid | Cn1c(CCCC(O)=O)nc2cc(ccc12)N(CCCl)CCCl | bendamustine
From the table above, I noticed that a few predictions were different, so I decided to check online to validate these compounds, and I found that PubChem provides reference IUPAC names. I listed the drugs whose predictions differ (slightly or substantially) between Ersilia and STOUT.
Provided below are links to the IUPAC names of those drugs on PubChem:
[Abacavir](https://pubchem.ncbi.nlm.nih.gov/compound/441300#section=Names-and-Identifiers)
[Abiraterone](https://pubchem.ncbi.nlm.nih.gov/compound/132971#section=Names-and-Identifiers)
[Acetazolamide](https://pubchem.ncbi.nlm.nih.gov/compound/1986#section=Names-and-Identifiers)
[Aclidinium](https://pubchem.ncbi.nlm.nih.gov/compound/11519741#section=Names-and-Identifiers)
[Afatinib](https://pubchem.ncbi.nlm.nih.gov/#query=afatinib)
[Amikacin](https://pubchem.ncbi.nlm.nih.gov/#query=amikacin)
[Amlodipine](https://pubchem.ncbi.nlm.nih.gov/#query=amlodipine)
[Anidulafungin](https://pubchem.ncbi.nlm.nih.gov/#query=anidulafungin)
[Apixaban](https://pubchem.ncbi.nlm.nih.gov/#query=apixaban)
[Ascorbic Acid](https://pubchem.ncbi.nlm.nih.gov/#query=ascorbic%20acid)
[Bendamustine](https://pubchem.ncbi.nlm.nih.gov/#query=bendamustine)
### **Observation**
I noticed that the STOUT predictions are 100% accurate in comparison with the PubChem IUPAC names, while the Ersilia predictions are 80% accurate against the PubChem IUPAC names.
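Agreement figures like these can be computed by comparing the two prediction columns row by row. A small sketch, normalising whitespace since the predicted names sometimes concatenate words (e.g. "aceticacid" vs "acetic acid"):

```python
def agreement(preds_a, preds_b):
    """Fraction of positions where two prediction lists agree,
    ignoring whitespace and case differences."""
    norm = lambda s: "".join(s.split()).lower()
    matches = sum(norm(a) == norm(b) for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)

stout = ["aceticacid", "2-acetyloxybenzoicacid", "a", "b", "c"]
ersilia = ["acetic acid", "2-acetyloxybenzoicacid", "a", "b", "other"]
print(agreement(stout, ersilia))  # 0.8
```

The same function applied against a column of PubChem names would give the percentage-accuracy figures quoted above.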
### **Task 5 - Install and run Docker!**
**Day 20 (22nd October 2023)**
This was my first hands-on experience with Docker, so I took my time to read the [docker documentation](https://docs.docker.com/get-started/) to understand the installation process, the functionality, and the command-line interface.
These are the steps and commands I used to install Docker on Ubuntu:
1. Updated the local package using `sudo apt update`
2. Installed required dependencies
`sudo apt install -y apt-transport-https ca-certificates curl software-properties-common`
After running this, I got an error.
The error is: `E: Sub-process /usr/bin/dpkg returned an error code (1)`.
I checked online and saw that the error could be the result of broken dependencies. I ran the following commands, which fixed the error:
sudo dpkg --configure -a
sudo apt --fix-broken install
3. Added Docker's repository and Docker's official GPG key to verify the package system
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
4. Next, I installed Docker Engine using this command: `sudo apt install -y docker-ce docker-ce-cli containerd.io`
5. Then, I started and enabled the docker
sudo systemctl start docker
sudo systemctl enable docker
7. To verify that I had successfully installed Docker, I checked the installed version using the command `docker --version`, and this [docker_output](https://github.com/ersilia-os/ersilia/files/13196593/docker_output.txt) shows it was installed successfully.
7. I ran `docker ps` to test docker and my output was:
(base) ajoke@DESKTOP-KTJU3QV:~$ docker ps
CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
This confirms that I had no containers running at the time.
8. To ensure its functionality, I tested Docker by running `docker run hello-world` to pull and run a container. The resulting output is below:
(base) ajoke@DESKTOP-KTJU3QV:~$ docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID: https://hub.docker.com/
For more examples and ideas, visit: https://docs.docker.com/get-started/
Since this is my first hands-on experience with Docker, I decided to explore further by pulling one of Ersilia's models with Docker. I did the following:
- I pulled out model `eos4se9` using this command `docker pull ersiliaos/eos4se9` and I got the following output
(base) ajoke@DESKTOP-KTJU3QV:~$ docker pull ersiliaos/eos4se9
Using default tag: latest
latest: Pulling from ersiliaos/eos4se9
8b91b88d5577: Pull complete
824416e23423: Pull complete
bbe2c2981082: Pull complete
7b6b68d15a5c: Pull complete
71f8f4db541d: Pull complete
4f4fb700ef54: Pull complete
278266b40c52: Pull complete
4298588a86ad: Pull complete
dddca77c0f59: Pull complete
a113a2030c72: Pull complete
0c8571d61669: Pull complete
Digest: sha256:3c0b4dab7a313bfb33c74b45ca378f7d69b0b9dbaaf843357780180910af31ab
Status: Downloaded newer image for ersiliaos/eos4se9:latest
docker.io/ersiliaos/eos4se9:latest
- Then I proceeded to run the model `eos4se9` using this command: `docker run ersiliaos/eos4se9`
- I ran `docker ps` and got this output:
(base) ajoke@DESKTOP-KTJU3QV:~$ docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED          STATUS          PORTS    NAMES
d6e3b2b2b5fd   ersiliaos/eos4se9   "sh /root/docker-ent…"   50 seconds ago   Up 15 seconds   83/tcp   pedantic_antonelli
Hello @Ajoke23, From your updated comment here, it appears you were able to fetch the model and also completed all the week 1 tasks. Well done :clap:!
Please proceed to week 2 tasks and update each in a comment for faster follow-up. In case you need any help, kindly let us know.
Yes, I was able to fetch the model successfully. Thanks a lot @HellenNamulinda
Hello @Ajoke23, You are yet to complete week 2 tasks. Is there any way we can support you?
Hi @HellenNamulinda I had a lot of challenges doing the week 2 task based on my model of interest, STOUT. But I have been able to figure it out and I will update it soon. Thank you
Hi @Ajoke23 thank you for the updates. Let us know how it goes! :)
Date of Publication: 13th September, 2023
Publication: Genome Medicine
License: Creative Commons
Dataset: Dataset Used
Source Code: https://github.com/ChloeHJ/TRAP
Code: Python and R
Slug: TRAP
The TRAP model uses deep learning to predict immunogenicity and a decision-tree classifier to estimate the degree of correctness of each prediction. It uses features such as amino acids at contact positions, hydrophobicity, large and aromatic side chains, and peptide-MHC binding affinity, which correlate with T-cell recognition, to give robust prediction of CD8+ T-cell epitopes from MHC-I ligands.
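As a toy illustration of the kind of features TRAP combines (this is not TRAP's actual pipeline, and the contact positions chosen here are hypothetical), the sketch below computes a mean Kyte-Doolittle hydrophobicity over a few positions of a peptide:

```python
# Illustrative only: a hydrophobicity feature of the kind TRAP combines
# with peptide-MHC binding affinity. Kyte-Doolittle scale per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def contact_hydrophobicity(peptide, contact_positions=(3, 4, 5, 6)):
    """Mean hydrophobicity over (0-indexed) positions; the default
    positions are a made-up example, not TRAP's definition."""
    scores = [KYTE_DOOLITTLE[peptide[i]] for i in contact_positions
              if i < len(peptide)]
    return sum(scores) / len(scores)

print(contact_hydrophobicity("SIINFEKL"))  # classic model epitope -> -2.025
```

Real feature extraction in TRAP is described in the linked repository; this only shows how a per-residue scale turns a peptide into a numeric feature.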
The repository has a well-detailed installation process. The following steps are needed to install the TRAP model. First, I ran
`git clone https://github.com/ChloeHJ/TRAP.git`
on Ubuntu and got the output below, which shows the repository was cloned successfully:
(base) ajoke@DESKTOP-KTJU3QV:~$ git clone https://github.com/ChloeHJ/TRAP.git
Cloning into 'TRAP'...
remote: Enumerating objects: 60, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 60 (delta 22), reused 10 (delta 4), pack-reused 0
Receiving objects: 100% (60/60), 7.94 MiB | 643.00 KiB/s, done.
Resolving deltas: 100% (22/22), done
conda create -n trap python=3.9
conda activate trap
pip install -r requirements.txt
Publication: BMC Bioinformatics
Year of Publication: 2023
Authors: Yue Wu, Xinran Ni, Zhihao Wang & Weike Feng
Slug: FREL
Source Code: https://github.com/Ruowu9944/FREL
Dataset: GraphMVP, MoleculeNet
License: None
Code: Python
The model uses a neural-network approach, FRagment-based dual-channEL pretraining (FREL), which combines generative learning and contrastive learning to achieve intra- and inter-molecular agreement. Molecular fragments provide a deeper understanding of the underlying molecular mechanisms, which can help researchers customize drug designs tailored to specific diseases and patient populations. The study shows that the learned molecular representations better capture drug-property variation and fragment semantics, giving insight into the relationship between molecular fragments and drug discovery.
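FREL's exact objective lives in the linked repository; as a generic illustration of the contrastive-learning half, here is a minimal NumPy sketch of an InfoNCE-style loss, where matched "views" (the same row of the two matrices) are pulled together and mismatched rows pushed apart:

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """Generic InfoNCE loss (not FREL's actual objective): row i of
    `anchor` should match row i of `positive` and repel all other rows."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                       # 8 hypothetical fragment embeddings
print(info_nce(z, z))                              # perfectly aligned views: low loss
print(info_nce(z, rng.normal(size=(8, 16))))       # random views: higher loss
```

In FREL the two "channels" are learned fragment representations rather than random vectors, but the pull-together/push-apart mechanics are the same.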
The following versions of the dependencies must be met:
numpy 1.21.2
scikit-learn 1.0.2
pandas 1.3.4
python 3.7.11
torch 1.10.2+cu113
torch-geometric 2.0.3
transformers 4.17.0
rdkit 2020.09.1.0
ase 3.22.1
descriptastorus 2.3.0.5
ogb 1.3.3
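A quick way to confirm an environment meets pins like these is to compare against `importlib.metadata` (the subset of packages below is illustrative; extend the dict to the full list):

```python
from importlib.metadata import version, PackageNotFoundError

# An illustrative subset of the pins listed above.
REQUIRED = {
    "numpy": "1.21.2",
    "pandas": "1.3.4",
    "scikit-learn": "1.0.2",
}

def check_pins(required):
    """Return {package: (installed_version, pinned_version)};
    installed_version is None when the package is absent."""
    report = {}
    for pkg, pin in required.items():
        try:
            report[pkg] = (version(pkg), pin)
        except PackageNotFoundError:
            report[pkg] = (None, pin)
    return report

for pkg, (have, want) in check_pins(REQUIRED).items():
    print(f"{pkg}: installed={have}, pinned={want}")
```

This only reports mismatches; installing the exact pins is still done with `pip install -r requirements.txt` inside the conda environment.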
cd datasets
python molecule_preparation.py
cd src
python pretrain_cls.py --dropout_ratio=0
python pretrain_reg.py --dropout_ratio=0
To fine-tune the classification model, run the following code:
python finetune_cls.py --dropout_ratio=0.5 --dataset=bace
To fine-tune the regression model, run the following code:
python finetune_reg.py --dropout_ratio=0.5 --dataset=esol
Date of Publication: 29th May, 2023
Publication: Journal of Cheminformatics
Authors: Umit V. Ucak, Islambek Ashyrmamatov & Juyoung Lee
Dataset: data
Source Code: https://github.com/snu-lcbc/atom-in-SMILES
Slug: AIS
Code: Python
License: CC BY-SA 4.0
DESCRIPTION
Atom-in-SMILES (AIS) builds on tokenization schemes, a preprocessing step in NLP (Natural Language Processing). Traditional SMILES tokenization falls short because its tokens do not reflect the true nature of molecules, which motivated atom-in-SMILES. This tokenization provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models. It addresses: single-step retrosynthesis, molecular property prediction, normalized repetition rate, the fingerprint nature of AIS, single-token repetition (rep-l), and input-output equivalent mapping.
The accuracy of molecular property prediction depends on the quality of chemical language models, and molecular structures are essential to researchers developing new drugs for the treatment of infectious diseases. This importance of chemical language models to drug discovery makes AIS relevant to Ersilia.
The code is well documented and I was able to set it up using Google Colab, doing the following:
pip install git+https://github.com/snu-lcbc/atom-in-SMILES
and this repository.txt shows I've successfully cloned the repository. I then ran the following in Colab:
!pwd
!pip3 install selfies
!pip3 install --upgrade deepsmiles
!pip3 install SmilesPE > SmilesPE.txt
!pip3 install seaborn==0.12.2 > seaborn.txt
!pip3 install rdkit
!pip3 install atomInSmiles > ais.txt
The outputs are: SmilesPE, deepsmiles, seaborn, atomsInSmiles
This model has various applications, such as single-step retrosynthesis, molecular property prediction, normalized repetition rate, the fingerprint nature of AIS, single-token repetition (rep-l), and input-output equivalent mapping. I will be working on implementing the normalized repetition rate, which covers natural products, drugs, metal complexes, lipids, steroids, and isomers. To achieve this, I used the Python code shown below:
# import the necessary libraries
import codecs
import tarfile
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from rdkit import Chem
import atomInSmiles
import selfies as sf
import deepsmiles
from SmilesPE.tokenizer import SPE_Tokenizer
from SmilesPE.tokenizer import atomwise_tokenizer
sns.set_theme()
def smiles_tokenizer(smi):
    # Tokenize a SMILES molecule or reaction
    import re
    pattern = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    regex = re.compile(pattern)
    tokens = [token for token in regex.findall(smi)]
    assert smi == ''.join(tokens), f"{smi=}\t {''.join(tokens)=}"
    return ' '.join(tokens)
def get_rep(sent, l=300):
    cnt = 0
    for i, w in enumerate(sent):
        if w in sent[max(i - l, 0):i]:
            cnt += 1
    return cnt
#repld = get_rep(test)
#repld
print("Extracting files...")
with tarfile.open('/datum.tar.gz') as tarf:
    tarf.extractall('data')
data_files = {
    'data/steroids_final.data': 'Steroids',
    'data/metals_final.data': 'Metal complexes',
    'data/fda_final.data': 'FDA approved drugs',
    'data/lipids_final.data': 'Lipids',
    'data/naturals_final.data': 'Natural products',
    'data/isomer.data': 'Isomers of octane',
}
def create_catplot(data_files):
    for csv_file, subtitle in data_files.items():
        # Load data
        print(csv_file)
        df = pd.read_csv(csv_file, sep='\t', header=None)
        df.columns = ['Token types', 'Repetition', 'Normalized repetition', 'Length', 'Unique tokens']
        # Create catplot
        catplot = sns.catplot(
            data=df, x="Token types", y="Normalized repetition", hue="Unique tokens",
            native_scale=True, zorder=1
        )
        catplot.set(ylim=(-0.03, 1.0))
        catplot.set_xticklabels(rotation=90)
        catplot.set_xlabels("Token types", fontsize=14)
        catplot.set_ylabels("Normalized repetition", fontsize=14)
        # Set font size for hue legend
        # catplot.ax.legend(title="Unique tokens", fontsize=14)
        catplot.ax.tick_params(axis='x', labelsize=14)
        catplot.ax.tick_params(axis='y', labelsize=14)
        # Map the Token types to integers
        mapping = {'DeepSMILES': 0, 'SMILES': 1, 'SELFIES': 2, 'AIS': 3, 'SmilesPE': 4}
        df['Token types'] = df['Token types'].map(mapping)
        # Compute mean and standard deviation for each Token type
        mean_vals = df.groupby(['Token types'])['Normalized repetition'].mean()
        std_vals = df.groupby(['Token types'])['Normalized repetition'].std()
        # Plot the mean values and error bars
        for i, (mean_val, std_val) in enumerate(zip(mean_vals, std_vals)):
            x_pos = i        # the x position of the horizontal line
            y_pos = mean_val # the y position of the horizontal line
            plt.plot([x_pos + 0.2, x_pos + 0.4], [y_pos, y_pos], color='black', linestyle='-', linewidth=1)
            plt.plot([x_pos + 0.3, x_pos + 0.3], [y_pos - std_val, y_pos + std_val], linestyle=':', color='black', linewidth=1)
        # Add title and adjust margins
        catplot.fig.suptitle(subtitle, fontsize=14, fontweight='bold')
        plt.subplots_adjust(top=0.93, bottom=0.3)
        plt.gcf().set_size_inches(6, 6)
        # Save the plot
        plt.savefig(csv_file[:-5] + 'Ho.png')
        # Close the plot to free up memory
        # plt.close()

# Example usage
create_catplot(data_files)
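To sanity-check the two helpers above, here is a standalone version (re-declared so it runs on its own, and simplified to return the token list) that computes the normalized repetition rate of aspirin's SMILES:

```python
import re

# Same atomwise SMILES pattern as in the listing above.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def smiles_tokenizer(smi):
    """Split a SMILES string into atomwise tokens."""
    tokens = SMILES_PATTERN.findall(smi)
    assert smi == "".join(tokens)  # no characters lost
    return tokens

def get_rep(sent, l=300):
    """Count tokens already seen within the previous `l` positions."""
    return sum(1 for i, w in enumerate(sent) if w in sent[max(i - l, 0):i])

tokens = smiles_tokenizer("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(tokens)
print(get_rep(tokens) / len(tokens))  # normalized repetition: 14/21 ~ 0.667
```

A higher ratio means the token stream is more repetitive, which is the quantity compared across DeepSMILES, SMILES, SELFIES, AIS and SmilesPE in the plots.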
The distributions show the unique characteristics of the tokenization schemes on representative datasets designed to test different facets of molecular structure: coordination compounds and ligands (metal complexes), ring structures and functional groups (steroids), long-chain formations (phospholipids, ionizable lipids), and complex and diverse structures (natural products).
To avoid duplicating models, I checked the list of models pending incorporation and searched it for the 3 models I suggested; none was found in the list. Hence, all 3 suggested models are new.
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application