ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Ajoke Yusuf #842

Closed Ajoke23 closed 8 months ago

Ajoke23 commented 9 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

Ajoke23 commented 9 months ago

WEEK 1 DAY 1 (3rd October 2023)

DAY 2 (4th October, 2023)

DAY 3 (5th October 2023)

DAY 4 (6th October 2023)

DAY 5 (7th October 2023)

DAY 6 (8th October 2023)

DhanshreeA commented 9 months ago

Hi @Ajoke23 thank you for the updates. I see that some items from the week 1 tasks are still pending. Please tell us if you'd like any support in completing them.

Ajoke23 commented 9 months ago

> Hi @Ajoke23 thank you for the updates. I see that some items from the week 1 tasks are still pending. Please tell us if you'd like any support in completing them.

Yes, I need support. I am finding it hard to fetch model eos3b5e because I keep getting connection errors. I asked on the Slack channels and tried all the suggestions people gave, but it still isn't working. I also found related posts among the issues in Ersilia's repository, but none of those suggestions worked either. I am still trying my best to figure it out and would appreciate any help from you.

leilayesufu commented 9 months ago

Hi, have you been able to fetch it?

Ajoke23 commented 9 months ago

MOTIVATION STATEMENT

I'm Ajoke Yusuf, a Data Scientist, Machine Learning enthusiast, and SDG 3 advocate. I'm a resourceful, goal-oriented individual with strong analytical and problem-solving skills and an unending quest for knowledge. I pride myself on being a fast learner with well-honed research skills. After receiving the Outreachy email, my first aim before choosing a project was to find one whose aim and mission align with my goals and career objectives as an impact maker and SDG 3 advocate.

I went through each of the projects and I came across Ersilia's project whose mission statement is:

"To equip laboratories in Low and Middle Income Countries with state of the art AI/ML tools for infectious and neglected disease research."

As an Engineering graduate living in Nigeria, I developed an interest in the biomedical field because of the increasing mortality from infectious diseases in Nigeria and sub-Saharan Africa. According to UNICEF (United Nations International Children's Emergency Fund):

infectious disease is the major cause of the mortality rate in children ≤ 5 years

This was cited from here.

Research hosted by the NIH's National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI) confirms that:

"The infrastructure and level of support for surveillance, research, and training on emerging infectious diseases
in Africa are extremely limited".

Link here

As a Data Scientist skilled in Python and Machine Learning, I possess strong analytical and research skills. I believe that contributing to this project will help me gain knowledge and technical skills for advancing and improving health research in Nigeria.

If accepted for the 3-month internship, I'll commit myself to bringing suggestions, conducting research, and collaborating with the Ersilia team while learning and honing skills in Artificial Intelligence and Machine Learning. The internship will strengthen my research and problem-solving skills, which will be useful in the long run for advancing technology in the health sector and making a sustainable impact on health research in Nigeria.

As a young lady living in Nigeria, an underdeveloped and low-income country, I have experienced the challenges of accessing tools for prevalent infectious diseases in my community and in the country at large.

After the internship, I plan to use the skills gained to improve and sustain health research tools, address prevalent health issues in Nigeria, and reduce the mortality rate caused by infectious diseases, leaving a long-lasting impact on the health sector in my community, Nigeria, sub-Saharan Africa, and eventually the world. I'll also continue to contribute my quota to the further success of Ersilia's project.

Ajoke23 commented 9 months ago

WEEK 2

Task 1 - SELECT A MODEL FROM THE SUGGESTED LIST

Day 7 (9th October, 2023)

I selected the STOUT (SMILES to IUPAC) model. My reasons for choosing this model are:

A. INTEREST IN THE APPLICATION: As a science student in high school, I always had difficulty with the nomenclature of chemical compounds, so seeing a Machine Learning model that can do the naming immediately sparked my interest. In the health sector, IUPAC names help communicate the structure and properties of potential drugs, aid drug development, and support understanding of a drug's mechanism of action and metabolism in the body. As an SDG 3 advocate and problem solver, I want to delve deeper into how such a model is built, so that I can apply what I learn to infectious disease and research sustainability problems in Nigeria's health sector and, eventually, help sustain scientific leadership among researchers.

B. ML ALGORITHMS USED: The journal article provided in the repository states that the model uses a deep learning method, specifically NMT (Neural Machine Translation), following the implementation of Google's NMT models, for SMILES to IUPAC name translation. I want to understand the reasoning behind this implementation.

C. GOAL TO BE ACHIEVED: As a machine learning enthusiast, data scientist, problem solver, and SDG 3 advocate, working with this model will give me a deeper understanding and the technical knowledge to execute tasks and solve problems more easily. Knowledge of NMT will make it easier to collaborate on and build models that serve as tools for researchers working on infectious disease problems in Nigeria and globally, thereby helping to reduce the mortality rate of infectious diseases in Nigeria (a low-income country).

Task 2 - INSTALL THE MODEL TO YOUR SYSTEM

Day 8 (10th October, 2023)

I followed the installation instructions on the STOUT model GitHub repository.

Day 9 (11th October,2023) - Day 15 (17th October, 2023)

Day 16 (18th October, 2023)

    from STOUT import translate_forward, translate_reverse

    # SMILES to IUPAC name translation
    SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    IUPAC_name = translate_forward(SMILES)
    print("IUPAC name of " + SMILES + " is: " + IUPAC_name)

    # IUPAC name to SMILES translation
    IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
    SMILES = translate_reverse(IUPAC_name)
    print("SMILES of " + IUPAC_name + " is: " + SMILES)

which gave the following [output.txt](https://github.com/ersilia-os/ersilia/files/13169158/output.txt)

### **TASK 3 - RUN PREDICTION FOR THE EML PROVIDED**
DAY 17 (19 October, 2023)
- I opened this [link](https://raw.githubusercontent.com/ersilia-os/ersilia/master/notebooks/eml_canonical.csv), right-clicked, and selected 'Save as' to download the file. The downloaded file is [eml_canonical.csv](https://github.com/ersilia-os/ersilia/files/13170404/eml_canonical.csv).
- To run the prediction for the Essential Medicines List (EML), I used the following Python code in Google Colab:

      # Upload eml_canonical.csv from the local machine (Google Colab)
      from google.colab import files
      uploaded = files.upload()

      # Importing necessary libraries
      import io
      import pandas as pd
      from STOUT import translate_forward

      # Reading the file into a dataframe
      df = pd.read_csv(io.BytesIO(uploaded['eml_canonical.csv']))
      print(df)

      # Selecting the first 40 rows (.copy() avoids SettingWithCopyWarning on assignment)
      df = df.head(40).copy()

      # Function to translate SMILES to IUPAC
      def smiles_to_iupac(smiles):
          return translate_forward(smiles)

      # Function to translate canonical SMILES to IUPAC
      def can_smiles_to_iupac(can_smiles):
          return translate_forward(can_smiles)

      # Creating new columns for the IUPAC names with .apply
      # (each call runs the model once per row, so assign only once)
      df['smiles_iupac'] = df['smiles'].apply(smiles_to_iupac)
      df['can_smiles_iupac'] = df['can_smiles'].apply(can_smiles_to_iupac)

      # Creating the smiles_iupac dataframe
      smiles_iupac = df[['drugs', 'smiles', 'smiles_iupac']].copy()
      smiles_iupac

      # Creating the can_smiles_iupac dataframe
      can_smiles_iupac = df[['drugs', 'can_smiles', 'can_smiles_iupac']].copy()
      can_smiles_iupac

      # Saving these dataframes to separate CSV files to show the output
      smiles_iupac.to_csv('smiles_iupac.csv', index=False)
      can_smiles_iupac.to_csv('can_smiles_iupac.csv', index=False)

Because of the long running time for a large volume of data on Google Colab, I limited the prediction to the first 40 molecules of the eml_canonical dataset.
The output of the code is:
[smiles_iupac.csv](https://github.com/ersilia-os/ersilia/files/13170854/smiles_iupac.csv)
[can_smiles_iupac.csv](https://github.com/ersilia-os/ersilia/files/13170855/can_smiles_iupac.csv)
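
The batch step above can be sketched without Colab or STOUT installed. This is a minimal illustration, assuming any translator callable with the same shape as `translate_forward` (one SMILES string in, one name out). The names `translate_rows` and `fake_translate` are hypothetical and exist only for this example; `fake_translate` does not produce real IUPAC names. The sketch records a failed translation as `None` instead of aborting, which may help when a long run hits one bad input.

```python
import csv
import io


def translate_rows(rows, source_col, target_col, translate, limit=None):
    """Return new row dicts with target_col set to translate(row[source_col]).

    rows: iterable of dicts (e.g. from csv.DictReader).
    translate: any callable taking one SMILES string and returning a name.
    limit: optional number of leading rows to process (e.g. 40).
    """
    out = []
    for i, row in enumerate(rows):
        if limit is not None and i >= limit:
            break
        new_row = dict(row)
        try:
            new_row[target_col] = translate(row[source_col])
        except Exception:
            # A failed translation is recorded as None instead of stopping the run.
            new_row[target_col] = None
        out.append(new_row)
    return out


def fake_translate(smiles):
    """Hypothetical stand-in for STOUT's translate_forward (illustration only)."""
    if not smiles:
        raise ValueError("empty SMILES")
    return "name-of-" + smiles


# Made-up three-row CSV; the second row has an empty SMILES on purpose.
demo_csv = "drugs,smiles\nacetic acid,CC(O)=O\nbroken,\nethanol,CCO\n"
rows = list(csv.DictReader(io.StringIO(demo_csv)))
result = translate_rows(rows, "smiles", "smiles_iupac", fake_translate, limit=2)
for row in result:
    print(row)
```

Passing the real `translate_forward` in place of `fake_translate` should give the same two-column result as the `.apply` version, one row at a time.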

### **Task 4- Compare results with the Ersilia Model Hub implementation!**
Day 18 (20th October, 2023)

- I started by searching for the STOUT model identifier on the [Ersilia Model Hub](https://www.ersilia.io/model-hub).
- After finding "STOUT: SMILES to IUPAC name translator" on the [Ersilia Model Hub](https://www.ersilia.io/model-hub), I clicked on the [GitHub](https://github.com/ersilia-os/eos4se9) button.
- The STOUT model has the EOS model ID `eos4se9` and the slug `smiles2iupac`.
- I used the following commands to fetch and serve the model and to run the prediction:

      ersilia -v fetch eos4se9
      ersilia -v serve eos4se9
      ersilia -v api run -i smiles_iupac.csv -o smilesoutput.csv

I successfully fetched and [served](https://github.com/ersilia-os/ersilia/files/13184630/modelserve.log) the model, but after running the prediction I noticed that the `iupacs_names` column was empty, i.e. there was no IUPAC name translation for any SMILES input.
Output file: [smilesoutput.csv](https://github.com/ersilia-os/ersilia/files/13184672/smilesoutput.csv)
- To find where the problem came from, I ran the model with a single input string using the command `ersilia -v api run -i "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"`. The output is shown below:

      {
        "input": {
          "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
          "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
          "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
        },
        "output": {
          "outcome": [null]
        }
      }

I was expecting to get an IUPAC name, but I got a null value as the outcome.
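
An empty prediction column like this one can be caught by a quick check on the output file before any comparison is attempted. The sketch below is only an illustration: the column name `iupacs_names` comes from the output described above, while the function name `count_empty_predictions` and the two sample rows are made up for the example.

```python
import csv
import io


def count_empty_predictions(csv_text, column="iupacs_names"):
    """Return (empty_rows, total_rows) for one column of a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    empty = 0
    for row in reader:
        total += 1
        value = (row.get(column) or "").strip()
        # Treat blanks and textual null markers as missing predictions.
        if value == "" or value.lower() in {"none", "null", "nan"}:
            empty += 1
    return empty, total


# Made-up sample mimicking an output file with one blank prediction.
sample = "key,input,iupacs_names\nk1,CC(O)=O,\nk2,CCO,ethanol\n"
empty, total = count_empty_predictions(sample)
print(f"{empty} of {total} rows have no prediction")
```

For a real file, the text could be read with `open("smilesoutput.csv").read()` and passed to the same function.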

Day 19 (21st October, 2023)

- While debugging, I noticed that another contributor had faced the same problem, and I saw @HellenNamulinda's [suggestion](https://github.com/ersilia-os/ersilia/issues/821#issuecomment-1759694897) on the issue, so I tried fetching the model from GitHub by adding the `--from_github` flag to the command.
- Command used: `ersilia -v fetch eos4se9 --from_github > eos4se9_fetch_github.log 2>&1`. The output can be found in [eos4se9_fetch_github.log](https://github.com/ersilia-os/ersilia/files/13187659/eos4se9_fetch_github.log), which shows that I successfully fetched the model.
- I then ran a prediction again with the same input string as before, using `ersilia -v api run -i 'Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1'`. The output is below, and the log file can be found in [input_outcome.log](https://github.com/ersilia-os/ersilia/files/13187950/input_outcome.log).

      {
        "input": {
          "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
          "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
          "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
        },
        "output": {
          "outcome": [
            "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
          ]
        }
      }

- I tried another input using this command: `ersilia -v api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"`
Output:

      {
        "input": {
          "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
          "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
          "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
        },
        "output": {
          "outcome": [
            "5-[(5-nitro-1,3-thiazol-2-yl)sulfanyl]-1,3,4-thiadiazol-2-amine"
          ]
        }
      }
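
JSON results shaped like the ones above can be checked programmatically rather than by eye. This is a minimal sketch, assuming the output is either a single object or a list of objects with the `input` and `output.outcome` fields shown in this thread; the function name `extract_outcomes` is hypothetical, and the real CLI output may contain other fields.

```python
import json


def extract_outcomes(raw_json):
    """Map each input SMILES to its first outcome value (None if missing)."""
    data = json.loads(raw_json)
    records = data if isinstance(data, list) else [data]
    results = {}
    for record in records:
        smiles = record["input"]["input"]
        # A missing or null outcome list is treated as a single None.
        outcome = record.get("output", {}).get("outcome") or [None]
        results[smiles] = outcome[0]
    return results


# Sample rebuilt from the second result in this thread.
sample = json.dumps({
    "input": {
        "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
        "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
        "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
    },
    "output": {
        "outcome": ["5-[(5-nitro-1,3-thiazol-2-yl)sulfanyl]-1,3,4-thiadiazol-2-amine"]
    },
})
outcomes = extract_outcomes(sample)
failed = [smiles for smiles, name in outcomes.items() if name is None]
print(outcomes)
print("inputs without a prediction:", failed)
```

A `null` outcome, as in the first failed run, would show up in the `failed` list.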

**Summary**
I used the commands below to run the Ersilia STOUT prediction on the SMILES of the first 40 molecules in the EML dataset:

    ersilia -v fetch eos4se9 --from_github > eos4se9_fetch_github.log 2>&1
    ersilia -v serve eos4se9 > eos4se9_serve_model.log 2>&1
    ersilia -v api run -i smiles_iupac.csv -o smiles_output.csv

[ersilia_smiles_output](https://github.com/ersilia-os/ersilia/files/13189450/smiles_output.csv) - Ersilia STOUT prediction of SMILES to IUPAC names

### **COMPARISON OF SMILES OUTPUT PREDICTION USING ERSILIA AND STOUT PREDICTION**

ERSILIA prediction output: [smiles_output.csv](https://github.com/ersilia-os/ersilia/files/13189683/smiles_output.csv)
STOUT prediction output: [smiles_iupac.csv](https://github.com/ersilia-os/ersilia/files/13189687/smiles_iupac.csv)

Since I had two CSV files, one with the Ersilia prediction output and one with the STOUT prediction output, I merged the two datasets with Python and selected the columns needed for the comparison, using the code below:

    # Importing necessary libraries
    import pandas as pd

    # Loading the two datasets:
    # smiles_iupac is the STOUT prediction, smiles_output is the Ersilia prediction
    smiles_iupac = pd.read_csv(r"\\wsl.localhost\Ubuntu\home\ajoke\smiles_iupac.csv")
    smiles_output = pd.read_csv(r"\\wsl.localhost\Ubuntu\home\ajoke\smiles_output.csv")

    # Merging the datasets on the 'smiles' and 'input' columns
    merged_dataset = smiles_iupac.merge(smiles_output, left_on='smiles', right_on='input')

    # Renaming the columns as specified
    merged_dataset.rename(columns={'input': 'input/smiles', 'iupacs_names': 'Ersilia STOUT prediction', 'smiles_iupac': 'STOUT prediction'}, inplace=True)

    # Keeping only the desired columns
    merged_dataset = merged_dataset[['Ersilia STOUT prediction', 'STOUT prediction', 'input/smiles', 'drugs']]
    merged_dataset

    # Saving the dataset to a new CSV file
    merged_dataset.to_csv('comparison_dataset.csv', index=False)

[merged_output](https://github.com/ersilia-os/ersilia/files/13196354/comparison_dataset.csv) - Output of the merged dataset in CSV format. The same data is shown in the table below.
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40">

<body link="#0563C1" vlink="#954F72">

Ersilia STOUT prediction | STOUT prediction | input/smiles | drugs
-- | -- | -- | --
[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol | [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol | Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 | abacavir
(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol | (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol | C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5 | abiraterone
N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide | N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide | CC(=O)Nc1sc(nn1)[S](N)(=O)=O | acetazolamide
aceticacid | aceticacid | CC(O)=O | acetic acid
(2R)-2-acetamido-3-sulfanylpropanoicacid | (2R)-2-acetamido-3-sulfanylpropanoicacid | CC(=O)N[C@@H](CS)C(O)=O | acetylcysteine
2-acetyloxybenzoicacid | 2-acetyloxybenzoicacid | CC(=O)Oc1ccccc1C(O)=O | acetylsalicylic acid
2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one | 2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one | NC1=NC(=O)c2ncn(COCCO)c2N1 | aciclovir
2-[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]oxy-1,1-dithiophen-2-ylethanol | [(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]2-hydroxy-2,2-dithiophen-2-ylacetate | OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1 | aclidinium
(E)-N-[6-[[(3-chloro-4-fluorocyclohexa-1,4-dien-1-yl)amino]methylidene]-3-[(3S)-oxolan-3-yl]oxycyclopenta[d]pyrimidin-2-yl]-4-(dimethylamino)but-2-enamide | (E)-N-[4-(3-chloro-4-fluoroanilino)-7-[(3S)-oxolan-3-yl]oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide | CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1 | afatinib
methylN-(6-propylsulfanyl-1H-benzimidazol-2-yl)carbamate | methylN-(6-propylsulfanyl-1H-benzimidazol-2-yl)carbamate | CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 | albendazole
1,2-dihydropyrazolo[3,4-d]pyrimidin-4-one | 1,2-dihydropyrazolo[3,4-d]pyrimidin-4-one | O=C1N=CN=C2NNC=C12 | allopurinol
5-acetamido-2,4,6-triiodo-3-(1-oxoethylamino)cyclohexa-4,6-diene-1-carboxylicacid | 3,5-diacetamido-2,4,6-triiodobenzoicacid | CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(O)=O)c1I | amidotrizoate
(2S)-N-[(1R,2R,3R,5S,6R)-5-amino-2-[(2R,3R,4R,5R,6R)-3-amino-4,5,6-trihydroxyoxan-2-yl]oxy-3-[(2R,3S,4R,5R)-5-amino-1,3,4,6-tetrahydroxyhexan-2-yl]oxy-1-hydroxyoxetan-6-yl]-2-hydroxy-4-(methylamino)butanamide | (2S)-4-amino-N-[(1R,2S,3S,4R,5S)-5-amino-2-[(2S,3R,4S,5S,6R)-4-amino-3,5-dihydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-[(2R,3R,4S,5S,6R)-6-(aminomethyl)-3,4,5-trihydroxyoxan-2-yl]oxy-3-hydroxycyclohexyl]-2-hydroxybutanamide | NCC[C@H](O)C(=O)N[C@@H]1C[C@H](N)[C@@H](O[C@H]2O[C@H](CN)[C@@H](O)[C@H](O)[C@H]2O)[C@H](O)[C@H]1O[C@H]3O[C@H](CO)[C@@H](O)[C@H](N)[C@H]3O | amikacin
3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide | 3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide | NC(N)=NC(=O)c1nc(Cl)c(N)nc1N | amiloride
2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one | (2-butyl-1-benzofuran-3-yl)-[4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl]methanone | CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3 | amiodarone
N,N-dimethyl-3-(2-tricyclo[9.4.0.03,8]pentadeca-1(15),3,5,7,11,13-hexaenylidene)propan-1-amine | N,N-dimethyl-3-(2-tricyclo[9.4.0.03,8]pentadeca-1(15),3,5,7,11,13-hexaenylidene)propan-1-amine | CN(C)CCC=C1c2ccccc2CCc3ccccc13 | amitriptyline
ethyl2-(2-aminoethoxymethyl)-4-[[3-(2-chlorophenyl)-4-methoxy-4-oxobut-2-en-2-yl]amino]cyclopenta-1,3-diene-1-carboxylate | 3-O-ethyl5-O-methyl2-(2-aminoethoxymethyl)-4-(2-chlorophenyl)-6-methyl-1,4-dihydropyridine-3,5-dicarboxylate | CCOC(=O)C1=C(COCCN)NC(=C(C1c2ccccc2Cl)C(=O)OC)C | amlodipine
12-chloro-7-(diethylaminomethyl)-2,9-diazatricyclo[8.4.0.03,8]tetradeca-1(14),4,6,9,10,13-hexaen-6-ol | 4-[(7-chloroquinolin-4-yl)amino]-2-(diethylaminomethyl)phenol | CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O | amodiaquine
(2S,5R,6R)-5-[[(2R)-2-amino-2-(4-hydroxycyclohexa-1,3,5-trien-1-yl)acetyl]amino]-3,3-dimethyl-8-oxo-4-thia-1,7-diazabicyclo[4.3.0]nonane-2-carboxylicacid;tetrahydrate | (2S,5R,6R)-6-[[(2R)-2-amino-2-(4-hydroxyphenyl)acetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylicacid;trihydrate | O.O.O.CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccc(O)cc3)C(=O)N2[C@H]1C(O)=O | amoxicillin
(1S,3S,5S,7S,9R,10R,13R,18S,19R,20R,21S,22Z,24Z,26Z,28Z,30Z,32Z,34Z,36Z,38Z,40S,41R)-1-[(2S,3S,4R,5S,6R)-4-amino-3,5-dihydroxy-6-[(2R,3S,4R,5S,6R)-5-amino-3,4-dihydroxyoxan-2-yl]oxan-2-yl]oxy-3,5,7,9,10,13,18,41-octahydroxy-19,20,21-trimethyl-15-oxo-4,16,42-trioxatricyclo[37.2.1.03,5]dotetraconta-22,24,26,28,30,32,34,36,38-nonaene-40-carboxylicacid | (1R,3S,5R,6R,9R,11R,15S,16R,17R,18S,19Z,21Z,23Z,25Z,27Z,29Z,31Z,33R,35S,36R,37S)-33-[(2R,3S,4S,5S,6R)-4-amino-3,5-dihydroxy-6-methyloxan-2-yl]oxy-1,3,5,6,9,11,17,37-octahydroxy-15,16,18-trimethyl-13-oxo-14,39-dioxabicyclo[33.3.1]nonatriaconta-19,21,23,25,27,29,31-heptaene-36-carboxylicacid | C[C@H]1O[C@@H](O[C@@H]\2C[C@@H]3O[C@](O)(C[C@@H](O)C[C@@H](O)[C@H](O)CC[C@@H](O)C[C@@H](O)CC(=O)O[C@@H](C)[C@H](C)[C@H](O)[C@@H](C)\C=C/C=C\C=C/C=C\C=C/C=C\C=C2)C[C@H](O)[C@H]3C(O)=O)[C@@H](O)[C@@H](N)[C@@H]1O | amphotericin B
(2S,5R,6R)-7-[[(2R)-2-amino-2-phenylacetyl]amino]-3,3-dimethyl-8-oxo-4-thia-1,7-diazabicyclo[3.3.0]octane-2-carboxylicacid | (2S,5R,6R)-6-[[(2R)-2-amino-2-phenylacetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylicacid | CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccccc3)C(=O)N2[C@H]1C(O)=O | ampicillin
5-[3-(2-cyanopropan-2-yl)-6-(1,2,4-triazol-1-ylmethyl)cyclohexa-2,4-dien-1-yl]-2,2-dimethylbutanenitrile | 2-[3-(2-cyanopropan-2-yl)-5-(1,2,4-triazol-1-ylmethyl)phenyl]-2-methylpropanenitrile | CC(C)(C#N)c1cc(Cn2cncn2)cc(c1)C(C)(C)C#N | anastrozole
(4S,6R,7S,10S,11S,14S,15S,16S,20S,23R,26S)-16,17,23,26-tetrahydroxy-7-(4-hydroxycyclohexa-1,3,5-trien-1-yl)-11-(4-pentoxycyclohexa-2,4,6-trien-1-ylidene)-2-[[(2S,3S,4S)-3,4-dihydroxy-4-(4-hydroxycyclohexa-1,3,5-trien-1-yl)-2-[[(3S,4S,6R)-4-hydroxy-1-[(2S,3S)-3-hydroxybutan-2-yl]-2,6-dioxopiperazine-3-carbonyl]amino]butanoyl]amino]-14-methyl-2,5,12,17,24-hexazapentacyclo[24.2.2.218,21.04,10.06,14]dotriaconta-1(29),18(30),19,21(31),27,32-hexaene-3,11,13-trione | N-[(3S,6S,9S,11R,15S,18S,20R,21R,24S,25S,26S)-6-[(1S,2S)-1,2-dihydroxy-2-(4-hydroxyphenyl)ethyl]-11,20,21,25-tetrahydroxy-3,15-bis[(1S)-1-hydroxyethyl]-26-methyl-2,5,8,14,17,23-hexaoxo-1,4,7,13,16,22-hexazatricyclo[22.3.0.09,13]heptacosan-18-yl]-4-[4-(4-pentoxyphenyl)phenyl]benzamide | CCCCCOc1ccc(cc1)c2ccc(cc2)c3ccc(cc3)C(=O)N[C@H]4C[C@@H](O)[C@@H](O)NC(=O)[C@@H]5[C@@H](O)[C@@H](C)CN5C(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]6C[C@@H](O)CN6C(=O)[C@@H](NC4=O)[C@H](C)O)[C@H](O)[C@@H](O)c7ccc(O)cc7)[C@H](C)O | anidulafungin
14-[amino(oxo)methyl]-12-(4-methoxycyclohepta-2,4,6-trien-1-ylidene)-5-(2-oxopiperidin-1-yl)-4,11,12-triazatricyclo[7.3.2.14,8]pentadeca-1(13),6,8(15),10-tetraen-15-one | 1-(4-methoxyphenyl)-7-oxo-6-[4-(2-oxopiperidin-1-yl)phenyl]-4,5-dihydropyrazolo[3,4-c]pyridine-3-carboxamide | COc1ccc(cc1)n2nc(C(N)=O)c3CCN(C(=O)c23)c4ccc(cc4)N5CCCCC5=O | apixaban
(5R,6S)-5-(4-fluorocyclohepta-1,3,6-trien-1-yl)-6-[(1R)-1-[5,5,5-trifluoro-4-(trifluoromethyl)penta-1,3-dienyl]ethoxy]-1,2,5,6-tetrahydro-1,4,7-oxadiazocin-3-one | 5-[[(2S,3R)-2-[(1R)-1-[3,5-bis(trifluoromethyl)phenyl]ethoxy]-3-(4-fluorophenyl)morpholin-4-yl]methyl]-1,2-dihydro-1,2,4-triazol-3-one | C[C@@H](O[C@@H]1OCCN(CC2=NC(=O)NN2)[C@@H]1c3ccc(F)cc3)c4cc(cc(c4)C(F)(F)F)C(F)(F)F | aprepitant
arsorosooxy(oxo)arsane | oxoarsanyloxyarsenic | O=[As]O[As]=O | arsenic trioxide
(1R,4S,5R,8S,9R,10S,12S,13S)-10-methoxy-5,9-dimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecane | (1R,4S,5R,8S,9R,10S,12R,13R)-10-methoxy-1,5,9-trimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecane | CO[C@H]1O[C@@H]2O[C@@]3(C)CC[C@H]4[C@H](C)CC[C@@H]([C@H]1C)[C@@]24OO3 | artemether
4-oxo-4-[(1S,4R,5S,8S,9R,10S,15S)-4,9,12-trimethyl-11,16,17,18-tetraoxatetracyclo[10.3.2.05,15.08,15]heptadecan-10-yl]butanoicacid | 4-oxo-4-[[(4S,5R,8S,9R,10R,12R,13R)-1,5,9-trimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecan-10-yl]oxy]butanoicacid | C[C@@H]1CC[C@H]2[C@@H](C)[C@H](O[C@@H]3OC4(C)CC[C@@H]1[C@@]23OO4)OC(=O)CCC(O)=O | artesunate
5-(1,2-dihydroxyethyl)-4-methylidenefuran-2,3-diol | 2-(1,2-dihydroxyethyl)-4,5-dihydroxyfuran-3-one | OCC(O)C1OC(=C(O)C1=O)O | ascorbic acid
methylN-[(2S)-1-[2-[(2S,3S)-2-hydroxy-3-[[(2S)-2-(methoxycarbonylamino)-3,3-dimethylbutanoyl]amino]-4-phenylbutyl]-2-[(4-pyridin-2-ylcyclohexa-2,5-dien-1-yl)methyl]hydrazinyl]-3,3-dimethyl-1-oxobutan-2-yl]carbamate | methylN-[(2S)-1-[2-[(2S,3S)-2-hydroxy-3-[[(2S)-2-(methoxycarbonylamino)-3,3-dimethylbutanoyl]amino]-4-phenylbutyl]-2-[(4-pyridin-2-ylphenyl)methyl]hydrazinyl]-3,3-dimethyl-1-oxobutan-2-yl]carbamate | COC(=O)N[C@H](C(=O)N[C@@H](Cc1ccccc1)[C@@H](O)CN(Cc2ccc(cc2)c3ccccn3)NC(=O)[C@@H](NC(=O)OC)C(C)(C)C)C(C)(C)C | atazanavir
(3R,5R)-7-[2-(4-fluorocyclohepta-2,4,6-trien-1-ylidene)-3-phenyl-4-(phenylcarbamoyl)-5-propan-2-yl-3H-pyrrol-1-yl]-3,5-dihydroxyheptanoicacid | (3R,5R)-7-[2-(4-fluorophenyl)-3-phenyl-4-(phenylcarbamoyl)-5-propan-2-ylpyrrol-1-yl]-3,5-dihydroxyheptanoicacid | CC(C)c1n(CC[C@@H](O)C[C@@H](O)CC(O)=O)c(c2ccc(F)cc2)c(c3ccccc3)c1C(=O)Nc4ccccc4 | atorvastatin
5-[3-[1-[(4,5-dimethoxycyclohexa-1,3,5-trien-1-yl)methyl]-7,8-dimethoxy-2-methyl-1,3,4,6-tetrahydroisoquinolin-2-ium-2-yl]propanoyloxy]pentyl13-[4-[2-[4,5,6-trimethoxy-10-(4,5-dimethoxycyclohexa-2,4-dien-1-ylidene)cyclobut-2-en-1-yl]ethyl]-4-methyl-7-oxo-1-oxa-4-azoniacyclononan-1-yl]propanoate | 5-[3-[1-[(3,4-dimethoxyphenyl)methyl]-6,7-dimethoxy-2-methyl-3,4-dihydro-1H-isoquinolin-2-ium-2-yl]propanoyloxy]pentyl3-[1-[(3,4-dimethoxyphenyl)methyl]-6,7-dimethoxy-2-methyl-3,4-dihydro-1H-isoquinolin-2-ium-2-yl]propanoate | COc1ccc(CC2c3cc(OC)c(OC)cc3CC[N+]2(C)CCC(=O)OCCCCCOC(=O)CC[N+]4(C)CCc5cc(OC)c(OC)cc5C4Cc6ccc(OC)c(OC)c6)cc1OC | atracurium
(9-methyl-4-oxa-9-azabicyclo[4.2.1]nonan-5-yl)3-hydroxy-2-phenylpropanoate | (8-methyl-8-azabicyclo[3.2.1]octan-3-yl)3-hydroxy-2-phenylpropanoate | CN1C2CCC1CC(C2)OC(=O)C(CO)c3ccccc3 | atropine
[(2S,5R)-2-(carbamoyl)-7-oxo-1,6-diazabicyclo[3.2.1]octan-6-yl]hydrogensulfate | [(2S,5R)-2-carbamoyl-7-oxo-1,6-diazabicyclo[3.2.1]octan-6-yl]hydrogensulfate | NC(=O)[C@@H]1CC[C@@H]2CN1C(=O)N2OS(O)(=O)=O | avibactam
1-methyl-4-nitro-5-(7H-purin-6-ylsulfanyl)-4H-pyrimidine | 6-(3-methyl-5-nitroimidazol-4-yl)sulfanyl-7H-purine | Cn1cnc(c1Sc2ncnc3nc[nH]c23)[N+]([O-])=O | azathioprine
(2R,3S,5R,6S,7R,9S)-7-[(2R,4R)-5-[[(2R,3R,4R,5R)-4,5-dihydroxy-3-methoxy-5-methyloxan-2-yl]-methylamino]-2-hydroxy-4-methylpentan-2-yl]-9-[(2R,4S,5S,6S)-4-(dimethylamino)-5-hydroxypentan-2-yl]oxy-3-ethyl-6-hydroxy-2,6-dimethyl-4-[(2R,4R,5S,6S)-5-hydroxy-4-methoxy-4-methyloxan-2-yl]oxyoxonan-1-one | (2R,3S,4R,5R,8R,10R,11R,13S,14R)-11-[(2S,3R,4S,6R)-4-(dimethylamino)-3-hydroxy-6-methyloxan-2-yl]oxy-2-ethyl-3,4,10-trihydroxy-13-[(2R,4R,5S,6S)-5-hydroxy-4-methoxy-4,6-dimethyloxan-2-yl]oxy-3,5,6,8,10,12,14-heptamethyl-1-oxa-6-azacyclopentadecan-15-one | CC[C@H]1OC(=O)[C@H](C)[C@@H](O[C@H]2C[C@@](C)(OC)[C@@H](O)[C@H](C)O2)C(C)[C@@H](O[C@@H]3O[C@H](C)C[C@@H]([C@H]3O)N(C)C)[C@](C)(O)C[C@@H](C)CN(C)[C@H](C)[C@@H](O)[C@]1(C)O | azithromycin
barium(2+);sulfate | barium(2+);sulfate | [Ba++].[O-][S]([O-])(=O)=O | barium sulfate
(1S,10S,11S,13S,14S,15S,17S)-18-chloro-14,17-dihydroxy-14-(2-hydroxyacetyl)-13,15,18-trimethyltetracyclo[8.7.1.01,6.011,15]octadeca-2,5-dien-4-one | (8S,9R,10S,11S,13S,14S,16S,17R)-9-chloro-11,17-dihydroxy-17-(2-hydroxyacetyl)-10,13,16-trimethyl-6,7,8,11,12,14,15,16-octahydrocyclopenta[a]phenanthren-3-one | C[C@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(Cl)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO | beclometasone
(1R,2S)-1-(6-bromo-2-methoxyquinolin-3-yl)-4-(dimethylamino)-2-naphthalen-1-yl-1-phenylbutan-2-ol | (1R,2S)-1-(6-bromo-2-methoxyquinolin-3-yl)-4-(dimethylamino)-2-naphthalen-1-yl-1-phenylbutan-2-ol | COC1=NC2=C(C=C(Br)C=C2)C=C1[C@@H](C1=CC=CC=C1)[C@@](O)(CCN(C)C)C1=CC=CC2=C1C=CC=C2 | bedaquiline
11-[bis(2-chloroethyl)amino]-4-methyl-2,4-diazabicyclo[7.3.1]trideca-1(12),2,9-triene-3-carboxylicacid | 4-[5-[bis(2-chloroethyl)amino]-1-methylbenzimidazol-2-yl]butanoicacid | Cn1cCC(O)=O)nc2cc(ccc12)N(CCClCll | bendamustine

</body>

</html>

From the table above, I noticed that some predictions differed, so I decided to validate these compounds online, and I found that PubChem provides IUPAC names. I listed the drugs whose predictions differ, even slightly, between Ersilia and STOUT.
The links below give the IUPAC name of each of those drugs on PubChem:

- [Abacavir](https://pubchem.ncbi.nlm.nih.gov/compound/441300#section=Names-and-Identifiers)
- [Abiraterone](https://pubchem.ncbi.nlm.nih.gov/compound/132971#section=Names-and-Identifiers)
- [Acetazolamide](https://pubchem.ncbi.nlm.nih.gov/compound/1986#section=Names-and-Identifiers)
- [Aclidinium](https://pubchem.ncbi.nlm.nih.gov/compound/11519741#section=Names-and-Identifiers)
- [Afatinib](https://pubchem.ncbi.nlm.nih.gov/#query=afatinib)
- [Amikacin](https://pubchem.ncbi.nlm.nih.gov/#query=amikacin)
- [Amlodipine](https://pubchem.ncbi.nlm.nih.gov/#query=amlodipine)
- [Anidulafungin](https://pubchem.ncbi.nlm.nih.gov/#query=anidulafungin)
- [Apixaban](https://pubchem.ncbi.nlm.nih.gov/#query=apixaban)
- [Ascorbic Acid](https://pubchem.ncbi.nlm.nih.gov/#query=ascorbic%20acid)
- [Bendamustine](https://pubchem.ncbi.nlm.nih.gov/#query=bendamustine)
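
The manual PubChem lookups listed above could also be scripted. The sketch below assumes PubChem's PUG REST interface and its `IUPACName` compound property; the helper names are hypothetical, and the network call is defined but not run in the demo. PubChem may report a name in a different style than either model produces, so a mismatch is better treated as a prompt for manual review than as proof of an error.

```python
import json
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def iupac_lookup_url(drug_name):
    """Build the PUG REST URL asking for the IUPACName property by compound name."""
    quoted = urllib.parse.quote(drug_name, safe="")
    return f"{PUG_REST}/compound/name/{quoted}/property/IUPACName/JSON"


def parse_iupac_names(payload):
    """Pull IUPACName values out of a decoded PropertyTable response."""
    properties = payload.get("PropertyTable", {}).get("Properties", [])
    return [p["IUPACName"] for p in properties if "IUPACName" in p]


def fetch_iupac_names(drug_name, timeout=30):
    """Query PubChem over the network (not exercised in the demo below)."""
    with urllib.request.urlopen(iupac_lookup_url(drug_name), timeout=timeout) as resp:
        return parse_iupac_names(json.load(resp))


url = iupac_lookup_url("ascorbic acid")
print(url)
```

Calling `fetch_iupac_names("abacavir")` with network access should return a list of names that can be compared against both prediction columns.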

### **Observation** 
I noticed that the STOUT prediction is 100% accurate when compared with the PubChem IUPAC names, while the Ersilia prediction is 80% accurate.
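
Percentages like these are easier to reproduce when they are computed from the data. The sketch below is one possible way to do that, assuming exact string equality after removing whitespace and case differences counts as a match; that matching rule and the function names `normalise` and `match_rate` are assumptions, not the method used above. The demo scores two rows from the comparison table, with the STOUT column standing in for the reference.

```python
def normalise(name):
    """Lower-case and drop whitespace so 'acetic acid' equals 'aceticacid'."""
    return "".join(name.split()).lower()


def match_rate(predictions, references):
    """Percentage of drugs whose predicted name equals the reference name.

    Both arguments map drug name -> IUPAC name; only drugs present in
    both mappings are scored.
    """
    shared = [drug for drug in predictions if drug in references]
    if not shared:
        return 0.0
    hits = sum(
        normalise(predictions[drug]) == normalise(references[drug])
        for drug in shared
    )
    return 100.0 * hits / len(shared)


# Two rows copied from the comparison table (Ersilia vs STOUT columns).
ersilia = {
    "acetic acid": "aceticacid",
    "amiloride": "3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide",
}
stout = {
    "acetic acid": "aceticacid",
    "amiloride": "3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide",
}
rate = match_rate(ersilia, stout)
print(f"{rate:.0f}% of the compared names match")
```

Scoring each model against names taken from PubChem, over all 40 rows, would give figures directly comparable to the ones quoted in the observation.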

### **Task 5 - Install and run Docker!**  **Day 20 (22nd October 2023)**
This was my first hands-on experience with Docker, so I took my time to read the [Docker documentation](https://docs.docker.com/get-started/) and understand the installation process, the functionality, and the command-line interface.
These are the steps and commands I used to install Docker on Ubuntu:
1. Updated the local package index using `sudo apt update`
2. Installed the required dependencies
`sudo apt install -y apt-transport-https ca-certificates curl software-properties-common`
After running this, I got the error `E: Sub-process /usr/bin/dpkg returned an error code (1)`.
I checked online and saw that the error could be the result of broken dependencies. I ran the following commands, which fixed the error:

       sudo dpkg --configure -a
       sudo apt --fix-broken install
3. Added Docker's official GPG key, used to verify the packages, and Docker's repository

       curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
       echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

4. Next, I installed Docker Engine using the command `sudo apt install -y docker-ce docker-ce-cli containerd.io`
5. Then, I started and enabled the Docker service

       sudo systemctl start docker
       sudo systemctl enable docker

6. To verify that Docker was installed, I checked the installed version using the command `docker --version`; this [docker_output](https://github.com/ersilia-os/ersilia/files/13196593/docker_output.txt) shows that the installation succeeded.
7. I ran `docker ps` to test Docker and my output was:

       (base) ajoke@DESKTOP-KTJU3QV:~$ docker ps
       CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORT   NAMES

   This shows that I had no running containers at that point.
8. To confirm that Docker works, I ran `docker run hello-world` to pull an image and run a container. The resulting output is below:

(base) ajoke@DESKTOP-KTJU3QV:~$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
  3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
  4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
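The version check from step 6 can also be scripted. Below is a minimal Python sketch, assuming only that the Docker CLI is called `docker` and is on the `PATH`; the helper name `docker_version` is hypothetical and not part of Docker or Ersilia.

```python
import shutil
import subprocess
from typing import Optional


def docker_version(executable: str = "docker") -> Optional[str]:
    """Return the text printed by `docker --version`, or None if Docker is unavailable."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(executable) is None:
        return None
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        # The binary exists but could not be run successfully.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = docker_version()
    print(version if version else "Docker was not found on PATH")
```

Returning `None` instead of raising keeps the check usable in setup scripts that should continue with a clear message when Docker is missing.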

Since this was my first hands-on experience with Docker, I decided to explore further by pulling the STOUT model from Ersilia using Docker. I did the following:
- I pulled the model `eos4se9` using the command `docker pull ersiliaos/eos4se9` and got the following output

(base) ajoke@DESKTOP-KTJU3QV:~$ docker pull ersiliaos/eos4se9
Using default tag: latest
latest: Pulling from ersiliaos/eos4se9
8b91b88d5577: Pull complete
824416e23423: Pull complete
bbe2c2981082: Pull complete
7b6b68d15a5c: Pull complete
71f8f4db541d: Pull complete
4f4fb700ef54: Pull complete
278266b40c52: Pull complete
4298588a86ad: Pull complete
dddca77c0f59: Pull complete
a113a2030c72: Pull complete
0c8571d61669: Pull complete
Digest: sha256:3c0b4dab7a313bfb33c74b45ca378f7d69b0b9dbaaf843357780180910af31ab
Status: Downloaded newer image for ersiliaos/eos4se9:latest
docker.io/ersiliaos/eos4se9:latest

- Then I ran the model `eos4se9` using the command `docker run ersiliaos/eos4se9`
- I ran `docker ps` and got this output

(base) ajoke@DESKTOP-KTJU3QV:~$ docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED          STATUS          PORT     NAMES
d6e3b2b2b5fd   ersiliaos/eos4se9   "sh /root/docker-ent…"   50 seconds ago   Up 15 seconds   83/tcp   pedantic_antonelli
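To check from a script whether the model container is listed, the `docker ps` table can be parsed. This is only a sketch: the function name `parse_docker_ps` is hypothetical, the sample text is copied from the output above, and it assumes that columns are separated by two or more spaces, so a row with an empty column would be misaligned.

```python
import re
from typing import Dict, List


def parse_docker_ps(text: str) -> List[Dict[str, str]]:
    """Turn the table printed by `docker ps` into one dict per container."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # Assumption: columns are separated by runs of two or more spaces,
    # while values such as "50 seconds ago" only contain single spaces.
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        rows.append(dict(zip(header, cells)))
    return rows


# Sample copied from the `docker ps` output shown above.
sample = (
    "CONTAINER ID   IMAGE               COMMAND                  CREATED          STATUS          PORT     NAMES\n"
    'd6e3b2b2b5fd   ersiliaos/eos4se9   "sh /root/docker-ent…"   50 seconds ago   Up 15 seconds   83/tcp   pedantic_antonelli\n'
)
containers = parse_docker_ps(sample)
running_images = [c["IMAGE"] for c in containers]
print(running_images)
```

With the sample above, `running_images` contains `ersiliaos/eos4se9`, which confirms that the model container is up.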

HellenNamulinda commented 9 months ago

Hello @Ajoke23, From your updated comment here, it appears you were able to fetch the model and also completed all the week 1 tasks. Well done :clap:!

Please proceed to week 2 tasks and update each in a comment for faster follow-up. In case you need any help, kindly let us know.

Ajoke23 commented 9 months ago

> Hello @Ajoke23, From your updated comment here, it appears you were able to fetch the model and also completed all the week 1 tasks. Well done :clap:!
>
> Please proceed to week 2 tasks and update each in a comment for faster follow-up. In case you need any help, kindly let us know.

Yes, I was able to fetch the model successfully. Thanks a lot @HellenNamulinda

HellenNamulinda commented 8 months ago

Hello @Ajoke23, You are yet to complete week 2 tasks. Is there any way we can support you?

Ajoke23 commented 8 months ago

> Hello @Ajoke23, You are yet to complete week 2 tasks. Is there any way we can support you?

Hi @HellenNamulinda, I had a lot of challenges doing the week 2 tasks with my model of interest, STOUT. But I have been able to figure it out and I will post an update soon. Thank you.

DhanshreeA commented 8 months ago

Hi @Ajoke23 thank you for the updates. Let us know how it goes! :)

Ajoke23 commented 8 months ago

WEEK 3 - MODEL SUGGESTIONS

TASK 1: FIRST MODEL SUGGESTION

Model Title: A robust deep learning workflow to predict CD8 + T-cell epitopes

Date of Publication: 13th September, 2023
Publication: Genome Medicine
License: Creative Commons
Dataset: Dataset Used
Source Code: https://github.com/ChloeHJ/TRAP
Code: Python and R
Slug: TRAP

DESCRIPTION OF THIS MODEL

The TRAP model uses deep learning to predict immunogenicity and a decision tree classifier to estimate the degree of correctness of the prediction. It uses features such as the amino acids at contact positions, hydrophobicity, large and aromatic side chains, and peptide-MHC binding affinity, which correlate with T-cell recognition, to give robust predictions of CD8+ T-cell epitopes from MHC-I ligands.

RELEVANCE OF THIS MODEL TO ERSILIA

  1. Predicting CD8+ T-cell epitopes is of utmost importance when developing tools and vaccines for diseases that are prevalent in low- and middle-income countries, such as cancer and neglected tropical diseases, which aligns with Ersilia's mission.
  2. Understanding CD8+ T-cell epitopes is useful in diagnosing viral infections and diseases.
  3. The current experimental procedures for identifying CD8+ T-cell epitopes are labor-intensive and expensive. The TRAP model, a computational prediction model, provides an alternative way to screen, predict, and characterize T-cell epitopes and, most importantly, it is cost-effective.
  4. Using the model can support cancer treatment: CD8+ T cells destroy cancerous cells, so predicted epitopes help in developing immunotherapies and adoptive cell therapy, which is a cancer treatment.

CODE IMPLEMENTATION

The repository has a well-detailed installation process. The following steps are needed to install the TRAP model:

TASK 2: SECOND MODEL SUGGESTION

Model Title: Enhancing drug property prediction with dual-channel transfer learning based on molecular fragment

Publication: BMC Bioinformatics
Year of Publication: 2023
Authors: Yue Wu, Xinran Ni, Zhihao Wang & Weike Feng
Slug: FREL
Source Code: https://github.com/Ruowu9944/FREL
Dataset: GraphMVP, MoleculeNet
License: None
Code: Python

DESCRIPTION OF THIS MODEL

The model is a neural network approach called FRagment-based dual-channEL pretraining (FREL), which uses generative learning and contrastive learning techniques to achieve intra- and inter-molecular agreement. Molecular fragments provide a deeper understanding of the underlying molecular mechanisms, which will help researchers design drugs tailored to specific diseases and patient populations. The research shows that the learned molecular representations better capture drug property variation and fragment semantics, which provides an insightful relationship between molecular fragments and drug discovery.

RELEVANCE OF THIS MODEL TO ERSILIA

  1. Accurate prediction of molecular properties is useful in drug repositioning, i.e. identifying new uses for existing drugs that are highly effective against infectious diseases, thus saving cost, time, and resources.
  2. Infectious diseases that have been neglected due to resource constraints can now be studied using this model to develop treatments for infectious diseases, viral infections, etc. This aligns with Ersilia's mission of making research accessible to all.
  3. The model will empower researchers to take part in drug discovery, since it is time- and cost-effective.

    CODE IMPLEMENTATION

    The following dependency versions are required:

    numpy             1.21.2
    scikit-learn      1.0.2
    pandas            1.3.4
    python            3.7.11
    torch             1.10.2+cu113
    torch-geometric   2.0.3
    transformers      4.17.0
    rdkit             2020.09.1.0
    ase               3.22.1
    descriptastorus   2.3.0.5
    ogb               1.3.3
    • Install the dataset from here, then prepare it:
      cd datasets
      python molecule_preparation.py
    • Pre-train the classification model:
      cd src
      python pretrain_cls.py --dropout_ratio=0
    • Pre-train the regression model:
      python pretrain_reg.py --dropout_ratio=0
    • Fine-tune the classification model:
      python finetune_cls.py --dropout_ratio=0.5 --dataset=bace
    • Fine-tune the regression model:
      python finetune_reg.py --dropout_ratio=0.5 --dataset=esol
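Before running the training commands, the dependency versions listed above can be compared against the active environment. The sketch below is an illustration, not part of the FREL repository: the helper names are hypothetical, and it assumes that each listed name matches the installed distribution name, which may not hold for every entry (for example `rdkit`), so only a subset of the pins is included. `importlib.metadata` requires Python 3.8 or later, so the sketch falls back to the `importlib_metadata` backport on the pinned Python 3.7.

```python
from typing import Dict, Optional

try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    import importlib_metadata as metadata

# Subset of the pins listed above; distribution names are assumed.
PINNED = {
    "numpy": "1.21.2",
    "scikit-learn": "1.0.2",
    "pandas": "1.3.4",
    "torch": "1.10.2+cu113",
    "torch-geometric": "2.0.3",
    "transformers": "4.17.0",
    "ase": "3.22.1",
    "ogb": "1.3.3",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each package that differs from its pin to the installed version (None if missing)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches(PINNED).items():
        print(f"{name}: wanted {PINNED[name]}, found {found or 'not installed'}")
```

An empty result from `find_mismatches(PINNED)` means every checked package matches its pin exactly.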

TASK 3: THIRD MODEL SUGGESTION

Model Title: Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

Date of Publication: 29th May, 2023
Publication: Journal of Cheminformatics
Authors: Umit V. Ucak, Islambek Ashyrmamatov & Juyoung Lee
Dataset: data
Source Code: https://github.com/snu-lcbc/atom-in-SMILES
Slug: AIS
Code: Python
License: CC BY-SA 4.0

DESCRIPTION

Atom-in-SMILES (AIS) is a tokenization scheme; tokenization is a preprocessing step in NLP (Natural Language Processing). Traditional SMILES tokens fall short in accuracy because they do not reflect the true nature of the atoms in a molecule, which gave rise to Atom-in-SMILES. This tokenization provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models. It addresses the following tasks and analyses: single-step retrosynthesis, molecular property prediction, normalized repetition rate, fingerprint nature of AIS, single-token repetition (rep-l), and input-output equivalent mapping.

RELEVANCE TO ERSILIA

The accuracy of molecular property prediction depends on the quality of chemical language models. Molecular structures are useful to researchers when developing new drugs for the treatment of infectious diseases. The relevance of chemical language models to drug discovery makes this model relevant to Ersilia.

CODE IMPLEMENTATION

The code is well documented and I was able to set it up using Google Colab, doing the following:

IMPLEMENTATION

This model has various implementations, such as: single-step retrosynthesis, molecular property prediction, normalized repetition rate, fingerprint nature of AIS, single-token repetition (rep-l), and input-output equivalent mapping. I worked on implementing the normalized repetition rate, which is computed for natural products, drugs, metal complexes, lipids, steroids, and isomers. To achieve this, I used the Python code shown below:

# Import the necessary libraries
import codecs
import tarfile
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from rdkit import Chem
import atomInSmiles
import selfies as sf
import deepsmiles
from SmilesPE.tokenizer import SPE_Tokenizer
from SmilesPE.tokenizer import atomwise_tokenizer
sns.set_theme()
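def smiles_tokenizer(smi):
    # Tokenize a SMILES molecule or reaction into atom-wise tokens
    import re
    # Raw string so the regex escapes are not treated as string escapes
    pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    regex = re.compile(pattern)
    tokens = [token for token in regex.findall(smi)]
    # Every character of the input must be covered by some token
    assert smi == ''.join(tokens), f"{smi=}\t {''.join(tokens)=}"
    # Return the tokens as a single space-separated string
    return ' '.join(tokens)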
def get_rep(sent, l = 300):
    cnt = 0
    for i, w in enumerate(sent):
        if w in sent[max(i - l, 0):i]:
            cnt += 1
    return cnt

#repld = get_rep(test)
#repld
# Extract the dataset archive into the local 'data' directory
print("Extracting files...")
with tarfile.open('/datum.tar.gz') as tarf:
    tarf.extractall('data')

data_files = {
    'data/steroids_final.data': 'Steroids',
    'data/metals_final.data': 'Metal complexes',
    'data/fda_final.data': 'FDA approved drugs',
    'data/lipids_final.data': 'Lipids',
    'data/naturals_final.data': 'Natural products',
    'data/isomer.data': 'Isomers of octane',
}

def create_catplot(data_files):
    for csv_file, subtitle in data_files.items():
        # Load data
        print(csv_file)
        df = pd.read_csv(csv_file, sep='\t', header=None)
        df.columns = ['Token types', 'Repetition', 'Normalized repetition', 'Length', 'Unique tokens']

        # Create catplot
        catplot = sns.catplot(
            data=df, x="Token types", y="Normalized repetition", hue="Unique tokens",
            native_scale=True, zorder=1
        )
        catplot.set(ylim=(-0.03, 1.0))
        catplot.set_xticklabels(rotation=90)
        catplot.set_xlabels("Token types", fontsize=14)
        catplot.set_ylabels("Normalized repetition", fontsize=14)
        # Set font size for hue legend
        #catplot.ax.legend(title="Unique tokens", fontsize=14)
        catplot.ax.tick_params(axis='x', labelsize=14)
        catplot.ax.tick_params(axis='y', labelsize=14)

        # Map the Token types to integers
        mapping = {'DeepSMILES': 0, 'SMILES': 1, 'SELFIES': 2, 'AIS': 3, 'SmilesPE': 4}
        df['Token types'] = df['Token types'].map(mapping)
        # Compute mean and standard deviation for each Token type
        mean_vals = df.groupby(['Token types'])['Normalized repetition'].mean()
        std_vals = df.groupby(['Token types'])['Normalized repetition'].std()
        # Plot the mean values and error bars
        for i, (mean_val, std_val) in enumerate(zip(mean_vals, std_vals)):
            x_pos = i  # the x position of the horizontal line
            y_pos = mean_val  # the y position of the horizontal line
            #color = sns.color_palette()[i]  # the color of the horizontal line
            plt.plot([x_pos + 0.2, x_pos + 0.4], [y_pos, y_pos], color='black', linestyle='-', linewidth=1)
            plt.plot([x_pos + 0.3, x_pos + 0.3], [y_pos - std_val, y_pos + std_val], linestyle=':',color='black', linewidth=1)
        # Add title and adjust margins
        catplot.fig.suptitle(subtitle, fontsize=14, fontweight='bold')
        plt.subplots_adjust(top=0.93, bottom=0.3)
        plt.gcf().set_size_inches(6, 6)
        # Save the plot
        plt.savefig(csv_file[:-5] + 'Ho.png')
        # Close the plot to free up memory
        # plt.close()
# Example usage
create_catplot(data_files)

The output:

[Plots: fda_finalHo, naturals_finalHo, metals_finalHo, steroids_finalHo]

The distributions show the unique characteristics of tokenization schemes on representative datasets, designed to test different facets of molecular structures such as coordination compounds, ligands (metal complexes), ring structures and functional groups (steroids), long-chain formations (phospholipids, ionizable lipids), complex and diverse structures (natural products)
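To make the repetition measure concrete, here is a small self-contained sketch that applies the same atom-wise regular expression and the same counting logic as `get_rep` to two short SMILES strings. Dividing the count by the number of tokens is an assumption about how the "Normalized repetition" column is obtained; the helper names `tokenize` and `normalized_rep` are hypothetical.

```python
import re
from typing import List

# Atom-wise SMILES pattern, the same expression used in smiles_tokenizer.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into atom-wise tokens."""
    return SMILES_PATTERN.findall(smiles)


def get_rep(tokens: List[str], window: int = 300) -> int:
    """Count tokens that already occurred within the previous `window` tokens."""
    count = 0
    for i, token in enumerate(tokens):
        if token in tokens[max(i - window, 0):i]:
            count += 1
    return count


def normalized_rep(tokens: List[str], window: int = 300) -> float:
    # Assumption: normalize the repetition count by the sequence length.
    return get_rep(tokens, window) / len(tokens) if tokens else 0.0


for smiles in ("CCO", "c1ccccc1"):
    tokens = tokenize(smiles)
    print(smiles, tokens, get_rep(tokens), round(normalized_rep(tokens), 2))
```

For ethanol (`CCO`) only the second `C` is a repeat, so the count is 1 out of 3 tokens. For benzene (`c1ccccc1`) six of the eight tokens are repeats, giving 0.75, which illustrates why plain SMILES tokens score high on repetition.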

SUMMARY

To avoid duplicating models, I checked the list of pending models yet to be incorporated and searched it for the 3 models I suggested; none was found in the list. Hence, all 3 suggested models are new.

Ajoke23 commented 8 months ago

Week 4 - Submit the final application in the Outreachy website

GemmaTuron commented 8 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!