Closed samuelmaina closed 1 year ago
My Motivation to work at Ersilia: I hope you are fine. I graduated in December 2022 from Moi University in Kenya with a BSc in Computer Science. I have been fascinated by computers and by how fast and accurate they are (if given correct instructions). If humanity can harness that power, we can make the world a better and more abundant place. I have experienced hard living conditions and severe poverty during my lifetime. I have seen people die in my community due to a lack of medicines and vaccines that are readily available in other parts of the world. The health conditions in Africa and other third world countries need to be acted upon in total seriousness. I have developed a few machine learning models, such as an image classifier and a predictor of buying behavior based on income information and demographics. I have practiced paradigms in Python such as clustering and classification, and different algorithms such as k-means, random forests, etc. I have used common packages such as numpy, pandas, sklearn and many others for networking, drawing graphs and plots, etc. I hope contributing to the Ersilia community will at least open my eyes to what can be done. Ersilia is a huge community with advanced technologies and algorithms, and I hope to learn advanced AI/ML so as to increase my knowledge and skills. I am eager to learn, collaborate and contribute to the Ersilia community during my internship. Thank you for your consideration.
Hi @samuelmaina Welcome to Ersilia, great to have you here! Please let us know which system are you using and whether you had any issues installing Ersilia. When you are done, check this issue https://github.com/ersilia-os/eos81ew/issues/2 and see if the bug Ahmed is encountering is specific to his system or it also happens to you! Please work together to make sure this model is working :)
I am using WSL2 (Windows 10) with Ubuntu 20.04 LTS. I had one error during the installation, which I have raised as a bug in issue https://github.com/ersilia-os/ersilia/issues/630. I was able to run the sample model. I will look into issue https://github.com/ersilia-os/eos81ew/issues/2 and get back to you.
I used the 2 files (run and list_run) provided by @pauline-banye. The model ran successfully and produced the following output: run_output.csv and list_run_output.csv
Thanks @samuelmaina for the tests! Sorry, closed the issue inadvertently
Hello @GemmaTuron, I am running the STOUT model. The Ersilia Model Hub model and the GitHub model give different predictions for the smiles and the can-smiles. I extracted the smiles and can-smiles data from the eml_canonical.csv file provided in the contribution guide, then ran predictions on both sets of data using the STOUT module. I used Python to carry out the steps. I chose the model because it uses deep learning with neural networks to make its predictions. The model was trained with billions of SMILES labelled with their IUPAC names, and it learned to correlate the structure of the SMILES string with the IUPAC name. IUPAC name generation has a lot of algorithmic complexity and a large set of rules, which makes it very hard to code all the rules into program generators. The model has only two functions: translate_forward (which gives the IUPAC name of a SMILES) and translate_reverse (which gives the SMILES for a given IUPAC name). It has a BLEU score of about 90% and a Tanimoto similarity index of more than 0.9 according to those who trained it. It did not give any confidence level as an output when I ran it. Attached files: smiles.csv, can_smiles.csv. The Python code:
from STOUT import translate_forward, translate_reverse
import csv
import json


def read_data_from_file(path):
    # Read a single-column CSV file and return its values as a flat list of strings.
    result = []
    with open(path, 'r', newline='') as file:
        csvreader = csv.reader(file)
        for row in csvreader:
            if row:
                result.append(row[0])
    return result


def write_data_to_json_file(output_file, data):
    with open(output_file, 'w') as f:
        json.dump(data, f)


def separate_smiles_and_can_smiles_into_separate_files():
    # Column 1 of eml_canonical.csv holds the SMILES, column 2 the canonical SMILES.
    with open("smiles.csv", 'w', newline='') as file_1:
        with open("can_smiles.csv", 'w', newline='') as file_2:
            with open("eml_canonical.csv", 'r', newline='') as file_3:
                csvreader = csv.reader(file_3)
                writer_2 = csv.writer(file_2)
                writer_1 = csv.writer(file_1)
                # Skip the header row (assumes the file has one; remove this line if it does not).
                next(csvreader, None)
                for row in csvreader:
                    writer_1.writerow([row[1]])
                    writer_2.writerow([row[2]])


def get_uipac_name_from_smiles(smiles: list, can_smiles: list):
    smiles_uipac_names = []
    can_smiles_uipac_names = []
    # Only the first 10 compounds are translated, since each prediction is slow.
    n = min(10, len(smiles), len(can_smiles))
    for i in range(n):
        smiles_uipac = translate_forward(smiles[i])
        can_smiles_uipac = translate_forward(can_smiles[i])
        smiles_uipac_names.append({
            "smile": smiles[i],
            "UIPAC_name": smiles_uipac
        })
        can_smiles_uipac_names.append({
            "can_smile": can_smiles[i],
            "UIPAC_name": can_smiles_uipac
        })
        print(i+1, "done out of ", n, " currently at ", (i+1)/n * 100, "% done")
    return smiles_uipac_names, can_smiles_uipac_names


# separate the smiles and can-smiles into different csv files from the eml_canonical.csv
separate_smiles_and_can_smiles_into_separate_files()
smiles = read_data_from_file("smiles.csv")
can_smiles = read_data_from_file("can_smiles.csv")
smiles_output, can_smiles_output = get_uipac_name_from_smiles(
    smiles, can_smiles)
write_data_to_json_file("smiles_output.json", smiles_output)
write_data_to_json_file("can_smiles_output.json", can_smiles_output)
Here are the inputs and results from the STOUT repository code (the one I ran locally):
smiles_output.json
[ { "smile": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1", "UIPAC_name": "[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol" }, { "smile": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5", "UIPAC_name": "(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol" }, { "smile": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O", "UIPAC_name": "N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide" }, { "smile": "CC(O)=O", "UIPAC_name": "aceticacid" }, { "smile": "CC(=O)N[C@@H](CS)C(O)=O", "UIPAC_name": "(2R)-2-acetamido-3-sulfanylpropanoicacid" }, { "smile": "CC(=O)Oc1ccccc1C(O)=O", "UIPAC_name": "2-acetyloxybenzoicacid" }, { "smile": "NC1=NC(=O)c2ncn(COCCO)c2N1", "UIPAC_name": "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one" }, { "smile": "OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1", "UIPAC_name": "[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]2-hydroxy-2,2-dithiophen-2-ylacetate" }, { "smile": "CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1", "UIPAC_name": "(E)-N-[4-(3-chloro-4-fluoroanilino)-7-[(3S)-oxolan-3-yl]oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide" }, { "smile": "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1", "UIPAC_name": "methylN-(6-propylsulfanyl-1H-benzimidazol-2-yl)carbamate" } ]
can-smiles.json
[ { "can_smile": "Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1", "UIPAC_name": "[(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol" }, { "can_smile": "C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43C)[C@@H]1CC=C2c1cccnc1", "UIPAC_name": "(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol" }, { "can_smile": "CC(=O)Nc1nnc(S(N)(=O)=O)s1", "UIPAC_name": "N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide" }, { "can_smile": "CC(=O)O", "UIPAC_name": "aceticacid" }, { "can_smile": "CC(=O)N[C@@H](CS)C(=O)O", "UIPAC_name": "(2R)-2-acetamido-3-sulfanylpropanoicacid" }, { "can_smile": "CC(=O)Oc1ccccc1C(=O)O", "UIPAC_name": "2-acetyloxybenzoicacid" }, { "can_smile": "Nc1nc(=O)c2ncn(COCCO)c2[nH]1", "UIPAC_name": "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one" }, { "can_smile": "O=C(O[C@H]1C[N+]2(CCCOc3ccccc3)CCC1CC2)C(O)(c1cccs1)c1cccs1", "UIPAC_name": "[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]2-hydroxy-2,2-dithiophen-2-ylacetate" }, { "can_smile": "CN(C)C/C=C/C(=O)Nc1cc2c(Nc3ccc(F)c(Cl)c3)ncnc2cc1O[C@H]1CCOC1", "UIPAC_name": "(E)-N-[4-(3-chloro-4-fluoroanilino)-7-[(3S)-oxolan-3-yl]oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide" }, { "can_smile": "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1", "UIPAC_name": "methylN-(6-propylsulfanyl-1H-benzimidazol-2-yl)carbamate" } ]
Here are the results of the ersilia model:
ersilia_smiles_output.json
[ { "input": { "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N", "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1", "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1" }, "output": { "outcome": [ "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol" ] } }, { "input": { "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N", "input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5", "text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5" }, "output": { "outcome": [ "(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol" ] } }, { "input": { "key": "BZKPWHYZMXOIDC-UHFFFAOYSA-N", "input": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O", "text": "CC(=O)Nc1sc(nn1)[S](N)(=O)=O" }, "output": { "outcome": [ "N-[5-[amino(dioxo)-\u03bb6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide" ] } }, { "input": { "key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N", "input": "CC(O)=O", "text": "CC(O)=O" }, "output": { "outcome": [ "aceticacid" ] } }, { "input": { "key": "PWKSKIMOESPYIA-BYPYZUCNSA-N", "input": "CC(=O)N[C@@H](CS)C(O)=O", "text": "CC(=O)N[C@@H](CS)C(O)=O" }, "output": { "outcome": [ "(2R)-2-acetamido-3-sulfanylpropanoicacid" ] } }, { "input": { "key": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "input": "CC(=O)Oc1ccccc1C(O)=O", "text": "CC(=O)Oc1ccccc1C(O)=O" }, "output": { "outcome": [ "2-acetyloxybenzoicacid" ] } }, { "input": { "key": "MKUXAQIIEYXACX-UHFFFAOYSA-N", "input": "NC1=NC(=O)c2ncn(COCCO)c2N1", "text": "NC1=NC(=O)c2ncn(COCCO)c2N1" }, "output": { "outcome": [ "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one" ] } }, { "input": { "key": "ASMXXROZKSBQIH-VITNCHFBSA-N", "input": "OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1", "text": "OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1" }, "output": { "outcome": [ "2-[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]oxy-1,1-dithiophen-2-ylethanol" ] } }, { "input": { "key": "ULXXDDBFHOBEHA-CWDCEQMOSA-N", 
"input": "CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1", "text": "CN(C)C\\C=C\\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1" }, "output": { "outcome": [ "(E)-N-[6-[[(3-chloro-4-fluorocyclohexa-1,4-dien-1-yl)amino]methylidene]-3-[(3S)-oxolan-3-yl]oxycyclopenta[d]pyrimidin-2-yl]-4-(dimethylamino)but-2-enamide" ] } }, { "input": { "key": "HXHWSAZORRCQMX-UHFFFAOYSA-N", "input": "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1", "text": "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1" }, "output": { "outcome": [ "methylN-(6-propylsulfanyl-1H-benzimidazol-2-yl)carbamate" ] } }, { "input": { "key": "OFCNXPDARWKPPY-UHFFFAOYSA-N", "input": "O=C1N=CN=C2NNC=C12", "text": "O=C1N=CN=C2NNC=C12" }, "output": { "outcome": [ "1,2-dihydropyrazolo[3,4-d]pyrimidin-4-one" ] } }, { "input": { "key": "YVPYQUNUQOZFHG-UHFFFAOYSA-N", "input": "CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(O)=O)c1I", "text": "CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(O)=O)c1I" }, "output": { "outcome": [ "5-acetamido-2,4,6-triiodo-3-(1-oxoethylamino)cyclohexa-4,6-diene-1-carboxylicacid" ] } }, { "input": { "key": "LKCWBDHBTVXHDL-RMDFUYIESA-N", "input": "NCC[C@H](O)C(=O)N[C@@H]1C[C@H](N)[C@@H](O[C@H]2O[C@H](CN)[C@@H](O)[C@H](O)[C@H]2O)[C@H](O)[C@H]1O[C@H]3O[C@H](CO)[C@@H](O)[C@H](N)[C@H]3O", "text": "NCC[C@H](O)C(=O)N[C@@H]1C[C@H](N)[C@@H](O[C@H]2O[C@H](CN)[C@@H](O)[C@H](O)[C@H]2O)[C@H](O)[C@H]1O[C@H]3O[C@H](CO)[C@@H](O)[C@H](N)[C@H]3O" }, "output": { "outcome": [ "(2S)-N-[(1R,2R,3R,5S,6R)-5-amino-2-[(2R,3R,4R,5R,6R)-3-amino-4,5,6-trihydroxyoxan-2-yl]oxy-3-[(2R,3S,4R,5R)-5-amino-1,3,4,6-tetrahydroxyhexan-2-yl]oxy-1-hydroxyoxetan-6-yl]-2-hydroxy-4-(methylamino)butanamide" ] } }, { "input": { "key": "XSDQTOBWRPYKKA-UHFFFAOYSA-N", "input": "NC(N)=NC(=O)c1nc(Cl)c(N)nc1N", "text": "NC(N)=NC(=O)c1nc(Cl)c(N)nc1N" }, "output": { "outcome": [ "3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide" ] } }, { "input": { "key": "IYIKLHRQXLHMJQ-UHFFFAOYSA-N", "input": "CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3", 
"text": "CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3" }, "output": { "outcome": [ "2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one" ] } } ]
ersilia_can_smiles_output:
[ { "input": { "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N", "input": "Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1", "text": "Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1" }, "output": { "outcome": [ "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol" ] } }, { "input": { "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N", "input": "C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43C)[C@@H]1CC=C2c1cccnc1", "text": "C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43C)[C@@H]1CC=C2c1cccnc1" }, "output": { "outcome": [ "(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol" ] } }, { "input": { "key": "BZKPWHYZMXOIDC-UHFFFAOYSA-N", "input": "CC(=O)Nc1nnc(S(N)(=O)=O)s1", "text": "CC(=O)Nc1nnc(S(N)(=O)=O)s1" }, "output": { "outcome": [ "N-[5-[amino(dioxo)-\u03bb6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide" ] } }, { "input": { "key": "QTBSBXVTEAMEQO-UHFFFAOYSA-N", "input": "CC(=O)O", "text": "CC(=O)O" }, "output": { "outcome": [ "aceticacid" ] } }, { "input": { "key": "PWKSKIMOESPYIA-BYPYZUCNSA-N", "input": "CC(=O)N[C@@H](CS)C(=O)O", "text": "CC(=O)N[C@@H](CS)C(=O)O" }, "output": { "outcome": [ "(2R)-2-acetamido-3-sulfanylpropanoicacid" ] } }, { "input": { "key": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "input": "CC(=O)Oc1ccccc1C(=O)O", "text": "CC(=O)Oc1ccccc1C(=O)O" }, "output": { "outcome": [ "2-acetyloxybenzoicacid" ] } }, { "input": { "key": "MKUXAQIIEYXACX-UHFFFAOYSA-N", "input": "Nc1nc(=O)c2ncn(COCCO)c2[nH]1", "text": "Nc1nc(=O)c2ncn(COCCO)c2[nH]1" }, "output": { "outcome": [ "2-amino-9-(2-hydroxyethoxymethyl)-3H-purin-6-one" ] } }, { "input": { "key": "ASMXXROZKSBQIH-VITNCHFBSA-N", "input": "O=C(O[C@H]1C[N+]2(CCCOc3ccccc3)CCC1CC2)C(O)(c1cccs1)c1cccs1", "text": "O=C(O[C@H]1C[N+]2(CCCOc3ccccc3)CCC1CC2)C(O)(c1cccs1)c1cccs1" }, "output": { "outcome": [ "2-[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]oxy-1,1-dithiophen-2-ylethanol" ] } }, { "input": { "key": "ULXXDDBFHOBEHA-CWDCEQMOSA-N", "input": 
"CN(C)C/C=C/C(=O)Nc1cc2c(Nc3ccc(F)c(Cl)c3)ncnc2cc1O[C@H]1CCOC1", "text": "CN(C)C/C=C/C(=O)Nc1cc2c(Nc3ccc(F)c(Cl)c3)ncnc2cc1O[C@H]1CCOC1" }, "output": { "outcome": [ "(E)-N-[4-[(4-chloro-5-fluorocyclohexa-1,3,6-trien-1-yl)amino]-7-[(3S)-oxolan-3-yl]oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide" ] } }, { "input": { "key": "HXHWSAZORRCQMX-UHFFFAOYSA-N", "input": "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1", "text": "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1" }, "output": { "outcome": [ "methylN-(6-propylsulfanyl-1H-benzimidazol-2-yl)carbamate" ] } }, { "input": { "key": "OFCNXPDARWKPPY-UHFFFAOYSA-N", "input": "O=c1ncnc2[nH][nH]cc1-2", "text": "O=c1ncnc2[nH][nH]cc1-2" }, "output": { "outcome": [ "1,2-dihydropyrazolo[3,4-d]pyrimidin-4-one" ] } }, { "input": { "key": "YVPYQUNUQOZFHG-UHFFFAOYSA-N", "input": "CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(=O)O)c1I", "text": "CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(=O)O)c1I" }, "output": { "outcome": [ "5-acetamido-2,4,6-triiodo-3-(1-oxoethylamino)cyclohexa-4,6-diene-1-carboxylicacid" ] } }, { "input": { "key": "LKCWBDHBTVXHDL-RMDFUYIESA-N", "input": "NCC[C@H](O)C(=O)N[C@@H]1C[C@H](N)[C@@H](O[C@H]2O[C@H](CN)[C@@H](O)[C@H](O)[C@H]2O)[C@H](O)[C@H]1O[C@H]1O[C@H](CO)[C@@H](O)[C@H](N)[C@H]1O", "text": "NCC[C@H](O)C(=O)N[C@@H]1C[C@H](N)[C@@H](O[C@H]2O[C@H](CN)[C@@H](O)[C@H](O)[C@H]2O)[C@H](O)[C@H]1O[C@H]1O[C@H](CO)[C@@H](O)[C@H](N)[C@H]1O" }, "output": { "outcome": [ "(2S)-N-[(1R,2R,3R,5S,6R)-5-amino-2-[(2R,3R,4R,5R,6R)-3-amino-4,5,6-trihydroxyoxan-2-yl]oxy-3-[(2R,3S,4R,5R)-5-amino-1,3,4,6-tetrahydroxyhexan-2-yl]oxy-1-hydroxyoxetan-6-yl]-2-hydroxy-4-(methylamino)butanamide" ] } }, { "input": { "key": "XSDQTOBWRPYKKA-UHFFFAOYSA-N", "input": "NC(N)=NC(=O)c1nc(Cl)c(N)nc1N", "text": "NC(N)=NC(=O)c1nc(Cl)c(N)nc1N" }, "output": { "outcome": [ "3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide" ] } }, { "input": { "key": "IYIKLHRQXLHMJQ-UHFFFAOYSA-N", "input": "CCCCc1oc2ccccc2c1C(=O)c1cc(I)c(OCCN(CC)CC)c(I)c1", "text": 
"CCCCc1oc2ccccc2c1C(=O)c1cc(I)c(OCCN(CC)CC)c(I)c1" }, "output": { "outcome": [ "2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one" ] } } ]
Hi @samuelmaina !
Thanks for the good work! Before moving on to the week 3 tasks, can you please provide a bit more explanation of the model? What does it mean: "a BLEU score of about 90% and a Tanimoto similarity index of more than 0.9 according to those who trained it"? And can you tell us whether the output of the model you ran on your system and the output of the model from the Hub are the same, rather than just pasting the whole outcomes? You can choose one or two smiles as an example, perhaps.
Thanks. I mentioned that the Ersilia Hub model was producing different results from the local model, but maybe it slipped past you because of my phrasing. Sorry for the text output, I thought you needed some comparison. The STOUT model is simply used to translate SMILES into their IUPAC names and vice versa. In fact, the model has only two functions for doing exactly this. The model uses a machine translation engine, which is a special form of natural language processing that translates between different languages using neural networks. The BLEU (BiLingual Evaluation Understudy) score measures how close a machine translation is to a reference translation. In general, the reference is not machine generated; a professional translates a piece of text, and the model is scored on how much of its output overlaps with that reference. The model's BLEU score of about 90% means that the IUPAC names it generated for the given SMILES overlapped about 90% with the reference names. A Tanimoto similarity index of 0.9 means that the incorrectly translated SMILES were still about 90% similar to the true compound.
The model does not produce any error rate or confidence level for the input data. The BLEU evaluation was done by the model creators, and they repeat it periodically to check accuracy on new compounds.
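To make the two metrics concrete, here is a minimal sketch of how they can be computed. It is not the evaluation code used by the STOUT authors: the character-level tokens, the single reference, and the function names `simple_bleu` and `tanimoto` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on character tokens (illustrative only)."""
    cand, ref = list(candidate), list(reference)
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(cand_counts.values())
        # Clip each n-gram count by how often it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


print(simple_bleu("aceticacid", "aceticacid"))
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

An identical prediction scores 1.0 on both metrics; the further the predicted name (or the structure it encodes) drifts from the reference, the lower the scores.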
As for the internal implementation: the model first converts the compounds into tokens using a natural language tokenizer. The tokens are classified into atoms and bonds. The tokens are then fed into a neural network, which makes the prediction.
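The tokenization step described above can be illustrated with a small regex-based SMILES tokenizer. This is only a sketch of the general idea and is not STOUT's own tokenizer; the pattern, the `BOND_SYMBOLS` set and both function names are assumptions chosen for the example.

```python
import re

# Bracket atoms first, then two-letter halogens, then single-letter atoms,
# then bonds, branches, ring closures and other SMILES symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)
BOND_SYMBOLS = {"-", "=", "#", ":", "/", "\\"}


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


def split_atoms_and_bonds(tokens):
    """Separate atom tokens from explicit bond tokens."""
    atoms = [t for t in tokens if t[0].isalpha() or t.startswith("[")]
    bonds = [t for t in tokens if t in BOND_SYMBOLS]
    return atoms, bonds


tokens = tokenize("CC(=O)O")  # acetic acid, taken from the outputs above
print(tokens)
print(split_atoms_and_bonds(tokens))
```

A translation model would then map each token to an integer ID before feeding the sequence to the neural network.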
Hi @samuelmaina
Thanks! Whenever you refer to scores etc. it is best to explain them, especially if you are using acronyms, as you don't know whether the reader will know about them or not! Can you do this last check before moving on to the week 3 tasks: can you tell us whether the output of the model you ran on your system and the output of the model from the Hub are the same, rather than just pasting the whole outcomes? You can choose one or two smiles as an example, perhaps. Thanks!
The Ersilia Hub model produced different results from the model that I ran locally. I ran 10 test SMILES through both models, and the IUPAC names differed for several of the samples. I will show the results for 2 SMILES.
SMILES: Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1
Local STOUT: [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol
Ersilia Hub: [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol
SMILES: C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5
Local STOUT: (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
Ersilia Hub: (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol
In these two SMILES, the Ersilia model has minor misplacements in the naming, which produce erroneous IUPAC names. As of now, I can't tell where these errors in the Ersilia model come from.
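A comparison like the one above can be automated rather than read off by eye. The sketch below assumes both sets of results have already been loaded into dictionaries keyed by input SMILES; the helper name `find_mismatches` is hypothetical, and the two sample entries are copied from the outputs pasted earlier in this thread.

```python
def find_mismatches(local, hub):
    """Return (smiles, local_name, hub_name) for every SMILES whose names differ."""
    mismatches = []
    for smiles, local_name in local.items():
        hub_name = hub.get(smiles)  # None if the Hub run has no entry for this input
        if hub_name != local_name:
            mismatches.append((smiles, local_name, hub_name))
    return mismatches


local_names = {
    "CC(O)=O": "aceticacid",
    "CC(=O)Nc1sc(nn1)[S](N)(=O)=O": "N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide",
}
hub_names = {
    "CC(O)=O": "aceticacid",
    "CC(=O)Nc1sc(nn1)[S](N)(=O)=O": "N-[5-[amino(dioxo)-\u03bb6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide",
}

for smiles, local_name, hub_name in find_mismatches(local_names, hub_names):
    print(smiles)
    print("  local:", local_name)
    print("  hub:  ", hub_name)
```

Reporting only the mismatching rows keeps the comparison short enough to paste into an issue.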
Hi @samuelmaina !
This would be due to a small update in the STOUT version. Can you tell us which STOUT version you have locally and which one is in the Hub? You can check by activating the relevant conda environments.
The local model is version 2.0.5, but Ersilia does not show the model version. I tried following the GitHub link in the Ersilia model description, but it points to the current directory of the STOUT repo. I think it's because Ersilia hasn't fetched the latest code from GitHub.
(STOUT) samuelmayna@SAM:~/stout_project$ pip show stout-pypi
Name: STOUT-pypi
Version: 2.0.5
Summary: STOUT V2.0 - Smiles TO iUpac Translator Version 2.0
Home-page: https://github.com/Kohulan/Smiles-TO-iUpac-Translator
Author: Kohulan Rajan
Author-email: kohulan.rajan@uni-jena.de
License: MIT
Location: /home/samuelmayna/miniconda3/envs/STOUT/lib/python3.8/site-packages
Requires: jpype1, pystow, tensorflow, unicodedata2
Required-by:
ersilia model --version:
(ersilia) samuelmayna@SAM:~/stout_project$ ersilia card smiles2iupac
{
"Identifier": "eos4se9",
"Slug": "smiles2iupac",
"Status": "Ready",
"Title": "STOUT: SMILES to IUPAC name translator",
"Description": "Small molecules are represented by a variety of machine-readable strings (SMILES, InChi, SMARTS, among others). On the contrary, IUPAC (International Union of Pure and Applied Chemistry) names are devised for human readers. The authors trained a language translator model treating the SMILES and IUPAC as two different languages. 81 million SMILES were downloaded from PubChem and converted to SELFIES for model training. The corresponding IUPAC names for the 81 million SMILES were obtained with ChemAxon molconvert software.\n",
"Mode": "Pretrained",
"Input": [
"Compound"
],
"Input Shape": "Single",
"Task": [
"Representation"
],
"Output": [
"Text"
],
"Output Type": [
"String"
],
"Output Shape": "Single",
"Interpretation": "IUPAC name of a specific SMILES",
"Tag": [
"Chemical notation",
"Chemical language model"
],
"Publication": "https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4",
"Source Code": "https://github.com/Kohulan/Smiles-TO-iUpac-Translator",
"License": "MIT",
"Contributor": "carcablop"
}
Maybe you could consider automatic redeployment through a DevOps pipeline once the GitHub code has been integrated and tested, so that the two stay in sync.
Hi @samuelmaina
To check the python version of a package you need to activate the conda environment, and then for example call conda list
This will tell you which version of stout is Ersilia running
You can also check the requirements of the ersilia model, what you pasted above is simply the metadata file which does not contain versioning history
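Besides `conda list`, the installed version of a package can also be read from inside the activated environment with Python's standard library. This is a minimal sketch that assumes Python 3.8 or newer (where `importlib.metadata` is available); `installed_version` is a hypothetical helper name, and `STOUT-pypi` is the distribution name shown in the `pip show` output above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Run this inside the environment you want to inspect.
print(installed_version("STOUT-pypi"))
```

Running the same snippet in the local environment and in the model's environment gives the two versions to compare.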
Thanks for the correction. Ersilia is running version 2.0.1. I activated the model's own conda environment using its code and ran conda list.
Thanks @samuelmaina
we will update this model in the near future when we incorporate the reverse translation from iupac to smiles Let's move to week 3 tasks!
That would be really good as the model would be utilized fully.
OpenMM
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005659
https://github.com/openmm/openmm
OpenMM is a toolkit for molecular simulation. It can be used either as a stand-alone application for running simulations, or as a library you call from your own code. It provides a combination of extreme flexibility, openness, and high performance (especially on recent GPUs) that makes it truly unique among simulation codes.
Python, C, C++, and Fortran
Conda env Python 3.x
SUMMARY AND RELEVANCE TO ERSILIA'S MISSION Molecular simulations can accelerate drug discovery by reducing the time and cost associated with traditional experimental approaches. By using simulations to screen large libraries of compounds, researchers can quickly identify promising drug candidates for further testing and development, thereby streamlining the drug discovery process. In drug discovery, molecular simulations can help researchers design and optimize drug candidates that target specific proteins or other biomolecules involved in disease pathways. For infectious and neglected diseases this can be particularly useful, because many of these diseases are caused by pathogens that have evolved to evade traditional drug therapies, making them hard for researchers to combat. Simulations can be run quickly and many models can be built, which will accelerate the path to finding cures for these diseases. Simulation can also be used to reduce drug side effects: it can help identify and optimize small-molecule drug candidates that bind their targets with high affinity and specificity, thereby reducing the likelihood of side effects and increasing efficacy. The toolkit supports a variety of programming languages, so it will have a wide audience in Ersilia.
TASK Simulation
LICENSE None
TAGS Simulation Drug research
@GemmaTuron Should I add more details?
Hi @samuelmaina !
Thanks, a few comments:
Looking forward to the next articles!
Thanks very much for the feedback. I will be more careful with the model descriptions next time.
@GemmaTuron, I found this model https://github.com/yueyu1030/SumGNN. It does not come with a trained model, but its code seems to produce a very promising model for drug interaction prediction. I was thinking Ersilia could use the code to train the model and then serve it. Should I include it?
Yes, if the data is available we can retrain it.
SumGNN
SumGNN is a trainer for Drug-Drug Interaction (DDI) prediction that incorporates a knowledge summarization graph neural network, an improvement on the traditional KG (Knowledge Graph) networks used in existing baseline trainers. Models produced by SumGNN achieved a pharmacological effect prediction score 5.57% better than other trainers. DDI prediction is critical in determining the side effects of drugs in people with pre-existing conditions. The pharmacological effect prediction score measures the adversity of the interaction between two or more drugs. The DDI model can be used to rule out some medicines for a patient or to suggest a change in a drug's chemical composition to suit a patient. This will accelerate drug discovery by offering faster and more accurate reaction feedback.
https://bitbucket.org/kaistsystemsbiology/deepddi/src/master/data/ http://snap.stanford.edu/biodata/datasets/10017/10017-ChChSe-Decagon.html https://github.com/hetio/hetionet
The datasets above are used to train a prediction model in the trainer.
atomic-mapping
Chemistry Simulation
Regression
https://github.com/yueyu1030/SumGNN
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab207/6189090
None
RXNFP - chemical reaction fingerprints
rxnfp is a model that generates reaction fingerprints (representations under which reactions with similar reagents and mechanisms fall into the same reaction class). Fingerprints make it easier for scientists to communicate complicated chemical reactions. The rxnfp model produces reaction fingerprints quickly, whereas traditional fingerprinting is tedious and slow. The model allows visual clustering of data and fast similarity searches among chemical reactions.
The model lets scientists share knowledge faster and study reactions more quickly thanks to the similarities it reveals. This will accelerate drug discovery.
rxn-fingerprint
Fingerprint
Generative
https://github.com/rxn4chemistry/rxnfp/tree/master/
https://chemrxiv.org/engage/chemrxiv/article-details/60c753a0bdbb89acf8a3a4b5
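The similarity search mentioned in the description can be sketched with plain Python once fingerprints are available as lists of floats. Cosine similarity is an assumption made here for illustration, not necessarily the measure used by the rxnfp authors, and the short vectors and reaction labels below are made up rather than real fingerprints.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, library):
    """Return the key of the library fingerprint closest to the query."""
    return max(library, key=lambda name: cosine_similarity(query, library[name]))


# Toy 3-dimensional vectors standing in for 256-dimensional fingerprints.
library = {
    "amide coupling": [0.9, 0.1, 0.0],
    "ester hydrolysis": [0.0, 0.2, 0.9],
}
print(most_similar([1.0, 0.0, 0.1], library))
```

With real fingerprints the same function would rank a library of known reactions against a new one.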
Thanks @samuelmaina !
Can you add those to the model suggestion list? And as a next step, would you be up for trying to run the rxn-fingerprint model while you prepare your final application? Let's see if it would be easy to incorporate in the Hub!
I have added the model. I have one pending model, but I will run the rxn-fingerprint as soon as I am done with the third. Thank you.
MolScore
MolScore is a model for automatically scoring de novo compounds. The scoring is used to evaluate de novo molecular designs when models are shared among scientists. De novo design is useful for exploring a broader chemical space, and it can improve drug candidates in a cost-effective and timely manner.
mol-score
Scoring De Novo Designs
Regressive
https://github.com/MorganCThomas/MolScore
https://chemrxiv.org/engage/chemrxiv/article-details/6253f5f66c989c04f6b40986
Hey @samuelmaina !
MolScore seems very useful for generative models in reinforcement learning (for example, REINVENT). Won't be able to add it to the Hub directly because it needs to be coupled to a generative model but I definitely will see if we can implement it for some of our projects, thanks!
If you have time, test the rxn-fps as next steps :)
Thanks very much, madam @GemmaTuron. I am running the model and will give you the results when I am done. Thank you very much.
@GemmaTuron I ran the rxn-fps model successfully and it produced the expected results. I ran into git SSH issues when downloading the code from GitHub, but those can be solved with a Google (or Bing) search.
(rxnfp) samuelmayna@SAM:~/rxnfp_project$ python project_1.py
256
[-2.0174951553344727, 1.760203242301941, -1.3323537111282349, -1.1095025539398193, 1.2254549264907837]
Hello @GemmaTuron , Any task I can do?
Hi @samuelmaina !
It would be fantastic if you can share a bit more about the steps you took to running rxn so that we can reproduce it (like, which packages did you have to install etc)
Also @samuelmaina it would be great if you can have a look at this issue and let us know if the problems are persisting! https://github.com/ersilia-os/ersilia/issues/343
For rxn, I used conda. I ran the steps outlined in the GitHub repo, which are:
conda create -n rxnfp python=3.6 -y
conda activate rxnfp
conda install -c rdkit rdkit=2020.03.3 -y
conda install -c tmap tmap -y
git clone git@github.com:rxn4chemistry/rxnfp.git
cd rxnfp
pip install -e .
The rdkit module is used to handle compounds and their reactions. tmap is used for visual presentation of compounds.
All the steps ran smoothly except for git clone git@github.com:rxn4chemistry/rxnfp.git
which uses SSH (a means of secure communication that ensures the source and the destination are legitimate). I didn't have SSH keys set up, but after researching in blogs and articles the issue was resolved. One can instead clone over HTTPS using
git clone https://github.com/rxn4chemistry/rxnfp.git
which does not need SSH keys.
To run the starter code, run python example.py, with example.py containing the example code from the README. To see
other functionality, visit https://rxn4chemistry.github.io/rxnfp/
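The clone problem described above comes down to the remote URL form: the `git@host:owner/repo.git` form goes through SSH and needs keys, while the `https://host/owner/repo.git` form does not. A minimal sketch of the conversion follows; `ssh_to_https` is a hypothetical helper written for this example, not part of git or rxnfp.

```python
import re


def ssh_to_https(url):
    """Convert 'git@host:owner/repo.git' to 'https://host/owner/repo.git'.

    URLs that are not in the SSH shorthand form are returned unchanged.
    """
    match = re.fullmatch(r"git@([^:]+):(.+)", url)
    if match is None:
        return url
    host, path = match.groups()
    return f"https://{host}/{path}"


print(ssh_to_https("git@github.com:rxn4chemistry/rxnfp.git"))
```

The resulting URL can be passed to `git clone` on a machine that has no SSH keys configured.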
@GemmaTuron The final application asks for answers to specific questions. Do you have any Ersilia-specific questions? Also, could you please provide me with the project timeline for the internship?
Hi @samuelmaina ! I've added some info on the Slack channel for the application. Do you want to go ahead and try to incorporate rxn in the Ersilia Model Hub in parallel to preparing the application? Check the steps for it in our documentation and let me know!
Yes @GemmaTuron. I will incorporate it. I will also run the model in the issue. Thank you very much.
@GemmaTuron I am getting this error:
Model API eos3ae7:predict did not produce an output
Traceback (most recent call last):
  File "/home/samuelmayna/eos/repository/eos3ae7/20230328090007_475C16/eos3ae7/artifacts/framework/code/main.py", line 10, in <module>
    from chemvae.vae_utils import VAEUtils
  File "/home/samuelmayna/eos/repository/eos3ae7/20230328090007_475C16/eos3ae7/artifacts/framework/code/chemvae/vae_utils.py", line 4, in <module>
    import yaml
ModuleNotFoundError: No module named 'yaml'
The model does not return an output, and there is a problem with the yaml dependency. I have tried multiple times to install yaml in ersilia, but it does not solve the issue. eos3ae7_fetch.log
I have also opened a model incorporation issue for the rxn-fingerprint model. Could you please take a look?
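A quick way to narrow down an error like this is to check, from inside the environment that actually runs the model code, which imports are resolvable. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper. Note that the import name `yaml` is provided by the PyYAML distribution, and, as an assumption based on the conda discussion earlier in this thread, the check needs to run in the model's own environment rather than in the ersilia one.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules named in the tracebacks in this thread; run inside the model's environment.
print(missing_modules(["yaml", "rdkit"]))
```

An empty list means every listed module is importable in the current environment.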
Hi @samuelmaina !
I've approved the rxn-fingerprint model, you can go ahead and start working on the repo. For eos3ae7, thanks for testing. Could you add this information to the issue for that model so that we can pick it up?
Sure. I will start right away.
@GemmaTuron I am running the model in ersilia cli and I am getting this error:
Detailed error:
Model API eos6aun:run did not produce an output
Traceback (most recent call last):
  File "/home/samuelmayna/eos/repository/eos6aun/20230329133143_63657D/eos6aun/artifacts/framework/code/main.py", line 10, in <module>
    from rxnfp.transformer_fingerprints import (
  File "/home/samuelmayna/eos/repository/eos6aun/20230329133143_63657D/eos6aun/artifacts/framework/code/rxnfp/transformer_fingerprints.py", line 20, in <module>
    from .tokenization import (
  File "/home/samuelmayna/eos/repository/eos6aun/20230329133143_63657D/eos6aun/artifacts/framework/code/rxnfp/tokenization.py", line 12, in <module>
    from rdkit import Chem
ModuleNotFoundError: No module named 'rdkit'
The installation process has not thrown any errors, and I have included the necessary dependencies in the Dockerfile.
FROM bentoml/model-server:0.11.0-py37
MAINTAINER ersilia
RUN conda install -c rdkit rdkit=2020.03.3
RUN conda install -c tmap tmap
RUN pip install rxnfp
WORKDIR /repo
COPY . /repo
I have tried putting only one rdkit and one tmap, but it's not working. What could be the problem?
2 comments / suggestions see if that helps:
@GemmaTuron Really sorry for the late reply. After long debugging, the problem was using -c for caching to reduce download time for new downloads in conda install. I was able to serve the model successfully, but I am having trouble predicting with the eml_canonical.csv data: it is returning None somewhere, and I am working on it. Any advice? From the fetch logs, the single input produced the required output, but for eml_canonical.csv I get this error.
    return_value = func(*args, **kwargs)
  File "/home/samuelmayna/miniconda3/envs/ersilia/lib/python3.7/site-packages/bentoml/cli/click_utils.py", line 99, in wrapper
    return func(*args, **kwargs)
  File "/home/samuelmayna/ersilia/ersilia/cli/commands/api.py", line 38, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/samuelmayna/ersilia/ersilia/core/model.py", line 343, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/samuelmayna/ersilia/ersilia/core/model.py", line 357, in api_task
    for r in result:
  File "/home/samuelmayna/ersilia/ersilia/core/model.py", line 184, in _api_runner_iter
    for result in api.post(input=input, output=output, batch_size=batch_size):
  File "/home/samuelmayna/ersilia/ersilia/serve/api.py", line 330, in post
    results, output, model_id=self.model_id, api_name=self.api_name
  File "/home/samuelmayna/ersilia/ersilia/io/output.py", line 283, in adapt
    df = self._to_dataframe(result)
  File "/home/samuelmayna/ersilia/ersilia/io/output.py", line 229, in _to_dataframe
    output_keys_expanded = self.__expand_output_keys(vals, output_keys)
  File "/home/samuelmayna/ersilia/ersilia/io/output.py", line 197, in __expand_output_keys
    t = self._guess_pure_dtype_if_absent(v)
  File "/home/samuelmayna/ersilia/ersilia/io/output.py", line 181, in _guess_pure_dtype_if_absent
    return dtype["type"]
TypeError: 'NoneType' object is not subscriptable
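The last line of the traceback is the generic Python error raised when code indexes into None. The sketch below reproduces the pattern and shows one defensive guard; it is not Ersilia's actual code, and `describe_value` and `safe_type` are hypothetical stand-ins for a type-guessing helper that returns None when a model emits no value for an input.

```python
def describe_value(value):
    """Hypothetical type guesser: returns a dict, or None if the type is unknown."""
    if value is None:
        return None
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, (int, float)):
        return {"type": "numeric"}
    if isinstance(value, list):
        return {"type": "array"}
    return None


def safe_type(value, default="unknown"):
    """Look up the type without indexing into None."""
    info = describe_value(value)
    # Without this check, info["type"] raises
    # TypeError: 'NoneType' object is not subscriptable
    return default if info is None else info["type"]


print(safe_type([0.1, 0.2]))
print(safe_type(None))
```

In practice it also helps to look for inputs in the batch for which the model returned nothing, since a single empty result is enough to trigger this error.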
Solved the error above
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application