datasets input of SMILES canonicalization model

bigchem / transformer-cnn

Transformer CNN for QSAR/QSPR modelling


datasets input of SMILES canonicalization model #5

Open DreamMemory001 opened 3 years ago

DreamMemory001 commented 3 years ago

First of all, I think this is fantastic work. I have two questions about it:

i: In Fig. 1 of the paper, the canonical SMILES of Benzylpenicillin differs from the one I get from the ChEMBL website. The same applies to Fig. 3: the canonical SMILES of CHEMBL351484 differs from the one on the ChEMBL website. When I use RDKit to generate these canonical SMILES, I get the same result as the ChEMBL website, so I am a little confused.

ii: Where can I find the input dataset for the SMILES canonicalization model, that is, the 17,657,995 canonicalization pairs written in reaction format and separated by ' >> '? Each pair contains a non-canonical SMILES on the left side and a canonical SMILES for the same molecule on the right side.
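To make the format in question ii concrete, here is a small sketch of how one such line could be read. The function name `parse_pair` and the example line are hypothetical and only illustrate the ' >> ' layout described above; they are not taken from the repository.

```python
from typing import Tuple


def parse_pair(line: str, sep: str = " >> ") -> Tuple[str, str]:
    """Split one 'non-canonical >> canonical' line into its two SMILES strings."""
    source, found, target = line.strip().partition(sep)
    # Reject lines with a missing separator or an empty side.
    if not found or not source or not target:
        raise ValueError(f"not a valid pair line: {line!r}")
    return source, target


# Illustrative line only: two spellings of ethanol.
source, target = parse_pair("OCC >> CCO\n")
```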

I hope to get your reply. Thanks.

carpovpv commented 3 years ago

Hi,

Unfortunately, there are many implementations of canonicalization, and I do not remember exactly which program and version we used to make the picture. Generally, I use OpenBabel and RDKit. Nonetheless, the point is that you can generate "canonicalized" and "random" SMILES with whatever software you are currently using. As long as you are consistent between training and prediction, you will get the results we described in the paper. The original code used RDKit, but then I started working on a standalone version for ordinary users, and it turned out that OpenBabel is much more convenient in that context.

Concerning the dataset, I can upload it somewhere. But you can make the dataset yourself easily.
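A minimal sketch of how such a dataset could be built with RDKit follows. It is not the code used for the paper. The function names, the `n_random` parameter, and the choice to drop duplicates and pairs whose left side already equals the canonical form are all assumptions made for illustration. RDKit is imported lazily, so the pairing logic can also be used with another toolkit (for example OpenBabel) by passing different `canonical` and `randomize` callables.

```python
from typing import Callable, Iterable, Iterator, Optional

SEPARATOR = " >> "  # separator quoted in the question above


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    from rdkit import Chem  # lazy import; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


def rdkit_random(smiles: str) -> Optional[str]:
    """Return one randomly ordered SMILES for the same molecule, or None."""
    from rdkit import Chem  # lazy import; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)


def make_pairs(
    smiles_list: Iterable[str],
    canonical: Callable[[str], Optional[str]] = rdkit_canonical,
    randomize: Callable[[str], Optional[str]] = rdkit_random,
    n_random: int = 3,  # assumed number of attempts per molecule
) -> Iterator[str]:
    """Yield 'non-canonical >> canonical' lines for each parsable molecule."""
    for smi in smiles_list:
        target = canonical(smi)
        if target is None:
            continue  # skip molecules the toolkit cannot parse
        seen = set()
        for _ in range(n_random):
            source = randomize(smi)
            # Assumption: skip failures, duplicates, and trivial pairs.
            if source is None or source == target or source in seen:
                continue
            seen.add(source)
            yield f"{source}{SEPARATOR}{target}"
```

With RDKit installed, `list(make_pairs(["CC(=O)Oc1ccccc1C(=O)O"]))` would give up to three lines for aspirin. The exact strings depend on the RDKit version and on the random atom ordering, which is why the same toolkit should be used for both training and prediction.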

DreamMemory001 commented 3 years ago

Thanks for your reply. I use RDKit version 2021.3.5, but when I try to plot the canonical SMILES from Fig. 1, it returns None, so I am confused.

Finally, I hope you can share a link to the input dataset for your SMILES canonicalization model, because I would like to see its format. Thank you very much.

DreamMemory001 commented 3 years ago

I don't know if you saw my comment. If you have spare time, I hope you can give me a brief answer. Thank you very much.