aspuru-guzik-group / selfies

Robust representation of semantically constrained graphs, in particular for molecules in chemistry
Apache License 2.0

Canonical SELFIES #97

Closed tobigithub closed 1 year ago

tobigithub commented 1 year ago

Hi, I wonder if the SELFIES algorithm can create canonical SMILES/SELFIES natively? Or is that something that has to be done with RDKit?

# SELFIES roundtrip based on example
# [Ref1] https://www.arxiv-vanity.com/papers/2302.03620/
# [Ref2] https://iopscience.iop.org/article/10.1088/2632-2153/aba947

import selfies as sf

def roundtrip(smiles):
    # Encode SMILES to SELFIES, then decode back to SMILES.
    try:
        selfies = sf.encoder(smiles)
        return sf.decoder(selfies)
    except sf.EncoderError:
        return None

# Six different SMILES for caffeine (InChI Trust); 172 can be found publicly, >4000 are possible
# Six example SMILES: https://www.inchi-trust.org/technical-faq-2/
# 4160 SMILES for caffeine:
# Source: https://nextmovesoftware.com/blog/2014/07/15/how-do-i-write-thee-let-me-count-the-ways/

caffeine = [
"[c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-]",
"CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12",
"Cn1cnc2n(C)c(=O)n(C)c(=O)c12",
"Cn1cnc2c1c(=O)n(C)c(=O)n2C",
"O=C1C2=C(N=CN2C)N(C(=O)N1C)C",
"CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]

for item in caffeine:
    print("SMILES: ", item)

print()

for item in caffeine:
    item = roundtrip(item)
    print("Selfie: ", item)

Output:

SMILES:  [c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-]
SMILES:  CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12
SMILES:  Cn1cnc2n(C)c(=O)n(C)c(=O)c12
SMILES:  Cn1cnc2c1c(=O)n(C)c(=O)n2C
SMILES:  O=C1C2=C(N=CN2C)N(C(=O)N1C)C
SMILES:  CN1C=NC2=C1C(=O)N(C)C(=O)N2C

Selfie:  None
Selfie:  CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12
Selfie:  CN1C=NC=2N(C)C(=O)N(C)C(=O)C1=2
Selfie:  CN1C=NC2=C1C(=O)N(C)C(=O)N2C
Selfie:  O=C1C2=C(N=CN2C)N(C(=O)N1C)C
Selfie:  CN1C=NC2=C1C(=O)N(C)C(=O)N2C

Except for the first SMILES, which is zwitterionic on multiple atoms, all of these should be caffeine, yet the roundtrip outputs above are not all identical. Here are the five examples (zwitterion excluded): they share the same InChIKey and give the same SMILES with the RDKit canonicalizer.

# Open Babel: SMILES to InChIKey
RYYVLZVUVIJVGH-UHFFFAOYSA-N
RYYVLZVUVIJVGH-UHFFFAOYSA-N
RYYVLZVUVIJVGH-UHFFFAOYSA-N
RYYVLZVUVIJVGH-UHFFFAOYSA-N
RYYVLZVUVIJVGH-UHFFFAOYSA-N

# RDKit canonicalizer
# mol2smi(mol, isomericSmiles=False, canonical=True)
RDKIT SMILES:  Cn1c(=O)c2c(ncn2C)n(C)c1=O
RDKIT SMILES:  Cn1c(=O)c2c(ncn2C)n(C)c1=O
RDKIT SMILES:  Cn1c(=O)c2c(ncn2C)n(C)c1=O
RDKIT SMILES:  Cn1c(=O)c2c(ncn2C)n(C)c1=O
RDKIT SMILES:  Cn1c(=O)c2c(ncn2C)n(C)c1=O

So what is the correct way to create canonical SELFIES or SMILES without using RDKit or any other external code? Thank you! Tobias

whitead commented 1 year ago

Good suggestion! Related to https://github.com/aspuru-guzik-group/selfies/issues/87

tobigithub commented 1 year ago

@whitead Thanks for coming back so quickly. Just to clarify, I was wondering if there is a way to create "unique" or canonical SELFIES? Randomized SMILES, SELFIES, DeepSMILES and canonical SMILES are all very different variants and require different solutions.

The original 2020 SELFIES paper states that one could make canonical SMILES by translating them to SELFIES and back?

  1. Standardization outlook The SELFIES concept still requires work to become a standard. Upon publication of this article, the authors will call for a workshop to extend the format to the entire periodic table, allow for stereochemistry, polyvalency, aromaticity, isotopic substitution and other special cases so that all the features present in SMILES are available in SELFIES. Unicode will be employed to create readable symbols that exploit the flexibility of modern text systems without restricting oneself to ASCII characters. In that context, we will pursue to define direct canonicalization of SELFIES, such that there is a canonical SELFIES string for a unique molecule. Currently, SMILES can be made canonical indirectly, by translating them to SELFIES and convert the canonical SMILES back to SELFIES.

Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation https://iopscience.iop.org/article/10.1088/2632-2153/aba947

SMILES. 2. Algorithm for generation of unique SMILES notation http://organica1.org/seminario/smile_2_1988.pdf

Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI https://link.springer.com/article/10.1186/1758-2946-4-22

Randomized SMILES strings improve the quality of molecular generative models https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0393-0

Atomic ring invariant and Modified CANON extended connectivity algorithm for symmetry perception in molecular graphs and rigorous canonicalization of SMILES https://link.springer.com/article/10.1186/s13321-020-00453-4

whitead commented 1 year ago

Correct, @tobigithub: currently the only way to get canonical SELFIES is to canonicalize the SMILES with an external tool like RDKit first.

from rdkit import Chem

def canonicalize(smiles):
    # Parse the SMILES and write it back out in RDKit's canonical form.
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles), canonical=True)