Closed tobigithub closed 1 year ago
Good suggestion! Related to https://github.com/aspuru-guzik-group/selfies/issues/87
@whitead Thanks for coming back so quickly. Just to clarify, I was wondering if there is a way to create "unique unique" or canonical SELFIES? Because randomized SMILES, SELFIES, DeepSMILES and canonical SMILES are all very different variants and require different solutions.
The original 2020 SELFIES paper states that one could make canonical SMILES by translating them to SELFIES and back?
- Standardization outlook The SELFIES concept still requires work to become a standard. Upon publication of this article, the authors will call for a workshop to extend the format to the entire periodic table, allow for stereochemistry, polyvalency, aromaticity, isotopic substitution and other special cases so that all the features present in SMILES are available in SELFIES. Unicode will be employed to create readable symbols that exploit the flexibility of modern text systems without restricting oneself to ASCII characters. In that context, we will pursue to define direct canonicalization of SELFIES, such that there is a canonical SELFIES string for a unique molecule. Currently, SMILES can be made canonical indirectly, by translating them to SELFIES and convert the canonical SMILES back to SELFIES.
Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation https://iopscience.iop.org/article/10.1088/2632-2153/aba947
SMILES. 2. Algorithm for generation of unique SMILES notation http://organica1.org/seminario/smile_2_1988.pdf
Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI https://link.springer.com/article/10.1186/1758-2946-4-22
Randomized SMILES strings improve the quality of molecular generative models https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0393-0
Atomic ring invariant and Modified CANON extended connectivity algorithm for symmetry perception in molecular graphs and rigorous canonicalization of SMILES https://link.springer.com/article/10.1186/s13321-020-00453-4
Correct @tobigithub - currently the only way to get canonical SELFIES is to canonicalize the SMILES with an external tool like RDKit first.
from rdkit import Chem
def canonicalize(smiles):
    # Parse the SMILES and write it back out in RDKit's canonical form.
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles), canonical=True)
Hi, I wonder if the SELFIES algorithm can create canonical SMILES/SELFIES natively? Or is that something that has to be done with RDKit?
Output:
Except for the first SMILES code, which is zwitterionic on multiple atoms, the rest should be caffeine. So here are the 5 examples with the same InChIKey and also the same SMILES from the RDKit canonicalizer (zwitterion excluded).
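For reference, a check like this can be sketched with RDKit as follows. The caffeine strings below are illustrative stand-ins written for this sketch, not the five examples from the post, and `same_molecule` is a hypothetical helper name. It assumes an RDKit build with InChI support.

```python
from rdkit import Chem


def same_molecule(smiles_list):
    """True if all SMILES share one InChIKey and one canonical SMILES."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    if any(m is None for m in mols):
        raise ValueError("at least one SMILES could not be parsed")
    inchikeys = {Chem.MolToInchiKey(m) for m in mols}
    canonical = {Chem.MolToSmiles(m, canonical=True) for m in mols}
    return len(inchikeys) == 1 and len(canonical) == 1


# Illustrative caffeine variants (assumed inputs, not from the post).
caffeine_variants = [
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "O=C1N(C)C(=O)N(C)C2=C1N(C)C=N2",
]
print(same_molecule(caffeine_variants))
```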
So I wonder what is the correct way to create canonical SELFIES or SMILES without using RDKit or any other external code? Thank you! Tobias