Ramprasad-Group / canonicalize_psmiles

Tool for the canonicalization of Polymer SMILES (PSMILES) strings

Canonicalize PSMILES

IMPORTANT NOTE: The code and data shared here are available for academic, non-commercial use only.

I recommend using the psmiles Python package, which integrates canonicalization with other tools for working with PSMILES.

PSMILES (Polymer SMILES) is a chemical language for representing polymer structures. A PSMILES string contains two star symbols ([*] or *) that mark the two endpoints of the polymer repeat unit; otherwise it follows the Daylight SMILES syntax as defined at OpenSMILES. PSMILES was developed as part of work published on arXiv.
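The two-star rule can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `count_stars` and `has_two_endpoints` are hypothetical and not part of this package, and the check looks at the star count only. It does not validate the rest of the SMILES syntax, and it does not recognize atom-mapped stars such as [*:1].

```python
import re

# Matches either star notation: the bracketed form "[*]" or a bare "*".
# The bracketed alternative is listed first so "[*]" counts as one star.
STAR_PATTERN = re.compile(r"\[\*\]|\*")


def count_stars(psmiles: str) -> int:
    """Count the endpoint symbols in a PSMILES string (hypothetical helper)."""
    return len(STAR_PATTERN.findall(psmiles))


def has_two_endpoints(psmiles: str) -> bool:
    """Return True if the string has exactly two stars, as PSMILES requires."""
    return count_stars(psmiles) == 2


print(has_two_endpoints("[*]CCO[*]"))  # True
print(has_two_endpoints("*CCO*"))      # True
print(has_two_endpoints("CCO"))        # False
```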

The raw PSMILES syntax is ambiguous and non-unique; i.e., the same polymer may be written using many PSMILES strings:

| Polyethylene | Polyethylene oxide | Polypropylene    |
|--------------|--------------------|------------------|
| [*]C[*]      | [*]CCO[*]          | [*]CC([*])C      |
| [*]CC[*]     | [*]COC[*]          | [*]CC(CC([*])C)C |
| [*]CCC[*]    | [*]OCC[*]          | CC([*])C[*]      |
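To see where some of this non-uniqueness comes from, the sketch below lists every spelling of an unbranched repeat unit obtained by moving the cut point along the chain or by reading the chain in the opposite direction. It is a string-level illustration only, not how this package works: `linear_variants` is a hypothetical helper, and it assumes a backbone made of single-letter atoms joined by single bonds, with no branches, rings, or brackets.

```python
# Single-letter atoms accepted by this simplified sketch (an assumption).
ALLOWED_ATOMS = set("BCNOPS")


def linear_variants(backbone: str) -> list[str]:
    """List the PSMILES spellings of an unbranched repeat unit.

    `backbone` is the repeat unit without its stars, e.g. "CCO".
    """
    if not backbone or any(ch not in ALLOWED_ATOMS for ch in backbone):
        raise ValueError("only unbranched backbones of single-letter atoms are supported")

    variants = set()
    # Consider both reading directions of the chain.
    for unit in (backbone, backbone[::-1]):
        # Shift the position where the repeat unit is cut.
        for i in range(len(unit)):
            rotated = unit[i:] + unit[:i]
            variants.add(f"[*]{rotated}[*]")
    return sorted(variants)


print(linear_variants("CCO"))
# ['[*]CCO[*]', '[*]COC[*]', '[*]OCC[*]']
```

For "CCO" this reproduces the three polyethylene oxide spellings listed above. Branched units such as polypropylene need a graph-based treatment and are outside the scope of this sketch.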

The canonicalization routine of the PSMILES package finds a canonical form of a PSMILES string in four steps:

  1. Finding the shortest representation of a PSMILES string

[*]CCOCCO[*] -> [*]CCO[*]

  2. Making the PSMILES string cyclic

[*]CCO[*] -> C1 CCO C1

  3. Applying the canonicalization routine as implemented in RDKit

C1 CCO C1 -> C1 COC C1

  4. Breaking the cyclic bond

C1 COC C1 -> [*]COC[*]
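The first step can be illustrated at the string level. The following is a simplified sketch, not the package's implementation: `shortest_repeat_unit` is a hypothetical helper that only handles unbranched PSMILES written as [*]...[*] with single-letter atoms, and it looks for the smallest substring whose repetition reproduces the backbone. How the package itself finds the shortest representation is not described here.

```python
# Single-letter atoms accepted by this simplified sketch (an assumption).
ALLOWED_ATOMS = set("BCNOPS")
STAR = "[*]"


def shortest_repeat_unit(psmiles: str) -> str:
    """Reduce a linear PSMILES string to its smallest repeating substring."""
    if len(psmiles) < 2 * len(STAR) or not (
        psmiles.startswith(STAR) and psmiles.endswith(STAR)
    ):
        raise ValueError("expected a string of the form [*]...[*]")

    backbone = psmiles[len(STAR):-len(STAR)]
    if not backbone or any(ch not in ALLOWED_ATOMS for ch in backbone):
        raise ValueError("only unbranched backbones of single-letter atoms are supported")

    n = len(backbone)
    # Try candidate unit lengths from shortest to longest; the full backbone
    # (size == n) always matches, so the loop is guaranteed to return.
    for size in range(1, n + 1):
        unit = backbone[:size]
        if n % size == 0 and unit * (n // size) == backbone:
            return f"{STAR}{unit}{STAR}"
    raise AssertionError("unreachable")


print(shortest_repeat_unit("[*]CCOCCO[*]"))  # [*]CCO[*]
print(shortest_repeat_unit("[*]CCC[*]"))     # [*]C[*]
```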

Install

pip install git+https://github.com/Ramprasad-Group/canonicalize_psmiles.git

How to use

See also test.ipynb

from canonicalize_psmiles.canonicalize import canonicalize

# A PSMILES string with two [*] endpoints
smiles = "[*]NC(C)CC([*])=O"
print(smiles)                # original PSMILES
print(canonicalize(smiles))  # canonicalized PSMILES