I recommend using the `psmiles` Python package, which integrates canonicalization and other tools for working with PSMILES.
PSMILES (Polymer SMILES) is a chemical language for representing polymer structures. A PSMILES string contains two star symbols (`[*]` or `*`) that mark the two endpoints of the polymer repeat unit; otherwise it follows the Daylight SMILES syntax defined at OpenSMILES. Developed as part of arXiv.
The raw PSMILES syntax is ambiguous and non-unique; i.e., the same polymer may be written using many PSMILES strings:
Polyethylene | Polyethylene oxide | Polypropylene |
---|---|---|
`[*]C[*]` | `[*]CCO[*]` | `[*]CC([*])C` |
`[*]CC[*]` | `[*]COC[*]` | `[*]CC(CC([*])C)C` |
`[*]CCC[*]` | `[*]OCC[*]` | `CC([*])C[*]` |
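One source of this non-uniqueness is that the repeat unit of a linear chain can start at any backbone atom. The sketch below is only an illustration and is not part of the package: the helper name `rotated_psmiles` is hypothetical, and it assumes an unbranched backbone written with single-letter uppercase atom symbols and `[*]` at both ends. Under those assumptions it lists the cyclic rotations of the repeat unit, which reproduces the polyethylene oxide column of the table.

```python
def rotated_psmiles(psmiles):
    """List PSMILES strings obtained by cyclically rotating a simple backbone.

    Hypothetical helper for illustration only. Assumes '[*]' at both ends and
    an unbranched backbone of single-letter uppercase atoms (e.g. C, O, N).
    """
    star = "[*]"
    if not (psmiles.startswith(star) and psmiles.endswith(star)):
        raise ValueError("expected '[*]' at both ends of the string")
    backbone = psmiles[len(star):-len(star)]
    # Reject branches, ring closures, brackets, and two-letter or aromatic atoms.
    if not (backbone.isalpha() and backbone.isupper()):
        raise ValueError("only simple unbranched backbones are supported")
    rotations = {backbone[i:] + backbone[:i] for i in range(len(backbone))}
    return sorted(f"{star}{unit}{star}" for unit in rotations)


for variant in rotated_psmiles("[*]CCO[*]"):
    print(variant)
```

All three printed strings describe polyethylene oxide, which is why a canonicalization step is needed before PSMILES strings can be compared or used as dictionary keys.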
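The canonicalization routine of the PSMILES package finds a canonical form of a PSMILES string in four steps:

1. Reduce the string to its shortest repeat unit: `[*]CCOCCO[*]` -> `[*]CCO[*]`
2. Connect the two stars to close a ring: `[*]CCO[*]` -> `C1 CCO C1`
3. Canonicalize the ring SMILES: `C1 CCO C1` -> `C1 COC C1`
4. Break the ring to restore the stars: `C1 COC C1` -> `[*]COC[*]`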
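The first reduction shown above, `[*]CCOCCO[*]` -> `[*]CCO[*]`, can be illustrated with a string-level sketch. This is not the package's implementation: the function name `shortest_repeat_unit` is hypothetical, it assumes `[*]` sits at both ends of the string, and it only detects literal repetition of the text between the stars without parsing SMILES. The ring-closing and canonicalization steps need a cheminformatics toolkit and are not sketched here.

```python
def shortest_repeat_unit(psmiles):
    """Collapse a literally repeated repeat unit to a single copy.

    Hypothetical helper for illustration only. It compares raw text between
    the two '[*]' end markers and does not understand SMILES grammar.
    """
    star = "[*]"
    if len(psmiles) <= 2 * len(star):
        raise ValueError("expected at least one atom between the stars")
    if not (psmiles.startswith(star) and psmiles.endswith(star)):
        raise ValueError("expected '[*]' at both ends of the string")
    inner = psmiles[len(star):-len(star)]
    n = len(inner)
    # Try candidate unit lengths from shortest to longest; the full string
    # always matches, so the loop is guaranteed to return.
    for size in range(1, n + 1):
        if n % size == 0 and inner[:size] * (n // size) == inner:
            return f"{star}{inner[:size]}{star}"


print(shortest_repeat_unit("[*]CCOCCO[*]"))
print(shortest_repeat_unit("[*]CCC[*]"))
```

A string that is not a literal repetition, such as `[*]CC(C)[*]`, is returned unchanged.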
pip install git+https://github.com/Ramprasad-Group/canonicalize_psmiles.git
See also test.ipynb
from canonicalize_psmiles.canonicalize import canonicalize
smiles = "[*]NC(C)CC([*])=O"
print(smiles)
print(canonicalize(smiles))