Canonicalize PSMILES

IMPORTANT NOTE: The code and data shared here are available for academic, non-commercial use only.

I recommend using the psmiles Python package, which integrates canonicalization and other tools for working with PSMILES.

PSMILES (Polymer SMILES) is a chemical language for representing polymer structures. A PSMILES string contains two star symbols ([*] or *) that mark the two endpoints of the polymer repeat unit; otherwise it follows the Daylight SMILES syntax as defined by OpenSMILES. PSMILES was developed as part of work published on arXiv.

The raw PSMILES syntax is ambiguous and non-unique; i.e., the same polymer may be written using many PSMILES strings:

| Polyethylene | Polyethylene oxide | Polypropylene |
| --- | --- | --- |
| `[*]C[*]` | `[*]CCO[*]` | `[*]CC([*])C` |
| `[*]CC[*]` | `[*]COC[*]` | `[*]CC(CC([*])C)C` |
| `[*]CCC[*]` | `[*]OCC[*]` | `CC([*])C[*]` |
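
To make the ambiguity concrete, the following toy sketch checks whether two PSMILES strings describe the same chain. It is not part of this package: `backbone` and `same_chain` are hypothetical helpers, and they only handle unbranched repeat units made of single-letter atoms (such as the polyethylene and polyethylene oxide strings `[*]CC[*]` and `[*]COC[*]`). Branched strings such as the polypropylene examples need a real SMILES parser.

```python
from math import gcd


def backbone(psmiles: str) -> str:
    """Return the atoms between the two stars of a simple linear PSMILES string."""
    s = psmiles.replace("[*]", "*")
    if len(s) < 3 or s.count("*") != 2 or s[0] != "*" or s[-1] != "*":
        raise ValueError("expected exactly two stars, one at each end")
    body = s[1:-1]
    # Assumption: only unbranched chains of single-letter atoms are supported.
    if not set(body) <= set("BCNOPS"):
        raise ValueError("only unbranched single-letter backbones are supported")
    return body


def same_chain(a: str, b: str) -> bool:
    """True if both repeat units generate the same infinite linear chain."""
    x, y = backbone(a), backbone(b)
    # Expand both repeat units to a common length (least common multiple).
    length = len(x) * len(y) // gcd(len(x), len(y))
    xs = x * (length // len(x))
    ys = y * (length // len(y))
    # The same chain may be read from a shifted start or in the reverse direction.
    doubled = ys + ys
    return xs in doubled or xs[::-1] in doubled


print(same_chain("[*]C[*]", "[*]CCC[*]"))    # True
print(same_chain("[*]CCO[*]", "[*]COC[*]"))  # True
print(same_chain("[*]CCO[*]", "[*]CC[*]"))   # False
```

String comparison like this breaks down as soon as branches, rings, or bracket atoms appear, which is why a graph-based canonicalization is needed in practice.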

The canonicalization routine of the PSMILES package finds a canonical version of a PSMILES string in four steps:

1. Find the shortest representation of the PSMILES string

   [*]CCOCCO[*] -> [*]CCO[*]

2. Make the PSMILES string cyclic

   [*]CCO[*] -> C1 CCO C1

3. Apply the canonicalization routine as implemented in RDKit

   C1 CCO C1 -> C1 COC C1

4. Break the cyclic bond

   C1 COC C1 -> [*]COC[*]
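
The first step, finding the shortest representation, can be illustrated with a minimal sketch. This is not the package's implementation: `linear_body` and `shorten_psmiles` are hypothetical names, and the sketch assumes an unbranched repeat unit made of single-letter atoms, so the problem reduces to finding the smallest repeating period of a string. The cyclization and RDKit steps are not reproduced here.

```python
def linear_body(psmiles: str) -> str:
    """Return the atoms between the two stars of a simple linear PSMILES string."""
    s = psmiles.replace("[*]", "*")
    if len(s) < 3 or s.count("*") != 2 or s[0] != "*" or s[-1] != "*":
        raise ValueError("expected exactly two stars, one at each end")
    body = s[1:-1]
    # Assumption: only unbranched chains of single-letter atoms are supported.
    if not set(body) <= set("BCNOPS"):
        raise ValueError("only unbranched single-letter backbones are supported")
    return body


def shorten_psmiles(psmiles: str) -> str:
    """Reduce a linear repeat unit to its smallest repeating period."""
    body = linear_body(psmiles)
    n = len(body)
    for size in range(1, n + 1):
        # A prefix is a valid period if repeating it rebuilds the whole body.
        if n % size == 0 and body[:size] * (n // size) == body:
            return f"[*]{body[:size]}[*]"
    return f"[*]{body}[*]"  # not reached: size == n always matches


print(shorten_psmiles("[*]CCOCCO[*]"))  # [*]CCO[*]
print(shorten_psmiles("[*]CCC[*]"))     # [*]C[*]
```

The result is the shortest string but not yet a canonical one: `[*]CCO[*]`, `[*]COC[*]`, and `[*]OCC[*]` are all equally short, which is why the remaining steps are needed.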

Install

pip install git+https://github.com/Ramprasad-Group/canonicalize_psmiles.git
