dan2097 / opsin

Open Parser for Systematic IUPAC Nomenclature. Chemical name to structure conversion
https://opsin.ch.cam.ac.uk
MIT License

Aromatic SMILES #139

Open merkys opened 3 years ago

merkys commented 3 years ago

As of v2.5.0, OPSIN outputs kekulized SMILES (benzene is translated to C1=CC=CC=C1). If the information about the ring aromaticity is known to OPSIN, output of aromatic SMILES (benzene translated to c1ccccc1) would be very beneficial, as algorithms to aromatize kekulized structures are not straightforward. It would be best to have both output forms available, controllable via a command line option.

dan2097 commented 3 years ago

OPSIN does internally have a concept of an atom having the "maximum number of non-cumulative double bonds", which roughly corresponds to aromaticity in SMILES, but there are differences. In OPSIN's internal format it's not incorrect to represent pyrrole as n1cccc1, which is invalid in SMILES*. My understanding is that most toolkits do have a method for perceiving aromaticity, so I'm not that clear on the use case. In your proposal would you expect benzene and cyclohexa-1,3,5-triene to have different SMILES?

* In this case OPSIN does actually use [nH]1cccc1, with the hydrogen on the N being interpreted as a hint that, if unspecified, pyrrole should be assumed to be 1H-pyrrole.
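The pyrrole point can be checked with a toolkit. The following is only a minimal sketch, assuming RDKit is installed (RDKit is not part of OPSIN, and the helper name `parses` is hypothetical). RDKit's SMILES parser rejects n1cccc1 because it cannot assign alternating single and double bonds without knowing which atom carries the hydrogen, whereas [nH]1cccc1 is accepted.

```python
try:
    from rdkit import Chem  # optional third-party dependency (assumption)
except ImportError:
    Chem = None


def parses(smiles):
    """Return True/False for whether RDKit accepts the SMILES,
    or None when RDKit is not installed."""
    if Chem is None:
        return None
    # MolFromSmiles returns None when parsing or sanitization fails.
    return Chem.MolFromSmiles(smiles) is not None


for candidate in ("n1cccc1", "[nH]1cccc1"):
    print(candidate, parses(candidate))
```

With RDKit available, the first string is expected to be rejected and the second accepted; other toolkits may behave differently.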

merkys commented 3 years ago

My understanding is that most toolkits do have a method for perceiving aromaticity, so I'm not that clear on the use case.

I am doing analysis of SMILES without toolkits. Aromaticity perception from scratch requires identification of rings, and this is already quite cumbersome and computationally intensive.
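To make the use case concrete, here is a minimal sketch of the kind of toolkit-free analysis that becomes easy once the SMILES is in aromatic form. It is an illustration only, not a complete SMILES parser: the function names `atom_symbols` and `aromatic_atom_count` are hypothetical, and the tokenizer assumes well-formed input using the OpenSMILES organic subset and bracket atoms. Aromatic atoms are recognized purely by their lowercase symbols, so no ring detection is needed.

```python
import re

# Bracket atoms, two-letter halogens, then one-letter organic-subset
# symbols (uppercase = aliphatic, lowercase = aromatic) and the wildcard.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops*]")
# Inside brackets: optional isotope digits, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\[\d*(se|as|[bcnops]|[A-Z][a-z]?|\*)")


def atom_symbols(smiles):
    """Return the atom symbols of a SMILES string in order of appearance.
    Bonds, branches and ring-closure digits are skipped."""
    symbols = []
    for token in _TOKEN.findall(smiles):
        if token.startswith("["):
            match = _BRACKET_SYMBOL.match(token)
            if match:
                symbols.append(match.group(1))
        else:
            symbols.append(token)
    return symbols


def aromatic_atom_count(smiles):
    """Count atoms written with a lowercase (aromatic) symbol."""
    return sum(1 for symbol in atom_symbols(smiles) if symbol[0].islower())


print(aromatic_atom_count("c1ccccc1"))     # aromatic benzene
print(aromatic_atom_count("C1=CC=CC=C1"))  # kekulized benzene
```

On the kekulized string the count is zero, and recovering the same information would require finding rings and applying an aromaticity model, which is the cumbersome part.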

By the way, the OpenSMILES specification seems to recommend the aromatic form:

The Kekule form is always acceptable for SMILES input. For output, the aromatic form (using lowercase letters) is preferred. The lowercase symbols eliminate the arbitrary choice of how to assign the single and double bonds, and provide a normalized form that more accurately reflects the electronic configuration.

It also notes that the aromatic form is preferable when matching via SMARTS.

In your proposal would you expect benzene and cyclohexa-1,3,5-triene to have different SMILES?

Good point. Most likely not: cyclohexa-1,3,5-triene is aromatic, so I would expect both to be c1ccccc1.

simonmb commented 1 year ago

@merkys You could just use RDKit to do aromaticity perception. It is usually quite fast, and I have been using it. Or is there a reason not to post-process through a toolkit?
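For reference, a minimal sketch of that post-processing step, assuming RDKit is installed and that OPSIN's kekulized SMILES is already available as a string (the helper name `aromatize` is hypothetical). RDKit perceives aromaticity under its own model while sanitizing the parsed molecule, so writing the molecule back out yields the aromatic form.

```python
try:
    from rdkit import Chem  # optional third-party dependency (assumption)
except ImportError:
    Chem = None


def aromatize(smiles):
    """Return RDKit's canonical aromatic SMILES for the input, or None
    if RDKit is not installed or the input does not parse."""
    if Chem is None:
        return None
    # Parsing sanitizes the molecule, which includes aromaticity perception.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)


print(aromatize("C1=CC=CC=C1"))  # kekulized benzene as emitted by OPSIN
```

Note that the result follows RDKit's aromaticity model, which is not guaranteed to agree with OPSIN's internal notion or with other toolkits for every ring system.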