baoilleach / deepsmiles

DeepSMILES - A variant of SMILES for use in machine-learning
MIT License

Shift closure values #7

Open adalke opened 5 years ago

adalke commented 5 years ago

The closure values "0" and "1" will never be seen in the current DeepSMILES. C0 is meaningless, and C1 would bond an atom to itself.

Proposal 1: Shift the closure numbers so that "CC0" corresponds to what is currently "CC2".

The closure value "2" can only be seen with dot disconnections, for example C.C2. Otherwise, a 2 always links to the previous atom, as in CC2 or CN)C2, where the bond already exists. If #6 is implemented, such that closures cannot cross a dot disconnection, then the closure value "2" will never exist in a valid DeepSMILES.

Proposal 2: Shift the closure numbers so that "CCC0" corresponds to what is currently "CCC3".

This would make the closure values 0, 1, and 2 useful.
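
To make the two proposals concrete, here is a rough sketch of the renumbering as a string rewrite. It is not part of the deepsmiles package; `shift_closures` and `_encode` are hypothetical names. It assumes closures are written as a single digit, `%NN`, or `%(NNN)`, and that digits inside bracket atoms (isotopes, charges, hydrogen counts) must be left alone. An offset of 2 corresponds to Proposal 1 and an offset of 3 to Proposal 2.

```python
import re

# Alternatives, tried in order: a bracket atom (copied through unchanged),
# a %(NNN) closure, a %NN closure, or a single-digit closure.
# Assumption: these are the only places digits appear in the string.
_TOKEN = re.compile(r"(\[[^\]]*\])|%\((\d+)\)|%(\d\d)|(\d)")


def _encode(value):
    """Write a closure value back using the shortest of the three forms."""
    if value < 0:
        raise ValueError("closure value would become negative")
    if value < 10:
        return str(value)
    if value < 100:
        return "%" + str(value)
    return "%(" + str(value) + ")"


def shift_closures(deepsmiles, offset):
    """Subtract `offset` from every closure value outside bracket atoms."""
    def repl(match):
        if match.group(1) is not None:
            # Bracket atom: digits here are not closures.
            return match.group(1)
        digits = match.group(2) or match.group(3) or match.group(4)
        return _encode(int(digits) - offset)

    return _TOKEN.sub(repl, deepsmiles)


print(shift_closures("CC2", 2))      # Proposal 1
print(shift_closures("CCC3", 3))     # Proposal 2
print(shift_closures("cccccc6", 2))  # benzene under Proposal 1
```

Under Proposal 1 the benzene closure becomes 4 rather than 6, so the closure value no longer equals the ring size.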

baoilleach commented 5 years ago

Currently, the closure value of 6 indicates a six-membered ring. I think that a stronger argument would need to be made to give up this nice feature.

adalke commented 5 years ago

I see your point. You describe the project as "A variant of SMILES for use in machine-learning". I don't think machine-learning systems do better if '6' indicates a six-membered ring, instead of '4'.

However, I believe your point is that "SMILES", at least as Weininger envisions it, is fundamentally meant for humans, so should include some of that human-centered worldview to be honestly called a SMILES variant.

I can respect that decision.

I think there's also room for a variant which is less SMILES-ish in syntax but easier to use by naive systems (algorithms which don't specifically know the (Deep)SMILES grammar).