baoilleach / deepsmiles

DeepSMILES - A variant of SMILES for use in machine-learning
MIT License
132 stars 30 forks source link

Indexing off in decode when DeepSMILES string has >= 100 rings #17

Closed chrisjonesBSU closed 3 years ago

chrisjonesBSU commented 3 years ago

This seems similar to #5, but the changes made in the resulting PR don't seem to address the issues that I'm running into.

I've been using the deepsmiles format and conversion code to build up SMILES strings for polymers from the string of just the monomer. The process has been working great until my resulting polymer deepsmiles string is for a molecule with 100 or more rings.

It looks like the issue is the assumption that there are always 2 digits following a % sign, and therefore the indexing skips ahead by 2.

in def decode_branches() of decode.py

 96         elif x == '%':
 97             if i == 0:
 98                 raise exceptions.DecodeError(deepsmiles, i, "'%' not allowed as first character")
 99             bondchar = deepsmiles[i-1] if i > 0 and deepsmiles[i-1] in bondchars else ""
100             if deepsmiles[i+1] == '(':
101                 closebracket = deepsmiles.find(')', i+2)
102                 if closebracket == -1:
103                     raise exceptions.DecodeError(deepsmiles, i, "'%(' is missing the corresponding close parenthesis")
104                 digit = int(deepsmiles[i+2:closebracket])
105                 i = closebracket
106             else:
107                 try:
108                     digit = int(deepsmiles[i+1:i+3])
109                 except ValueError:
110                     raise exceptions.DecodeError(deepsmiles, i, "'%' should be followed by two digits")
111                 i += 2

The smiles string that is returned in this case is incorrect.

johnmay commented 3 years ago

You can’t have more than 99 rings in a smiles, if there are three or more digits it is read as multiple ring closures:

Several toolkits support the extension of brackets to allow >99 rings. %(999). In practise the number of molecules that need these is very very small unless the smiles is generated in a bad order and visits chains before ring (which I guess is possible for a generator).

chrisjonesBSU commented 3 years ago

I see, that makes sense. I'll have to look into the use of brackets.

unless the smiles is generated in a bad order and visits chains before ring (which I guess is possible for a generator).

This must be what is happening with my workflow then, as the molecules seem to generate correctly, even with >100 ring closures.

Thanks for the advice, I'm still relatively new/unfamiliar with SMILES.

baoilleach commented 3 years ago

I think @johnmay has cleared this up. Check the SMILES documentation (http://opensmiles.org/opensmiles.html#ringclosure) if in doubt.