SiMolecule / centres

Perception and labelling of stereogenic centres in chemical structures
BSD 2-Clause "Simplified" License
16 stars 3 forks source link

How to properly handle non-elemental labels consistently? #9

Closed iclkevin closed 1 year ago

iclkevin commented 1 year ago

If a non-elemental label that cannot be expanded is present in a structure, for instance someone types "place-substituent-here" as an atom label in a drawing, the CENTRES library can be inconsistent on features that depend on the non-elemental atom in their CIP digraph.

For the SMILES O[C@H](C(N)C)C(C)O, if I change the nitrogen label to something else, when creating the BaseMol child, I return 0 for the getAtomicNum() function for that atom. If I then shuffle the getBonds() Iterable in the BaseMol child, I will get a mix of R and null results for the chiral center. If I return -1 for the getAtomicNum() function for the non-elemental label, I will always get R. I believe the proper result should always be null or not a stereocenter.

I think this should be supported consistently, as there may be stereochemical features in parts of the structure that are fully resolvable via localized evaluation of CIP digraphs, and a non-elemental label on the other side of the structure shouldn't stop the evaluation of the clearly defined features. So I don't think we should simply cancel evaluation for structures with a non-elemental label.

What do you suggest is the best way to handle these cases? Please let me know if you need more information.

iclkevin commented 1 year ago

The SMILES string seems to contain markup sensitive to Github, I am going to try to put it in code tags:

O[C@H](C(N)C)C(C)O

johnmay commented 1 year ago

So it should be undefined, O[C@H](C(*)C)C(C)O the issue is similar to the other #6 where by a wildcard atom can be less or great when comparing to an atom. For example:

O[C@H](C([NH])C)C(C)O => R (because N < O)
O[C@H](C([Si])C)C(C)O => S (because Si > O)

Note if the tie can be split before it hits a '*' it is unambigous:

CCC[C@H](O)CC* => ambiguous
CNC[C@H](O)CC* => un-ambiguous N > C in the layer before
CCC[C@H](O)CN* => un-ambiguous N > C in the layer before

It should not be affected by ordering of bonds, to clarify do you change this after the fact or on input?

Incase I hadn't mentioned yet I would not these routines for SMILES canonicalisation etc and purely just naming/visual aid on depictions. There are much more efficient algorithms for doing canonical stereo that doesn't need to confirm to IUPAC's (dare I say arbitrary) rules.

iclkevin commented 1 year ago

I was only providing the example in SMILES, I can provide it in MOL as well. I am merely trying to implement the BaseMol child for our own CIP analysis. We aren't inputting SMILES or MOL into the command line functions. So if such a non-elemental label is present, I am not sure what the BaseMol.getAtomicNum() function should return, and if the ordering of the bonds is shuffled before labelling, the output is inconsistent when 0 is returned by the BaseMol.getAtomicNum() function or R when -1 is returned. I think it should label the center as null in all cases.

iclkevin commented 1 year ago

Just to clarify, the chiral center is the one receiving the CIP label, and that has a legitimate elemental label. When evaluating the CIP digraph for the chiral center, if a substituent node has a non-elemental label, and there is CIP equivalency amongst the branches up to that point, then I believe the CENTRES algorithm should stop and not provide any CIP label to the chiral center.

johnmay commented 1 year ago

Found the issue, bit more involved that the other one. I may also try and tackle the polymer issue as well.

iclkevin commented 1 year ago

I apologize for making you work on a weekend. I am happy to help out, please let me know if you need me to test anything.

johnmay commented 1 year ago

Ah it's all good, the generic/wildcard it was kind of added as an after-thought. One option was just to throw an exception but really it would be nicer if we name the centres we can and leave those without enough info as undefined.

iclkevin commented 1 year ago

All of my tests pass with the latest source. Thank you, John.

johnmay commented 1 year ago

Thank you for finding the issues, these corner cases are always interesting.

johnmay commented 1 year ago

I'm going to fix the polymer issue as well for CDK and then will do a release. It's a bit hard to provide an abstract API to wrap this but it shouldn't be too hard to support this on your data structures too. Basically if you have a polymer (Head-to-tail) like this: *(C(=O)C[C@H](O)CC)* we can repeat it three times and then take the label from the middle repeat unit.

iclkevin commented 1 year ago

In ChemDoodle, all repeat units are expanded before CIP analysis. Egon may have a better idea, but if you do not want to keep track of which atoms should be mapped back to the original structure, you can attach the two polymer ends together with a bond to create a periodic structure. The bond would have a special parameter, so you would know when you crossed it during CIP digraph growth. Each node in the CIP digraph, when sprouting, would then keep track of a counter (starting at 0) for which period it is in (across the x axis let's say) and inherits this value from its parent node. If you cross the periodic bond, increment the node counter. If your counter hits 2, create a dummy node, as you would in a non-periodic cyclic structure. Now the CIP digraph will be able to automatically resolve periodic structures in as many dimensions as you have repeat groups defined.

This may also work better for pseudo-asymmetric centers, where they depend on a configuration in the next period. If you just extend the polymer length to 3 units, then your end units may not have the configurations defined required for the pseudo-asymmetric center to resolve in the center unit. I cannot think of a good example though.

johnmay commented 1 year ago

I tried the cyclisation trick first but given these is special handling in CIP for ring closures and I'm also not convinced of the pseudo-asymmetric cases - I would rather give no answer than a wrong one hence the 3 repeats.

*C[C@H](O)CC[C@H](O)C* => C1[C@H](O)CC[C@H](O)C1

Best

iclkevin commented 1 year ago

I trust your decision on these topics :).