dan2097 / opsin

Open Parser for Systematic IUPAC Nomenclature. Chemical name to structure conversion
https://opsin.ch.cam.ac.uk
MIT License

Bug for stereochemistry assignment? #217

Open simonmb opened 1 year ago

simonmb commented 1 year ago

I am no expert in IUPAC naming, but for the following name: (1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE

I get the following error: Failed to assign CIP stereochemistry, this indicates a bug in OPSIN or a limitation in OPSIN's implementation of the sequence rules

dan2097 commented 1 year ago

This is unfortunately a "limitation in OPSIN's implementation of the sequence rules". The symmetry of the molecule means that the stereocentre at position 4 is dependent on the configuration of the other two stereocentres. (This is also why the 's' in '4s' is lower case, while the other two are upper case; only at position 4 are two of the groups attached to the stereocentre constitutionally the same.)

Due to the complexity of the CIP rules, when I find the time I'm considering integrating https://github.com/SiMolecule/centres to handle this type of stereochemistry.

simonmb commented 1 year ago

That would be really helpful, as a lot of the (especially older) datasets I would like to annotate with structures use this type of name.

simonmb commented 1 year ago

Would it maybe be possible, in a first step, to give a warning that says stereo information of this type was ignored but still get a structure?

dan2097 commented 1 year ago

This can be done via the API as follows:

NameToStructure nts = NameToStructure.getInstance();
NameToStructureConfig n2sconfig = NameToStructureConfig.getDefaultConfigInstance();
n2sconfig.setWarnRatherThanFailOnUninterpretableStereochemistry(true);
OpsinResult o = nts.parseChemicalName("(1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE", n2sconfig);
System.out.println(o.getMessage());
System.out.println(o.getSmiles());

From the command-line, this behaviour can be enabled using the -s or --allowUninterpretableStereo flag.

The output is C[C@@H]1[C@@H](CC(C1)C)C

simonmb commented 1 year ago

Just one follow-up question: is there a way to use the OPSIN jar file to get information on the stereochemistry contained in a name, such as relative cis/trans, cis/trans or enantiomeric information, even if OPSIN is not able to parse it yet? Just to know what information is available in the name.

dan2097 commented 1 year ago

Tangentially, given an input like rel-(1R,2S,5R)-2-Isopropyl-5-methylcyclohexanol, if OPSIN is asked for extended SMILES (-o extendedsmi on the command-line), this relative stereochemistry is indicated in the extended SMILES C(C)(C)[C@H]1[C@@H](C[C@@H](CC1)C)O |$_AV:1;;;2;1;6;5;4;3;1;O$,o1:3,4,6| via the OR stereo group o1:3,4,6. I don't appear to have implemented this for cis/trans stereochemistry.
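For anyone post-processing that output, here is a minimal sketch of pulling the stereo groups out of an extended SMILES string using only the Java standard library. The class and method names (CxSmilesStereoGroups, stereoGroups) are made up for illustration and are not part of OPSIN. It assumes the usual extended SMILES convention that `o<n>:` introduces an OR group and `&<n>:` an AND group, each followed by zero-based atom indices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OPSIN: reads enhanced-stereo groups
// from the |...| extension block of an extended SMILES string.
final class CxSmilesStereoGroups {

    // Matches groups such as "o1:3,4,6" (OR) or "&2:0,5" (AND).
    private static final Pattern GROUP = Pattern.compile("([o&])(\\d+):(\\d+(?:,\\d+)*)");

    /** Returns descriptions like "OR group 1: atoms [3, 4, 6]" (atom indices are zero-based). */
    static List<String> stereoGroups(String extendedSmiles) {
        List<String> groups = new ArrayList<>();
        int start = extendedSmiles.indexOf('|');
        int end = extendedSmiles.lastIndexOf('|');
        if (start < 0 || end <= start) {
            return groups; // plain SMILES, no extension block
        }
        String block = extendedSmiles.substring(start + 1, end);
        // Drop $...$ atom-label sections so their contents are never mistaken for groups.
        block = block.replaceAll("\\$[^$]*\\$", "");
        Matcher m = GROUP.matcher(block);
        while (m.find()) {
            String kind = m.group(1).equals("o") ? "OR" : "AND";
            groups.add(kind + " group " + m.group(2) + ": atoms [" + m.group(3).replace(",", ", ") + "]");
        }
        return groups;
    }

    public static void main(String[] args) {
        String cx = "C(C)(C)[C@H]1[C@@H](C[C@@H](CC1)C)O |$_AV:1;;;2;1;6;5;4;3;1;O$,o1:3,4,6|";
        System.out.println(stereoGroups(cx));
    }
}
```

An empty list means the string carries no OR/AND groups. For real work, a chemistry toolkit's own extended SMILES parser is likely to be more robust than this regex.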

In answer to your question, if you use the -v flag, OPSIN will output its parse tree at various stages of processing. For (1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE the following is output:

<molecule name="(1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE">
  <wordRule wordRule="simple" type="full" value="(1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE">
    <word type="full" value="(1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE">
      <substituent>
        <stereoChemistry type="stereochemistryBracket">(1S,2R,4s)-</stereoChemistry>
        <locant>1,2,4-</locant>
        <multiplier type="basic" value="3">tri</multiplier>
        <group type="chain" subType="alkaneStem" value="C" labels="numeric" usableAsAJoiner="yes">meth</group>
        <suffix type="inline" value="yl">yl</suffix>
      </substituent>
      <root>
        <cyclo value="cyclo">cyclo</cyclo>
        <group type="chain" subType="alkaneStem" value="CCCCC" labels="numeric" usableAsAJoiner="yes">pent</group>
        <unsaturator value="1" locant="1">ane</unsaturator>
      </root>
    </word>
  </wordRule>
</molecule>

which later becomes:

<molecule name="(1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE">
  <wordRule wordRule="simple" type="full" value="(1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE">
    <word type="full" value="(1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE">
      <substituent multiplier="3" locant="1,2,4">
        <stereoChemistry locant="1" type="RorS" value="S" stereoGroup="Abs">1S</stereoChemistry>
        <stereoChemistry locant="2" type="RorS" value="R" stereoGroup="Abs">2R</stereoChemistry>
        <stereoChemistry locant="4" type="RorS" value="S" stereoGroup="Abs">4s</stereoChemistry>
        <multiplier type="basic" value="3">tri</multiplier>
        <group type="chain" subType="alkaneStem" value="C" labels="numeric" usableAsAJoiner="yes">meth</group>
        <suffix type="inline" value="yl">yl</suffix>
      </substituent>
      <root>
        <group type="ring" subType="alkaneStem" value="C1CCCC1" labels="numeric">pent</group>
        <unsaturator value="1" locant="1">ane</unsaturator>
      </root>
    </word>
  </wordRule>
</molecule>

Stereochemistry is applied as one of the final steps, as the whole chemical structure may be needed to resolve Cahn-Ingold-Prelog stereochemistry.
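Since the parse tree is plain XML, one way to see which stereochemistry a name carries is to collect its stereoChemistry elements. The sketch below assumes the verbose output has already been captured as a string (how to capture it is not shown) and uses only the JDK's built-in XML parser. ParseTreeStereo and stereoTypes are hypothetical names, not OPSIN API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of OPSIN: lists the stereoChemistry
// elements found in a parse tree that was captured as an XML string.
final class ParseTreeStereo {

    /** Returns one "type:text" entry per stereoChemistry element, in document order. */
    static List<String> stereoTypes(String parseTreeXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(parseTreeXml)));
        NodeList nodes = doc.getElementsByTagName("stereoChemistry");
        List<String> found = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            found.add(e.getAttribute("type") + ":" + e.getTextContent());
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        // Trimmed-down fragment of the later parse tree shown in this thread.
        String xml = "<substituent multiplier=\"3\" locant=\"1,2,4\">"
                + "<stereoChemistry locant=\"1\" type=\"RorS\" value=\"S\">1S</stereoChemistry>"
                + "<stereoChemistry locant=\"2\" type=\"RorS\" value=\"R\">2R</stereoChemistry>"
                + "<stereoChemistry locant=\"4\" type=\"RorS\" value=\"S\">4s</stereoChemistry>"
                + "</substituent>";
        System.out.println(stereoTypes(xml));
    }
}
```

An empty result only means the tokenised name had no explicit stereo descriptor; names that imply stereochemistry without stating it would not show up this way.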

simonmb commented 1 year ago

Thank you very much! Would implementing the verbose output for cis and trans be very time consuming, or does the code already just discard this info and it would just be a matter of printing?

I am looking for a reliable way to find out whether an IUPAC name has some type of stereochemistry information (Z, E, R, S, cis, trans, @, @@). Do you think this could be done with a regex? I would not know how to cover all possible options, which is why I wanted to use OPSIN for it.

dan2097 commented 1 year ago

The verbose output does also handle cis/trans e.g. <stereoChemistry type="cisOrTrans" value="trans" subsequentUnsemanticToken="-">trans</stereoChemistry>

I think the full list of types is: EorZ, RorS, cisOrTrans, alphaOrBeta, relativeCisTrans, opticalRotation, endoExoSynAnti, axial, RAC, REL, dlStereochemistry, carbohydrateConfigurationalPrefix (and also stereochemistryBracket, although that's just converted to one or more of these).

I hadn't really anticipated this use case, so accessing the intermediate parse trees (which is what the verbose output is showing) is not implemented that elegantly from either the API or the command-line. While in principle you should be able to do this with regular expressions, I think the expression may end up very complicated (cf. https://github.com/dan2097/opsin/blob/9fe64611808288765c524c669adc66c11330e855/opsin-core/src/main/resources/uk/ac/cam/ch/wwmm/opsin/resources/regexTokens.xml#L75C9-L75C9 ). A regular-expression-based approach does have the advantage that it can be applied even to chemical names OPSIN doesn't recognize, e.g. stereochemical modifications of trivial names.
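To give a feel for the regular expression route, here is a deliberately simplified sketch. It is not OPSIN's grammar, and the class name and pattern are illustrative assumptions. It only looks for a few explicit descriptors: bracketed CIP/E-Z descriptors, a handful of word prefixes, optical rotation signs and D-/L- prefixes. It will miss other descriptor types (alpha/beta, for example) and any name whose stereochemistry is only implied.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OPSIN: a rough check for explicit
// stereo descriptors in a chemical name. Far from exhaustive.
final class StereoDescriptorScan {

    private static final Pattern DESCRIPTOR = Pattern.compile(
            // word prefixes, case-insensitive, as whole words only
            "(?i:\\b(?:cis|trans|rel|rac|endo|exo)\\b)"
            // bracketed descriptors such as (R), (1S,2R,4s) or (2E,4Z)
            + "|\\((?:\\d+[a-z]?'*)?[RSEZrs](?:,(?:\\d+[a-z]?'*)?[RSEZrs])*\\)"
            // optical rotation: (+) or (-)
            + "|\\([+-]\\)"
            // D-, L- or DL- not preceded by another letter
            + "|(?<![A-Za-z])(?:DL|D|L)-");

    /** True if the name contains at least one of the descriptors covered by the pattern. */
    static boolean hasStereoDescriptor(String name) {
        return DESCRIPTOR.matcher(name).find();
    }

    public static void main(String[] args) {
        String[] names = {
            "(1S,2R,4s)-1,2,4-TRIMETHYLCYCLOPENTANE",
            "trans-but-2-ene",
            "(-)-camphor",
            "1,2,4-trimethylcyclopentane",
            "cholestane" // stereochemistry is implied by the name, so the regex cannot see it
        };
        for (String name : names) {
            System.out.println(name + " -> " + hasStereoDescriptor(name));
        }
    }
}
```

The cholestane case shows the main weakness: a text scan can only find descriptors that are actually written in the name.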

While IUPAC names normally apply stereochemistry information to a skeleton that lacks it, there are many names, especially in biochemistry, where the name implicitly conveys stereochemistry, e.g. cholestane.

Stereochemistry indicated via optical rotation prefixes like (+) and (-) is recognized by OPSIN but not applied to the structure, as the relationship between this physical phenomenon and the structure cannot be easily deduced.

Assuming you do want to consider all cases where the molecule is implied to have a known stereochemical configuration, I think the simplest way to achieve this is probably from the Java API, e.g.

import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
import uk.ac.cam.ch.wwmm.opsin.NameToStructureConfig;
import uk.ac.cam.ch.wwmm.opsin.OpsinResult;
import uk.ac.cam.ch.wwmm.opsin.OpsinResult.OPSIN_RESULT_STATUS;
import uk.ac.cam.ch.wwmm.opsin.OpsinWarning;
import uk.ac.cam.ch.wwmm.opsin.OpsinWarning.OpsinWarningType;
import uk.ac.cam.ch.wwmm.opsin.ParsingException;

public class HasStereochemistry {

    public static void main(String[] args) throws ParsingException {
        NameToStructure nts = NameToStructure.getInstance();
        NameToStructureConfig config = new NameToStructureConfig();
        config.setWarnRatherThanFailOnUninterpretableStereochemistry(true);
        OpsinResult result = nts.parseChemicalName("(-)-camphor", config);
        if (result.getStatus() != OPSIN_RESULT_STATUS.FAILURE) {
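            String smiles = result.getSmiles();
            // '@' marks tetrahedral stereocentres; '/' and '\' mark double-bond stereochemistry
            boolean hasStereochemistry = smiles.contains("@") || smiles.contains("/") || smiles.contains("\\");
            // Stereochemistry recognized in the name but not applied to the structure is reported as a warning
            for (OpsinWarning warning : result.getWarnings()) {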
                if (warning.getType() == OpsinWarningType.STEREOCHEMISTRY_IGNORED) {
                    hasStereochemistry = true;
                }
            }
            System.out.println(hasStereochemistry);
        }
    }
}

If, on the other hand, you're trying to determine whether the structure is capable of having multiple stereoisomers, you should be able to determine this in a chemistry toolkit from the SMILES, regardless of whether OPSIN outputs it with stereochemistry. As a minor caveat, exotic stereochemistry like axial stereochemistry may not be obvious to toolkits from the structure.