OpenBioSim / biosimspace

An interoperable Python framework for biomolecular simulation.
https://biosimspace.openbiosim.org
GNU General Public License v3.0
71 stars 11 forks source link

[BUG] BSS.Convert.toRDKit() appears to lose stereochemical information #330

Open noahharrison64 opened 1 month ago

noahharrison64 commented 1 month ago

Hey again,

I think I've encountered a bug with the Convert module (or I might be overlooking a crucial bit of code). Converting from a BioSimSpace / Sire mol to an rdmol via BSS.Convert.toRDKit() appears to lose stereochemical info.

To Reproduce Steps to reproduce the behavior:

from rdkit import Chem
import BioSimSpace as BSS

# Define enantiomer
a = 'CC[C@@H](O)C'

# Check stereochemistry of the molecules
a

# Parameterise
bs_a = BSS.Parameters.gaff(a).getMolecule()

# Check the stereochemistry is retained in the parameterised molecules 
bs_a._sire_object.view()

# Convert back to rdkit
rd_a = BSS.Convert.toRDKit(bs_a)
rd_a

Expected behavior It seems to me like there is no reason for the stereochemistry to be removed. The reason this has become an issue for me is because I'm using an rdkit-based flexible alignment method rather than rmsd/rmsdAlign, and so working with rdkit molecules that exhibit correct stereochemical centres is important. Perhaps I've missed something in the docs that'll help me achieve this.

(please complete the following information): OS: [e.g. Linux (Ubuntu 20.04.6 LTS (Focal Fossa) Version of Python: Python 3.12.4 Version of BioSimSpace: 2024.2.0 I confirm that I have checked this bug still exists in the latest released version of BioSimSpace: yes

Cheers, Noah

lohedges commented 1 month ago

Hi there. What's b?

NameError: name 'b' is not defined
noahharrison64 commented 1 month ago

Oops sorry was working with two molecules then cut one out when I wrote this post. Have edited the original post

lohedges commented 1 month ago

I don't think your assertion that the parametrised molecule retains the stereochemistry is correct. If I do the following then everything is fine:

import BioSimSpace as BSS

# Create a molecule directly from the SMILES string.
bss_mol = BSS.Convert.smiles("CC[C@@H](O)C")

# Convert to RDKit.
rdkit_mol = BSS.Convert.toRDKit(bss_mol)

# Check the SMILES representation.
print(Chem.MolToSmiles(rdkit_mol))
'[H]O[C@@]([H])(C([H])([H])[H])C([H])([H])C([H])([H])[H]'

Converting to RDKit uses the Sire-to-RDKit converter behind the scenes. If required attributes are missing, then stereochemistry is preserved by using an algorithm from MDAnalysis, which we have ported and tested reasonably well. However, in your case you could just simply use a BioSimSpace starting molecule that has been created from SMILES, then the required attributes should be preserved throughout, for example:

from rdkit import Chem
import BioSimSpace as BSS

# Define enantiomer
a = 'CC[C@@H](O)C'

# Check stereochemistry of the molecules
a

# Parameterise
bs_a = BSS.Parameters.gaff(BSS.Convert.smiles(a)).getMolecule()

# Check the stereochemistry is retained in the parameterised molecules 
bs_a._sire_object.view()

# Convert back to rdkit
rd_a = BSS.Convert.toRDKit(bs_a)
rd_a

# Print the SMILES representation.
print(Chem.MolToSmiles(rd_a))
'[H]O[C@@]([H])(C([H])([H])[H])C([H])([H])C([H])([H])[H]'

Note the only difference here is:

bs_a = BSS.Parameters.gaff(BSS.Convert.smiles(a)).getMolecule()

This means that the molecule passed in has the required property keys, which are preserved on return, i.e.:

smiles_mol = BSS.Convert.smiles(a)
print(smiles_mol._sire_object.propertyKeys())
['chirality',
 'hybridization',
 'isotope',
 'is_aromatic',
 'coordinates',
 'mass',
 'formal_charge',
 'connectivity',
 'element']

If you just pass in a SMILES string, then these are only attached to an intermediate molecule used in the calculation. I think I might be able to update this so that they are also added. The original SMILES parsing for the parameterisation functions didn't use the internal toRDKit functionality, since it wasn't available. Instead it went via OpenFF, so there was no intermediate BioSimSpace molecule. Will see what I can do since it would be nice if this worked either way.

lohedges commented 1 month ago

This is now fixed in #331.

noahharrison64 commented 1 month ago

Thanks for looking into this.

Regarding parameterised mols not containing stereo-information, how are you able to tell this? When I look at the molecule in 2D (._sire_object.view2d()) it looks like there is a wedge bond.

I'm working with SDFs and only used the butanol smiles as an example. Forgive me if I'm misinterpreting your response, or if your fix tackles SDF mols too, but it seems to me that this issue extends beyond smiles parsing.

For example:

from rdkit import Chem
import BioSimSpace as BSS

smi = 'CC[C@@H](O)C'

# Process smiles in rdkit
mol = Chem.MolFromSmiles(smi)
mol = Chem.AddHs(mol)
Chem.AllChem.EmbedMolecule(mol)

# Write to SDF
Chem.SDWriter('mol.sdf').write(mol)

# Load SDF
bs_mol = BSS.IO.readMolecules('mol.sdf')

# Parameterise
bs_mol_param = BSS.Parameters.gaff(bs_mol.getMolecule(0)).getMolecule()

# Convert to rdkit
rd_conv = BSS.Convert.to(bs_mol_param, "rdkit")

# Check smiles for stereochem
Chem.MolToSmiles(rd_conv)
'[H]OC([H])(C([H])([H])[H])C([H])([H])C([H])([H])[H]'

Every chance I'm missing something here but it feels to me like this could be a slightly different issue to what your fix concerns.

Cheers

lohedges commented 1 month ago

Hmm, I see. It's strange that things look okay using the view2d function, since that's literally using Sire to convert to RDKit behind the scenes. Looking at the code here I see:

        # assign stereochemistry to the rest, and also remove
        # 3D conformers as they mess up the 2D view
        try:
            for r in rdkit_mols:
                # assign the stereochemistry from the structure
                try:
                    from rdkit.Chem.rdmolops import AssignStereochemistryFrom3D

                    AssignStereochemistryFrom3D(r)
                except Exception:
                    # does not matter if this fails
                    pass

                # we also don't want any conformers,
                # as these mess up the 2D view
                r.RemoveAllConformers()
        except Exception:
            try:
                from rdkit.Chem.rdmolops import AssignStereochemistryFrom3D

                AssignStereochemistryFrom3D(rdkit_mols)
            except Exception:
                # does not matter if this fails
                pass

So, it looks like it's using RDkit to apply stereochemistry to the structure using the Python API. Presumably this isn't done in the converter, which uses the C++ API. (I see no equivalent function call.)

No, the fix won't deal with SDF, although SDF attributes should be preserved in the parametrerised molecule and used in the conversion to RDKit.

lohedges commented 1 month ago

In this sense, the view is a bit misleading since the stereochemistry is not part of the BioSimSpace molecule and wouldn't (necessarily) be present when using the Sire-to-RDKit converter.

noahharrison64 commented 1 month ago

Interesting, I wasn't aware of that RDKit function - that's actually very useful. Managed to get around my issues by using it post-conversion, since the 3D information is retained (even if chiral centres weren't being explicitly defined).

lohedges commented 1 month ago

Great, glad it was helpful.