OpenBioSim / biosimspace

An interoperable Python framework for biomolecular simulation.
https://biosimspace.openbiosim.org
GNU General Public License v3.0
71 stars 11 forks source link

Matching atoms and merging ring systems (fused rings, different hybridisation states) - issues #103

Open annamherz opened 1 year ago

annamherz commented 1 year ago

Hello! I'm currently matching and merging molecules for MCL1 ligands. The basic code used to replicate these is:

# ligands
lig0 = "lig_27"
lig1 = "lig_42"
# options
prematch = {}
ring_size = False
ring_break = False
complete_ring = True

lig_0 = BSS.IO.readMolecules([f"{lig0}.rst7", f"{lig0}.prm7"])
lig_1 = BSS.IO.readMolecules([f"{lig1}.rst7", f"{lig1}prm7"])

mapping = BSS.Align.matchAtoms(
                            lig_0, lig_1,
                            complete_rings_only=complete_ring,
                            prematch=prematch,
                            )    

inv_mapping = {v: k for k, v in mapping.items()}
lig_1_a = BSS.Align.rmsdAlign(lig_1, lig_0, inv_mapping)

merged_ligands = BSS.Align.merge(lig_0, lig_1_a, mapping,
                                allow_ring_breaking=ring_break,
                                allow_ring_size_change=complete_ring
                                  )

With the files named as the ligands attached. ligands.zip

There are some different situations that occur in relation to ring systems, outlined below.

Case 1

In the most basic case, as in the case of some ligands (lig_27~lig_42), this does not proceed correctly initially. Many of the atoms are included in the perturbable region and some of the shared atoms are not included in the MCS. image

When the mapping dictionary {21:6} is added as prematch, this is then able to proceed okay. Prematch for this ligand series is chosen so the N in the core region match. image Notably, this is able to proceed with ring_size=False and ring_break=False and the ring is mapped correctly to the fused ring system, which is not the case in Case 2.

Case 2

However in a another case, lig_27~lig_43, when the mapping dictionary{ 21:26 } which is also the N is added, this is not able to proceed without setting ring_size=True, ring_break=True . In this case then the resulting perturbed region is like below: image I think ideally, with the introduced mapping, this should still be able to proceed with ring_size and ring_break set as False, and the whole ring region in this case is taken as the perturbed region in each case. So for lig_27 the phenyl ring and for lig_43 the entire fused ring system. Then if these were both set to True, in that case the phenyl ring should ideally be mapped onto an entire ring in the fused phenyl as opposed to split between the two fused rings. The difference between lig_42 and lig_43 in this case is minor, with lig_42 being a 1-napthyl group and lig_43 being a 2-napthyl group, but this causes a large difference in the mapping.

Case 3

Here, lig_43~lig_45, there is a change of hybridisation state with the rings. Again, without prematch, the core region of the molecule is ignored: image

And when prematch {26:6} is introduced, with ring_size=False, ring_break=False it then maps the differently hybridised C to each other. The differently hybridised atoms should not map to each other as this is not great for the free energy perturbation. image

I was wondering if there would be some way to programmatically address the mapping and merging ring perturbations so the intended perturbable molecule can be obtained? I guess the atoms could also be removed manually from the generated mapping that is passed into the merge function, however this requires quite some time per perturbation to do manually as each atom needs to be inspected then.

Thank you!

lohedges commented 1 year ago

Thanks, I'll take a closer look when I get a chance. We had quite a few tests for weird ring-size and ring-breaking edge cases, but unfortunately most of these now need to be marked as xfail due to changes in the RDKit API generating different mappings. If I work out what's going on here, then I'll add these systems as examples. It's quite hard to figure out all of the possible ways in which this can/can't occur and essentially involves manually looking at the connectivity of the molecule (as based on the mapping) and working things out.

Out of interest, do you have the original SDF files used to load the ligands, i.e. prior to parameterisation? I wonder whether these would give the same mappings. (The SDF specific attributes would be retained as properties of the molecules.) We now use Sire's built in to_rdkit() functionality when performing the MCS, but this might be more robust if going via a molecule originally loaded from SDF, e.g. if stererochemistry is important, or can't be inferred any other way.

Just to note that when converting to RDKit with Sire (say for ligands 27 and 42) I get valence warnings, and if I attempt to compute a mapping using the LomaAtomMapper from openfe then I get the same warning and no mapping.

In [1]: import sire as sr

In [2]: from gufe import SmallMoleculeComponent

In [3]: from openfe.setup.atom_mapping import LomapAtomMapper

In [4]: m0 = sr.load("ligands/*27*")

In [5]: m1 = sr.load("ligands/*42*")

In [6]: rdm0 = sr.convert.to_rdkit(m0)
[15:10:12] Explicit valence for atom # 1 C, 5, is greater than permitted

In [7]: rdm1 = sr.convert.to_rdkit(m1)
[15:10:14] Explicit valence for atom # 8 C, 5, is greater than permitted

In [8]: openfe_m0 = SmallMoleculeComponent(rdm0)

In [9]: openfe_m1 = SmallMoleculeComponent(rdm1)

In [10]: mapper = LomapAtomMapper()

In [11]: mapping_gen = mapper.suggest_mappings(openfe_m0, openfe_m1)

In [12]: mapping = next(mapping_gen)
[15:10:53] Explicit valence for atom # 1 C, 5, is greater than permitted
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:1                                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
StopIteration
annamherz commented 1 year ago

Out of interest, do you have the original SDF files used to load the ligands, i.e. prior to parameterisation?

Yep, here are the sdf files: sdf_ligands.zip I parameterised them with the OpenFF Sage.

I just tried running the above with the sdf files and didn't get any errors when converting them to rdkit or mapping.

I guess I've also been using the older BSS; I'll run the above again through the newer BSS and see if any of the mapping problems persist!

annamherz commented 1 year ago

I've run the initial code with the most recent version of the devel, which should use the sr.convert.to_rdkit internally as far as I can see. I still have the same issues, even when using the sdf files instead. I think the mapping issue is more a result of using the rdFMCS.FindMCS :

        # Generate the MCS match.
        mcs = _rdFMCS.FindMCS(
            mols,
            atomCompare=_rdFMCS.AtomCompare.CompareAny,
            bondCompare=_rdFMCS.BondCompare.CompareAny,
            completeRingsOnly=complete_rings_only,
            ringMatchesRingOnly=True,
            matchChiralTag=False,
            matchValences=False,
            maximizeBonds=False,
            timeout=timeout,
        )

as LomapAtomMapper() gave a different mapping when I ran it. Eg for lig_27~lig_42:

# bss
{0: 34, 1: 10, 2: 9, 3: 8, 4: 7, 5: 12, 6: 11, 7: 35, 8: 36, 37: 33}
# atom mapper
LigandAtomMapping(componentA=SmallMoleculeComponent(name=MOL), componentB=SmallMoleculeComponent(name=MOL), componentA_to_componentB={0: 39, 1: 19, 2: 18, 3: 17, 4: 16, 5: 25, 6: 20, 7: 21, 8: 24, 9: 0, 10: 1, 11: 26, 12: 27, 13: 2, 14: 28, 15: 29, 16: 3, 17: 30, 18: 31, 19: 4, 20: 5, 21: 6, 22: 7, 23: 8, 24: 9, 25: 10, 26: 11, 27: 12, 28: 36, 29: 35, 30: 34, 31: 33, 32: 32, 33: 13, 34: 14, 35: 15, 36: 37, 37: 38}, annotations={})
lohedges commented 1 year ago

Okay, thanks for letting me know. I'll compare options with the LomapMapper. Our rdFMCS settings originally were chosen for consistency with FESetup, so may not be optimal. I also think that openfe only matches heavy atoms, then uses some internal logic of its own to map other atoms afterwards.

lohedges commented 1 year ago

(Sorry, forgot to say that LOMAP calls the same RDKit MCS matcher internally, but likely using different options.)

annamherz commented 1 year ago

When I have some time I can also try cycling some of my perturbations through different options to see how that changes the perturbed region. I'm wondering if it would be worth checking the number of perturbing atoms after the MCS has run, and if this is above a certain number (which is generally how I noticed that the mapping was not ideal) running it through the MCS again with different options? Depending on how the other options impact the mapped region

annamherz commented 1 year ago

Hello! I have an update - I've managed to look through some of the different options here:

        # Generate the MCS match.
        mcs = _rdFMCS.FindMCS(
            mols,
            atomCompare=_rdFMCS.AtomCompare.CompareAny,
            bondCompare=_rdFMCS.BondCompare.CompareAny,
            completeRingsOnly=complete_rings_only,
            ringMatchesRingOnly=True,
            matchChiralTag=False,
            matchValences=False,
            maximizeBonds=False,
            timeout=timeout,
        )

I've found that, for the perturbations above (lig_27\~lig_42, lig_27\~lig_43, lig_43\~lig_45), having completeRingsOnly set to False does reduce the number of perturbed atoms, however results in same problems as outlined above for when prematch is used. Changing RingMatchesRingOnly to False adds some H to the perturbable region for lig_43\~lig_45 that should be mapped. Changin the other options ( matchChiralTag, matchValences, or maximizeBonds ) do not result in the correct mapping either.

Using atomCompare=_rdFMCS.AtomCompare.CompareAnyHeavyAtom or bondCompare=_rdFMCS.BondCompare.CompareOrder or bondCompare=_rdFMCS.BondCompare.CompareOrderExact also don't really improve the mappings for these three perturbations.

lohedges commented 1 year ago

Thanks for the update. I've looked into swappaing out our MCS with the LomapMapper from openfe but at present it is not reliable enough. This is largely due to only being able to resolve a fairly old version of openfe due to other requirements in our environment. The change was trivial to implement, though, so I plan to revisit at a later date to see if it works more reliably.