How to copy the coordinates from a non-parameterised molecules to a parameterised molecule which has no coordinates

DreamCykl commented 8 months ago

I have a use case of BioSimSpace that takes an SDF as input, parameterises the molecule using the GAFF forcefield, writes out just a grotop file. In truth, I do write a PDB file too but I want to only store the grotop files. This way I can easily pull the grotop file if I want to re-parameterise the molecule (without paying the cost of re-running parameterisation) with a different starting set of coordinates (i.e. the PDB file written after parameterisation is not relevant). To achieve this I need to be able to read in a new set of coordinates (from an SDF) and combine these with the parameters of the existing grotop file.

sdf_mol = BSS.IO.readMolecules(sdf).getMolecule(0) grotop_mol = BSS.IO.readMolecules(grotop).getMolecule(0)

I saw the makeCompatibleWith() method but found I couldn't do sdf_mol.makeCompatibleWith(grotop_mol) and going the other way doesn't save coordinates from the sdf. How would you suggest getting the coordinates from the SDF and applying it to the a grotop_mol?

lohedges commented 8 months ago

If you weren't using SDF, then this would be trivial since you can load the GroTop along with another file, e.g. a PDB, and it will apply the additional properties from the PDB file to the system set up with the topology from the GroTop. SDF is treated as a topology format and you can't parse with two topologies, i.e. it doesn't know which one to start from.

I'll have a look at why makeCompatibleWith isn't working, since that would have been my suggestion. It's trivial to apply new coordinates in the Sire layer, so I'll post a snippet showing how to do that if I can't get it working.

DreamCykl commented 8 months ago

I'd be happy to copy coordinates via sire if necessary, I'm just a bit unsure about how to match atoms before copying coordinates (incase atom names aren't retained)

lohedges commented 8 months ago

When I'm back next week I'll figure out a way of cleanly exposing our private updateCoordinatesAndVelocities function. This is used internally by the process objects to map updated coordinates and velocities from trajectory or restart files into the original topology. This handles reordering and name changes (with some limitations) and allows you to control the molecule and atom mapping between representations. This might help achieve what you want in a more general way.

DreamCykl commented 8 months ago

Okay thank you!

lohedges commented 7 months ago

Apologies for the delay. I'm still working on exposing this in a clean way. For now you can do something like the following:

In [1]: import BioSimSpace as BSS

In [2]: m = BSS.Parameters.gaff("C").getMolecule()

In [3]: BSS.IO.saveMolecules("test", m, "grotop,sdf")
Out[3]:
['/home/lester/Code/openbiosim/biosimspace/demo/test.top',
 '/home/lester/Code/openbiosim/biosimspace/demo/test.sdf']

In [4]: grotop_mol = BSS.IO.readMolecules("test.top")[0]

In [5]: sdf_mol = BSS.IO.readMolecules("test.sdf")[0]

In [6]: c = grotop_mol._sire_object.cursor()

In [7]: c["coordinates"] = sdf_mol._sire_object.property("coordinates")

In [8]: new_mol = BSS._SireWrappers.Molecule(c.commit())

In [9]: new_mol.coordinates()
Out[9]:
[(0.0000e+00 A, 0.0000e+00 A, 0.0000e+00 A),
 (-0.6780 A, 0.8740 A, -0.0860 A),
 (-0.4100 A, -0.8460 A, -0.5890 A),
 (0.0850 A, -0.2980 A, 1.0650 A),
 (1.0030 A, 0.2710 A, -0.3900 A)]

In [10]: new_mol._sire_object.property("bond")
Out[10]:
TwoAtomFunctions( size=4
0:    C1:1-H2:2    : 330.6 [r - 1.0969]^2
1:    C1:1-H3:3    : 330.6 [r - 1.0969]^2
2:    C1:1-H4:4    : 330.6 [r - 1.0969]^2
3:    C1:1-H5:5    : 330.6 [r - 1.0969]^2
)

lohedges commented 7 months ago

The reason that makeCompatibleWith doesn't work for SDF is that the matching functions (that map atoms between the two molecular representations) rely either having the same residue naming conventions in the two topologies, or having the same coordinates. (We know the second can't be true, since one doesn't have coordinates.)

DreamCykl commented 7 months ago

Hi, @lohedges

Many thanks for the reply, that has worked perfectly, happy to use this code block since it's only a few lines long!

lohedges commented 7 months ago

Perfect, thanks for the update. I'll try to remember to tag you in when I expose this functionality at the BioSimSpace level. It's a little bit more tricky since I want to deal with different topology representations, so will need to map atoms between layouts.

Cheers.

DreamCykl commented 5 months ago

Hi @lohedges

I've got a few worries that this code block isn't working as expected. Is there a chance that applying the coordinates from the SDF to the grotop-loaded molecule requires more complex coordinate-atom matching?

To give an example of my concern I follow this procedure:

Load input SDF
Parameterise input SDF
Save parameterised system to grotop file
Load grotop file, load input SDF, copy coordinates from input SDF and apply to grotop-loaded molecule
Save new molecule generated in 4) as SDF
Try and load SDF into RDKit to check 'sanity' of the molecule, find that the molecules can't be loaded, suggesting process has failed.

Here's a code block recreating the process described above

from rdkit import Chem
import BioSimSpace as BSS
from pathlib import Path

def parameterise_sdf(sdf, output_filebase):
    mol = BSS.IO.readMolecules([sdf])
    parameterised = BSS.Parameters.gaff2(mol.getMolecule(0)).getMolecule()
    return BSS.IO.saveMolecules(output_filebase, parameterised, ['grotop'])[0]

def create_sdf_from_sdf_coords_and_top(sdf, top_file, output_filebase):
    parameterised_mol = BSS.IO.readMolecules(str(top_file))[0]
    cursor = parameterised_mol._sire_object.cursor()
    sdf_system = BSS.IO.readMolecules([str(sdf)])
    cursor["coordinates"] = sdf_system[0]._sire_object.property("coordinates")
    new_mol = BSS._SireWrappers.Molecule(cursor.commit())  
    new_sdf = BSS.IO.saveMolecules(str(output_filebase), new_mol, ["sdf"])[0]
    return new_sdf

output_data = Path('output_data')
rdmols = []

# I used a random ligand benchmark set for this example, but presumably this issue applies across all molecules
for i, sdf in enumerate(list(Path('tyk2_ligands/').glob('*.sdf'))):

    # First append the original RDKit molecule to check the input SDF can create a valid molecule
    rdmols.append(
        [Chem.SDMolSupplier(str(sdf))[0]]
    )

    # Now parameterise the SDF, merge the coordinates from the SDF to the output GROTOP and save as a new SDF
    top_file = parameterise_sdf(str(sdf), str(output_data / sdf.stem))
    new_sdf = create_sdf_from_sdf_coords_and_top(
        str(sdf), top_file, str(output_data / sdf.stem)
    )

    # Read in the new SDF and append the rdmol. 
    rdmols[i].append(
        Chem.SDMolSupplier(str(new_sdf[0]))
    )

When looking at the 'rdmols' list, each list item gives [<rdkit.Chem.rdchem.Mol at 0x7ff601581a80>, None], showing that doing this merging of SDF coordinates and grotop bond info cannot create a valid molecule. I wonder if simply copying the coordinates directly to the grotop-loaded molecule without any specific atom matching is causing this. Any advice/thoughts would be appreciated!

lohedges commented 5 months ago

Hi there. I'm a bit confused by your code. This bit here...

    # Read in the new SDF and append the rdmol. 
    rdmols[i].append(
        Chem.SDMolSupplier(str(new_sdf[0]))
    )

uses str(new_sdf[0]). In my case, the value of new_sdf after the first loop is:

'/home/lester/Code/michellab/RBFE-Benchmark/output_data/lig_ejm47.sdf'

So str(new_sdf[0])) is equal to '/', which causes the SDMolsupplier will to fail. If I instead use...

    # Read in the new SDF and append the rdmol. 
    rdmols[i].append(
        Chem.SDMolSupplier(new_sdf)
    )

then everything seems to work as expected.

There shouldn't be any issue matching atoms in this way as long as the atoms in the topology and SDF file are in the same order, which is the case here. (If you read a GroTop along with a coordinate file I think there is some additional logic to try to match the atoms by name if order doesn't work, but this isn't done here.)

DreamCykl commented 5 months ago

Hey, sorry about that - I edited the code after pasting into github and thought I'd made the correct amendments, clearly I hadn't.

Made sure I didn't have any silly typo bugs like this in the code block and I'm still encountering the issue. Did you try accessing the mols from the SDMolSupplier object instantiated with the new_sdf?

Here are some files generated during the process, lig-ejm-45.sdf (input), lig-ejm-45.top (parameterised top), new-lig-ejm-45.sdf (new SDF created by merging with the GroTop file). I'm unable to read in new-lig-ejm-45.sdf with RDKit (even with sanitize=False) which suggests to me the SDF is invalid. For what its worth I'm able to read it in with BSS and convert to something approximating an RDKit mol (although it doesn't fails to convert to smiles and back to rdmol).

...as long as the atoms in the topology and SDF file are in the same order, which is the case here.

Okay I think this is what was causing my issue - I haven't been guaranteeing that the atoms are going to be in the same order. I could potentially use RDKit's CanonicalRankAtoms to canonicalise the input molecules to ensure the order will always be the same. As for reasons I think that were highlighted previously in this thread, I'm unable to read SDFs and Top files concurrently presumably because they have different atom naming conventions.

lohedges commented 5 months ago

It's not possible to read them concurrently because both SDF and GroTop are classed as topology file formats, so Sire doesn't know which to use to instantiate the system. If you just loaded the SDF then wrote back to PDB, then you could load the PDB and GroTop at the same time to create a system in one go.

I'm unable to read in new-lig-ejm-45.sdf with RDKit (even with sanitize=False) which suggests to me the SDF is invalid.

Sorry, I hadn't realised that you had created a supplier rather than a RDKit molecule. I get None for this molecule when I try to access the supplier. I'll try to see why this is.

DreamCykl commented 5 months ago

If you just loaded the SDF then wrote back to PDB, then you could load the PDB and GroTop at the same time to create a system in one go.

That's a good point, I don't know why I hadn't tried doing that already. Do you think going via an intermediate PDB file will make merging of SDF with GroTop atom order agnostic?

If I use this method instead of copying across coordinates I still seem to encounter an issue with writing the parameterised molecule back to SDF, i.e. replacing the create_sdf_from_sdf_coords_and_top func with

def create_sdf_from_sdf_coords_and_top(sdf, top_file, output_filebase):
    mol = BSS.IO.readMolecules([str(sdf)])
    pdb = BSS.IO.saveMolecules('intermediate.sdf', mol, ['pdb'])[0]
    parameterised_mol = BSS.IO.readMolecules([str(top_file), str(pdb)])
    new_sdf = BSS.IO.saveMolecules(str(output_filebase), parameterised_mol, ["sdf"])[0]
    return new_sdf

still produces non-valid SDFs.

lohedges commented 5 months ago

It does work if you just use the built in RDKit conversion from Sire, e.g. if you modified your script to do:

from rdkit import Chem
import BioSimSpace as BSS
from pathlib import Path

def parameterise_sdf(sdf, output_filebase):
    mol = BSS.IO.readMolecules([sdf])
    parameterised = BSS.Parameters.gaff2(mol.getMolecule(0)).getMolecule()
    return BSS.IO.saveMolecules(output_filebase, parameterised, ['grotop'])[0]

def create_sdf_from_sdf_coords_and_top(sdf, top_file, output_filebase):
    parameterised_mol = BSS.IO.readMolecules(str(top_file))[0]
    cursor = parameterised_mol._sire_object.cursor()
    sdf_system = BSS.IO.readMolecules([str(sdf)])
    cursor["coordinates"] = sdf_system[0]._sire_object.property("coordinates")
    new_mol = BSS._SireWrappers.Molecule(cursor.commit())
    new_sdf = BSS.IO.saveMolecules(str(output_filebase), new_mol, ["sdf"])[0]
    return new_sdf, new_mol

output_data = Path('output_data')
rdmols = []

# I used a random ligand benchmark set for this example, but presumably this issue applies across all molecules
for i, sdf in enumerate(list(Path('inputs/tyk2/ligands/').glob('*45.sdf'))):

    # First append the original RDKit molecule to check the input SDF can create a valid molecule
    rdmols.append(
        [Chem.SDMolSupplier(str(sdf))[0]]
    )

    # Now parameterise the SDF, merge the coordinates from the SDF to the output GROTOP and save as a new SDF
    top_file = parameterise_sdf(str(sdf), str(output_data / sdf.stem))
    new_sdf, new_mol = create_sdf_from_sdf_coords_and_top(
        str(sdf), top_file, str(output_data / sdf.stem)
    )

    # Read in the new SDF and append the rdmol.
    rdmols[i].append(BSS.Convert.toRDKit(new_mol))

You could also bypass all of this and just set the coordinates of the conformer directly in RDKit, e.g.:

  from rdkit.Geometry import Point3D
  conf = mol.GetConformer()
  for i in range(mol.GetNumAtoms()):
     x,y,z = # Coordinates from the Sire molecule.
     conf.SetAtomPosition(i,Point3D(x,y,z))

Just to note that it does look like RDKit will preserve the atom order on load. For example, this is the positions section of the SDF file for ejm45 that I have locally:

lig_ejm_45
     RDKit          3D

 39 41  0  0  1  0  0  0  0  0999 V2000
   -4.7653   -2.7760  -16.4617 H   0  0  0  0  0  0  0  0  0  0  0  0
   -5.3593   -3.6410  -16.2030 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.7748   -4.9180  -16.1991 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.5374   -6.0484  -15.8455 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.9029   -5.9091  -15.4832 C   0  0  0  0  0  0  0  0  0  0  0  0
   -7.4906   -4.6189  -15.5147 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.7175   -3.4985  -15.8728 C   0  0  0  0  0  0  0  0  0  0  0  0
   -7.1710   -2.5202  -15.8954 H   0  0  0  0  0  0  0  0  0  0  0  0
   -9.1625   -4.3708  -15.1484 Cl  0  0  0  0  0  0  0  0  0  0  0  0
   -7.7142   -7.0985  -15.0517 C   0  0  0  0  0  0  0  0  0  0  0  0
   -8.1252   -7.1792  -13.8939 O   0  0  0  0  0  0  0  0  0  0  0  0
   -7.9143   -8.0102  -16.0203 N   0  0  0  0  0  0  0  0  0  0  0  0
   -7.4505   -7.8164  -16.8957 H   0  0  0  0  0  0  0  0  0  0  0  0
   -8.7301   -9.1752  -16.0199 C   0  0  0  0  0  0  0  0  0  0  0  0
   -9.5662   -9.5758  -14.9563 C   0  0  0  0  0  0  0  0  0  0  0  0
  -10.3667  -10.7163  -15.1327 C   0  0  0  0  0  0  0  0  0  0  0  0
  -10.3622  -11.4650  -16.2560 N   0  0  0  0  0  0  0  0  0  0  0  0
   -9.5398  -11.1166  -17.2754 C   0  0  0  0  0  0  0  0  0  0  0  0
   -8.7272   -9.9574  -17.1911 C   0  0  0  0  0  0  0  0  0  0  0  0
   -8.1074   -9.6434  -18.0179 H   0  0  0  0  0  0  0  0  0  0  0  0
   -9.5624  -11.9160  -18.4518 N   0  0  0  0  0  0  0  0  0  0  0  0
  -10.4093  -12.4586  -18.5637 H   0  0  0  0  0  0  0  0  0  0  0  0
   -8.6555  -12.0363  -19.4383 C   0  0  0  0  0  0  0  0  0  0  0  0
   -7.5594  -11.4760  -19.4513 O   0  0  0  0  0  0  0  0  0  0  0  0
   -9.0625  -12.9425  -20.6104 C   0  0  0  0  0  0  0  0  0  0  0  0
   -9.7771  -12.3994  -21.2296 H   0  0  0  0  0  0  0  0  0  0  0  0
   -9.5820  -13.8314  -20.2521 H   0  0  0  0  0  0  0  0  0  0  0  0
   -7.8922  -13.3639  -21.4848 C   0  0  0  0  0  0  0  0  0  0  0  0
   -7.0388  -14.5520  -21.1047 C   0  0  0  0  0  0  0  0  0  0  0  0
   -7.9413  -14.6258  -22.3142 C   0  0  0  0  0  0  0  0  0  0  0  0
   -7.4723  -14.5952  -23.2971 H   0  0  0  0  0  0  0  0  0  0  0  0
   -8.8440  -15.2291  -22.2526 H   0  0  0  0  0  0  0  0  0  0  0  0
   -7.3325  -15.1164  -20.2240 H   0  0  0  0  0  0  0  0  0  0  0  0
   -5.9664  -14.4567  -21.2681 H   0  0  0  0  0  0  0  0  0  0  0  0
   -7.3673  -12.5289  -21.9445 H   0  0  0  0  0  0  0  0  0  0  0  0
  -11.0351  -11.0381  -14.3489 H   0  0  0  0  0  0  0  0  0  0  0  0
   -9.6382   -9.0281  -14.0304 H   0  0  0  0  0  0  0  0  0  0  0  0
   -4.7515   -7.5847  -15.8530 Cl  0  0  0  0  0  0  0  0  0  0  0  0
   -3.7310   -5.0243  -16.4574 H   0  0  0  0  0  0  0  0  0  0  0  0

If I load this in RDKit and check the conformer positions, then I see:

In [1]: from rdkit import Chem

In [2]: suppl = Chem.SDMolSupplier("inputs/tyk2/ligands/lig_ejm45.sdf", removeHs=False)

In [3]: mol = suppl[0]

In [4]: conf = mol.GetConformer()

In [5]: for i in range(mol.GetNumAtoms()):
   ...:     pos = conf.GetAtomPosition(i)
   ...:     print(i, pos.x, pos.y, pos.z)
   ...:
0 -4.7653 -2.776 -16.4617
1 -5.3593 -3.641 -16.203
2 -4.7748 -4.918 -16.1991
3 -5.5374 -6.0484 -15.8455
4 -6.9029 -5.9091 -15.4832
5 -7.4906 -4.6189 -15.5147
6 -6.7175 -3.4985 -15.8728
7 -7.171 -2.5202 -15.8954
8 -9.1625 -4.3708 -15.1484
9 -7.7142 -7.0985 -15.0517
10 -8.1252 -7.1792 -13.8939
11 -7.9143 -8.0102 -16.0203
12 -7.4505 -7.8164 -16.8957
13 -8.7301 -9.1752 -16.0199
14 -9.5662 -9.5758 -14.9563
15 -10.3667 -10.7163 -15.1327
16 -10.3622 -11.465 -16.256
17 -9.5398 -11.1166 -17.2754
18 -8.7272 -9.9574 -17.1911
19 -8.1074 -9.6434 -18.0179
20 -9.5624 -11.916 -18.4518
21 -10.4093 -12.4586 -18.5637
22 -8.6555 -12.0363 -19.4383
23 -7.5594 -11.476 -19.4513
24 -9.0625 -12.9425 -20.6104
25 -9.7771 -12.3994 -21.2296
26 -9.582 -13.8314 -20.2521
27 -7.8922 -13.3639 -21.4848
28 -7.0388 -14.552 -21.1047
29 -7.9413 -14.6258 -22.3142
30 -7.4723 -14.5952 -23.2971
31 -8.844 -15.2291 -22.2526
32 -7.3325 -15.1164 -20.224
33 -5.9664 -14.4567 -21.2681
34 -7.3673 -12.5289 -21.9445
35 -11.0351 -11.0381 -14.3489
36 -9.6382 -9.0281 -14.0304
37 -4.7515 -7.5847 -15.853
38 -3.731 -5.0243 -16.4574

I'll try to figure out why the modified SDF file isn't working for this ligand. Simply reading and writing the original SDF with Sire does result in an SDF that can be read by RDKit, e.g:

In [1]: import sire as sr

In [2]: mols = sr.load("inputs/tyk2/ligands/lig_ejm45.sdf")

In [3]: sr.save(mols, "test.sdf")
Out[3]: ['/home/lester/Code/michellab/RBFE-Benchmark/test.sdf']

In [4]: from rdkit import Chem

In [5]: rdmol = Chem.SDMolSupplier("test.sdf")[0]
[14:39:56] Warning: molecule is tagged as 2D, but at least one Z coordinate is not zero. Marking the mol as 3D.

In [6]: rdmol
Out[6]: <rdkit.Chem.rdchem.Mol at 0x79fb5d47cb30>

Originally I assumed that there was some read-write inconsistecy, so some information was lost on write. I'll check the GroTop molecule since it's possible that this isn't in the order we would expect.

lohedges commented 5 months ago

The order appears to be fine:

In [1]: import sire as sr

In [2]: sdf = sr.load("inputs/tyk2/ligands/lig_ejm45.sdf")[0]

In [3]: grotop = sr.load("output_data/lig_ejm45.top")[0]

WARNINGS encountered when parsing the topology:

** Warnings for molecule type GroMolType( name() = 'MOL', nExcludedAtoms() = 3 ) **

Ignoring 85 'pairs' lines
e.g. the first ignored pairs line is
1 4 1====

Silence these warnings by passing `show_warnings=False` when loading.

In [4]: for atom0, atom1 in zip(sdf.atoms(), grotop.atoms()):
   ...:     print(atom0, atom1)
   ...:
Atom( H:1     [  -4.77,   -2.78,  -16.46] ) Atom( H:1 )
Atom( C:2     [  -5.36,   -3.64,  -16.20] ) Atom( C:2 )
Atom( C:3     [  -4.77,   -4.92,  -16.20] ) Atom( C:3 )
Atom( C:4     [  -5.54,   -6.05,  -15.85] ) Atom( C:4 )
Atom( C:5     [  -6.90,   -5.91,  -15.48] ) Atom( C:5 )
Atom( C:6     [  -7.49,   -4.62,  -15.51] ) Atom( C:6 )
Atom( C:7     [  -6.72,   -3.50,  -15.87] ) Atom( C:7 )
Atom( H:8     [  -7.17,   -2.52,  -15.90] ) Atom( H:8 )
Atom( Cl:9    [  -9.16,   -4.37,  -15.15] ) Atom( Cl:9 )
Atom( C:10    [  -7.71,   -7.10,  -15.05] ) Atom( C:10 )
Atom( O:11    [  -8.13,   -7.18,  -13.89] ) Atom( O:11 )
Atom( N:12    [  -7.91,   -8.01,  -16.02] ) Atom( N:12 )
Atom( H:13    [  -7.45,   -7.82,  -16.90] ) Atom( H:13 )
Atom( C:14    [  -8.73,   -9.18,  -16.02] ) Atom( C:14 )
Atom( C:15    [  -9.57,   -9.58,  -14.96] ) Atom( C:15 )
Atom( C:16    [ -10.37,  -10.72,  -15.13] ) Atom( C:16 )
Atom( N:17    [ -10.36,  -11.46,  -16.26] ) Atom( N:17 )
Atom( C:18    [  -9.54,  -11.12,  -17.28] ) Atom( C:18 )
Atom( C:19    [  -8.73,   -9.96,  -17.19] ) Atom( C:19 )
Atom( H:20    [  -8.11,   -9.64,  -18.02] ) Atom( H:20 )
Atom( N:21    [  -9.56,  -11.92,  -18.45] ) Atom( N:21 )
Atom( H:22    [ -10.41,  -12.46,  -18.56] ) Atom( H:22 )
Atom( C:23    [  -8.66,  -12.04,  -19.44] ) Atom( C:23 )
Atom( O:24    [  -7.56,  -11.48,  -19.45] ) Atom( O:24 )
Atom( C:25    [  -9.06,  -12.94,  -20.61] ) Atom( C:25 )
Atom( H:26    [  -9.78,  -12.40,  -21.23] ) Atom( H:26 )
Atom( H:27    [  -9.58,  -13.83,  -20.25] ) Atom( H:27 )
Atom( C:28    [  -7.89,  -13.36,  -21.48] ) Atom( C:28 )
Atom( C:29    [  -7.04,  -14.55,  -21.10] ) Atom( C:29 )
Atom( C:30    [  -7.94,  -14.63,  -22.31] ) Atom( C:30 )
Atom( H:31    [  -7.47,  -14.60,  -23.30] ) Atom( H:31 )
Atom( H:32    [  -8.84,  -15.23,  -22.25] ) Atom( H:32 )
Atom( H:33    [  -7.33,  -15.12,  -20.22] ) Atom( H:33 )
Atom( H:34    [  -5.97,  -14.46,  -21.27] ) Atom( H:34 )
Atom( H:35    [  -7.37,  -12.53,  -21.94] ) Atom( H:35 )
Atom( H:36    [ -11.04,  -11.04,  -14.35] ) Atom( H:36 )
Atom( H:37    [  -9.64,   -9.03,  -14.03] ) Atom( H:37 )
Atom( Cl:38   [  -4.75,   -7.58,  -15.85] ) Atom( Cl:38 )
Atom( H:39    [  -3.73,   -5.02,  -16.46] ) Atom( H:39 )

I can only assume that RDKit uses the coordinates to infer some information and they are causing the problem. I'll compare the SDF file written by RDKit after direct conversion (using .toRDKit) to the one written by Sire.

lohedges commented 5 months ago

The issue is entirely caused by line 4 in the SDF file.

Original:

 39 41  0  0  1  0  0  0  0  0999 V2000

Modified:

 39 41  0  0  0  0  0  0  0  0  0     0

The error that RDKit gives is:

[14:47:14] ERROR: CTAB version string invalid at line 4
[14:47:14] ERROR: moving to the beginning of the next molecule

I'm not sure why this is being modified by Sire. If I simply load the original file and write it back, then I get something that agrees with the original.

 39 41  0  0  1  0  0  0  0  0999 V2000

I'm not sure why this happens, and I can't think of a reason why it only happens for certain ligands in the tyk2 set.

lohedges commented 5 months ago

Okay, the issue is because you are using the GroTop molecule as the reference, i.e. copying the coordinates from the SDF into, then writing to SDF. Specific SDF attributes are lost, since they aren't part of the GroTop molecule. I think you want to do the opposite, i.e. use makeCompatibileWith to transfer all of the additional GroTop properties (including updated coordinates) into the SDF molecule, then writing that to file. For example, I now see the following in the SDF output.

def create_sdf_from_sdf_coords_and_top(sdf, top_file, output_filebase):
    parameterised_mol = BSS.IO.readMolecules(str(top_file))[0]
    sdf_mol = BSS.IO.readMolecules([str(sdf)])[0]
    sdf_mol.makeCompatibleWith(parameterised_mol)
    new_sdf = BSS.IO.saveMolecules(str(output_filebase), sdf_mol, ["sdf"])[0]
    return new_sdf

With this I get:

lig_ejm_45
  -Sire-202400200

 39 41  0  0  1  0  0  0  0  0999 V2000
   -4.7653   -2.7760  -16.4617 H   0  0  0  0  0  0  0  0  0  0  0  0
   -5.3593   -3.6410  -16.2030 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.7748   -4.9180  -16.1991 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.5374   -6.0484  -15.8455 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.9029   -5.9091  -15.4832 C   0  0  0  0  0  0  0  0  0  0  0  0
...

I guess you wanted to do it the other way around since you can't guarantee the same naming. In this case things match, so makeCompatibileWith works. For now I'd just use the direct RDKit conversion, since it should work regardless of the naming:

def create_rdmol_from_sdf_coords_and_top(sdf, top_file):
    parameterised_mol = BSS.IO.readMolecules(str(top_file))[0]
    cursor = parameterised_mol._sire_object.cursor()
    sdf_system = BSS.IO.readMolecules([str(sdf)])
    cursor["coordinates"] = sdf_system[0]._sire_object.property("coordinates")
    new_mol = BSS._SireWrappers.Molecule(cursor.commit())
    rdmol = BSS.Convert.toRDKit(new_mol)
    return rdmol

DreamCykl commented 5 months ago

Using the create_rdmol_from_sdf_coords_and_top function I'm unable to create fully validated molecules (If I write the rdmol to SDF and then read back in, it fails to sanitize. Equally converting rdmol -> smiles -> rdmol fails). I wonder if you're finding this too?

Using the makeCompatible() method seems to work in this instance, but as you say I can't guarantee the same naming.

Just to re-iterate what my original intention was and why I created this thread in the first place - I wanted to be able to store just the GroTop files and then re-use these files for any conformer of the same molecule, as oppose to having to re-parameterise each time I supply a new conformer. For this purpose I needed to be able to supply a conformer of a molecule via an SDF, obtain forcefields parameters by pairing it with a pre-generated GroTop file, and then write a PDB file which matches the naming convention of the GroTop file. I then use the PDB and GroTop files as input into my simulation pipeline.

My main concern is not necessarily that I can't seem to do a round-trip SDF -> Param mol -> SDF, it was more to do with the importance of order in copying the coordinates directly onto a GroTop molecule. If I supplied conformers with different atom orders in the SDF, then the pre-generated GroTop file may not be applicable to this conformer / sdf. I think this could explain how I ended up with a PDB file (generated from merging a pre-generated GroTop file with an input SDF) that moved an oxygen from its phenol position to being embedded in the ring system.

I think if I try and canonicalise each SDF so that all molecules have the same ordering then I should be able to get around this issue.

OpenBioSim / biosimspace

How to copy the coordinates from a non-parameterised molecules to a parameterised molecule which has no coordinates #268