OpenBioSim / biosimspace

An interoperable Python framework for biomolecular simulation.
https://biosimspace.openbiosim.org
GNU General Public License v3.0
75 stars 11 forks source link

Option to write parameterisation intermediary file in SDF format to retain bond information #284

Closed noahharrison64 closed 5 months ago

noahharrison64 commented 6 months ago

Hey @lohedges,

I had an issue with a molecule I was trying to parameterise using gaff2. The antechamber error that arose was:

This molecule may have more than one unit. antechamber can only handle one unit. If the input is a single unit then the connectivity is wrong and the geometry may be bad. Please convert your molecule to a mol2 file via: antechamber -j 5 -at sybyl -dr no And then check your molecule with a visualization program; manually add missing bonds or delete unwanted bonds as appropriate.

I knew this error was slightly non-sensical because the input file is an SDF (that can be fully sanitized using rdkit, and so the bond info is definitely correct). Looking at the source code I saw that BSS writes the molecule to PDB before parsing this file to antechamber.

        # Write the system to a PDB file.
        try:
            _IO.saveMolecules(
                _os.path.join(str(work_dir), "antechamber"),
                new_mol,
                "pdb",
                property_map=self._property_map,
            )
        except Exception as e:
            msg = "Failed to write system to 'PDB' format."
            if _isVerbose():
                msg += ": " + getattr(e, "message", repr(e))
                raise IOError(msg) from e
            else:
                raise IOError(msg) from None

        # Generate the Antechamber command.
        command = (
            "%s -at %d -i antechamber.pdb -fi pdb "
            + "-o antechamber.mol2 -fo mol2 -c %s -s 2 -nc %d"
        ) % (_antechamber_exe, self._version, self._charge_method.lower(), charge)

Since the PDB doesn't contain bonding information, antechamber appears to be unable to correctly interpret the atom distances to approximate the bonds. Since the bond is present in the molecule (mol._sire_object.property("connectivity")) I figured it should be possible to write an intermediary SDF and use this as input to antechamber. (On another note I see that the 'bond' property isn't available for this SDF loaded molecule, even though 'connectivity' is. Would you be able to explain the logic here?)

I therefore monkey-patched the GAFF class with a new run method, simply replacing the PDB format with SDF format:

# Write the system to a PDB file.
        try:
            _IO.saveMolecules(
                _os.path.join(str(work_dir), "antechamber"),
                new_mol,
                "sdf",
                property_map=self._property_map,
            )
        except Exception as e:
            msg = "Failed to write system to 'PDB' format."
            if _isVerbose():
                msg += ": " + getattr(e, "message", repr(e))
                raise IOError(msg) from e
            else:
                raise IOError(msg) from None

        # Generate the Antechamber command.
        command = (
            "%s -at %d -i antechamber.sdf -fi sdf "
            + "-o antechamber.mol2 -fo mol2 -c %s -s 2 -nc %d"
        ) % (_antechamber_exe, self._version, self._charge_method.lower(), charge)

This appeared to work nicely and resolve the antechamber issue.

I therefore wonder whether it would be possible to check if the input molecule has connectivity / bond information, and if it does write an intermediary SDF instead of PDB file. I understand that SDFs are unique in that they're parsed to BSS as both coordinate and topology files and so this might just be appropriate for my use case, but I feel like SDF is becoming the common way of storing molecules. Unfortunately the molecule I'm using that throws the antechamber error is proprietary so I can't share the structure.

Would be good to hear your thoughts!

Cheers, Noah

lohedges commented 6 months ago

Hi there, thanks for this.

On another note I see that the 'bond' property isn't available for this SDF loaded molecule, even though 'connectivity' is. Would you be able to explain the logic here?)

The bond property is the actual bond potential, so would only be populated for a parameterised molecule. The connectivity can be based on bond, but will be autogenerated based on the equilibrium bond distances for the given atoms types in the molecule if not. As such, it's not 100% reliable and will be entirely dependent on the conformer, i.e. it works best when the molecule is minimised.

Sire now has some built in functionality to infer stereochemistry when reading SDF files. (Based on logic from MDAnalysis.) Previously we didn't have this, so would only use SDF as an input fomat for things like OpenFF when the molecule was already loaded from SDF, i.e. where the stereo info was (potentially) already present.

I therefore wonder whether it would be possible to check if the input molecule has connectivity / bond information, and if it does write an intermediary SDF instead of PDB

This should be possible, although I'd need to run some checks in order to make sure that the SDF file always contains the information required by antechamber, e.g. that the naming is always correct. In any case, it would be possible to add a keyword argument such as input_format to allow the user to specify the file format to use as input to antechamber, e.g. PDB, or SDF.

I'll have a play around to see what works best. (It will be easy to switch to SDF and see if any tests break.)

lohedges commented 6 months ago

Are you able to share your example input? I would like to test whether the required info in the SDF can be recovered if we write to PDB, then reload as SDF. If not, then I would take the approach used for the OpenFF parameterisation functions, i.e. only write to SDF if it was loaded from SDF and use PDB otherwise.

lohedges commented 6 months ago

As a note, there is code to generate a CONECT record for the PDB format, but having added this by default I found that it caused more issues than it was worth due to poor conformers.

noahharrison64 commented 6 months ago

The connectivity can be based on bond, but will be autogenerated based on the equilibrium bond distances for the given atoms types in the molecule if not.

Is the connectivity / bond info ever taken directly from the SDF? Since it's already present it shouldn't need to be inferred from atom-atom distances?

As a note, there is code to generate a CONECT record for the PDB format, but having added this by default I found that it caused more issues than it was worth due to poor conformers.

This is good to know, how do you go about doing that?

Unfortunately I can't share the molecule since it's not publicly available, I know that's not very helpful, although I'm happy to test on my side if you make any changes!

chryswoods commented 6 months ago

The connectivity is read from the SDF file. It is only inferred when no bonding information is provided in the input file (e.g. pdb).

chryswoods commented 6 months ago

(the connectivity property stores the actual connections between atoms and is populated based on information from the input file. The bond property is the actual algebraic expressions that correspond to bond potentials. Sire treats these as different things)

noahharrison64 commented 6 months ago

Got it, thanks for clarifying

lohedges commented 6 months ago

Is the connectivity / bond info ever taken directly from the SDF? Since it's already present it shouldn't need to be inferred from atom-atom distances?

Sorry, yes, it as Christopher says, it will be set if the information is there, but inferred if not. We specifically choose not to set it for PDB since the CONECT info is often missing or completely incorrect. It's generally caused so many issues that it was easier to just ignore it.

lohedges commented 6 months ago

I've added a "fix" on this branch. This uses the same approach as for the OpenFF protocol, i.e. use SDF as an intermediate if the input was loaded from that format. Could you let me know how you get on? An alternative approach would be to do the above, but let the user specify the desired format, or always write to SDF if the connectivity property is present (although that might be unreliable for the reasons mentioned above).

noahharrison64 commented 6 months ago

Hi @lohedges,

Thanks for the rapid implementation of this! Cloned your fix and it seems like that's done the job nicely! Hopefully other users will find this to be a useful feature to prevent parameterisation errors in the future.

Cheers, Noah

lohedges commented 6 months ago

Thanks for confirming. I'll have a think about whether I want to make this more flexible and will tag you in when I PR.