Output PDB file formatting potentially invalid for proper ligand parsing

amorehead commented 3 months ago

Hello.

After inspecting the formatting of RFAA's structure prediction PDB files for (multi-ligand) protein-ligand complexes, it seems that the formatting for ligand (HETATM) atoms associated with each ligand "residue" (in PDB verbiage) is not compatible with well-known Python PDB file parsing libraries such as BioPython. In particular, it looks like this is because RFAA's output PDB files (for ligand predictions) do not follow PDB file formatting standards (as discussed in https://stackoverflow.com/questions/64032005/is-there-a-way-to-obtain-a-full-list-of-atoms-from-a-ligand-residue).

Is there by chance a simple modification one can make to the RFAA source code to assign unique (alt loc) IDs to each ligand atom (with respect to a given residue) so that the output PDB files can be properly parsed?

Example PDB file output showing "duplicated" ligand atoms for a particular ligand "residue" (e.g., LG1, Chain B):

HETATM10382    C LG1 B 773       9.660   2.895  -6.589  1.00  0.19           C
HETATM10383    C LG1 B 773       8.378   0.813  -6.232  1.00  0.20           C
HETATM10384    C LG1 B 773       7.378   0.065  -7.060  1.00  0.20           C
HETATM10385    N LG1 B 773       4.514   1.394  -9.239  1.00  0.20           N
HETATM10386    N LG1 B 773       1.865   3.033 -11.687  1.00  0.20           N
HETATM10387    N LG1 B 773       6.597   6.307  -7.844  1.00  0.20           N
HETATM10388    N LG1 B 773       9.322   1.545  -6.967  1.00  0.20           N
HETATM10389    O LG1 B 773       5.982   2.445 -10.644  1.00  0.21           O
HETATM10390    O LG1 B 773       1.884   1.542 -13.390  1.00  0.20           O

amorehead commented 3 months ago

In case anyone else needs a solution to this issue, one can (for the time being) use the following BioPandas function to extract all hetero-atoms uniquely for each ligand "residue".

import numpy as np
from biopandas.pdb import PandasPdb

def insert_unique_ligand_atom_name(input_pdb_file: str, output_pdb_file: str):
    """
    Add unique identifier to ligand atoms with the same residue ID and chain ID.

    :param input_pdb_file: Path to the input PDB file.
    :param output_pdb_file: Path to the output PDB file.
    """
    pdb = PandasPdb().read_pdb(input_pdb_file)
    ligand_atoms = pdb.df["HETATM"]
    ligand_atoms["atom_name"] = ligand_atoms["atom_name"] + np.arange(1, len(ligand_atoms) + 1).astype(str)
    pdb.df["HETATM"] = ligand_atoms
    pdb.to_pdb(output_pdb_file)

amorehead commented 3 months ago

Related to this issue, I am having difficulty parsing the predicted ligand structures back into the original (e.g., RDKit) conformer for each input SMILES molecule. It seems the inference code uses OpenBabel to read each input SMILES string/SDF file, and the SMILES strings parsed from each pybel.Molecule(obmol) object (via pybel.Molecule(obmol).write("smi")) differ from RDKit's parsed SMILES string from the corresponding Chem.Mol. Besides the SMILES strings being different between inputs and outputs, it also appears the ordering of atoms in the output molecules is different from the inputs as well. I am not aware of a quick workaround for this besides perhaps replacing OpenBabel with RDKit for input molecule loading during inference.

amorehead commented 3 months ago

I believe I have identified the issue. When having RFAA load a ligand chain as an SDF file, OpenBabel makes some unexpected modifications to the resulting molecular graph (e.g., adds hydrogens to the SMILES, permutes atoms, etc). However, when having RFAA load a ligand chain strictly as a SMILES string, then OpenBabel successfully maintains the integrity of the input molecular graph based on my experience with the following test case.

Input SMILES: OC1CCC(O)N(O)CCCCCNC(O)CCC(O)N(O)CCCCCN1 OpenBabel-parsed SMILES (from SMILES directly): OC1CCC(O)N(O)CCCCCNC(O)CCC(O)N(O)CCCCCN1\t\n OpenBabel-parsed SMILES (from SDF file): C1CCN[C@@H](CC[C@@H](N(CCCCCN[C@@H](CC[C@@H](N(CC1)O)O)O)O)O)O\t\n

Consequently, I am even noticing different predicted complex structures (protein and ligand) depending on whether the input ligand was parsed as a SMILES string or rather as an SDF file, as shown as follows.

Excerpt of ligand atom predictions for parsed SMILES input:

HETATM10367    C LG1 B 773      12.559  -1.503  -4.176  1.00  0.22           C
HETATM10368    C LG1 B 773      12.359  -2.663  -3.325  1.00  0.22           C
HETATM10369    C LG1 B 773      12.827  -4.047  -3.546  1.00  0.24           C
HETATM10370    C LG1 B 773      14.533  -4.666  -5.213  1.00  0.22           C
HETATM10371    C LG1 B 773      14.383  -4.224  -6.600  1.00  0.22           C
HETATM10372    C LG1 B 773      13.810  -3.500  -7.489  1.00  0.22           C
HETATM10373    C LG1 B 773      12.965  -2.668  -8.347  1.00  0.22           C
HETATM10374    C LG1 B 773      10.588  -1.855  -7.441  1.00  0.22           C
HETATM10375    C LG1 B 773      10.148  -1.023  -6.340  1.00  0.22           C
HETATM10376    C LG1 B 773       9.526  -1.371  -5.020  1.00  0.22           C
HETATM10377    C LG1 B 773       9.368  -1.205  -3.716  1.00  0.22           C
HETATM10378    C LG1 B 773       9.557  -0.845  -2.245  1.00  0.23           C
HETATM10379    C LG1 B 773      10.840   0.901  -0.730  1.00  0.21           C
HETATM10380    C LG1 B 773      11.847   1.916  -1.136  1.00  0.21           C
HETATM10381    C LG1 B 773      13.221   1.921  -1.399  1.00  0.21           C
HETATM10382    C LG1 B 773      13.949   2.528  -2.431  1.00  0.21           C
HETATM10383    C LG1 B 773      14.025   0.717  -4.362  1.00  0.22           C
HETATM10384    C LG1 B 773      13.230  -0.299  -3.774  1.00  0.21           C
HETATM10385    N LG1 B 773      13.445  -4.566  -4.470  1.00  0.23           N
HETATM10386    N LG1 B 773      11.790  -1.893  -8.199  1.00  0.22           N
HETATM10387    N LG1 B 773      10.199   0.233  -1.751  1.00  0.22           N

Excerpt of ligand atom predictions for parsed SDF input:

HETATM10367    C LG1 B 773      12.833  -3.713  -2.842  1.00  0.22           C
HETATM10368    C LG1 B 773      12.523  -4.387  -4.119  1.00  0.23           C
HETATM10369    C LG1 B 773      13.938  -4.789  -4.713  1.00  0.23           C
HETATM10370    C LG1 B 773      15.470  -3.908  -6.408  1.00  0.22           C
HETATM10371    C LG1 B 773      15.426  -3.196  -7.591  1.00  0.22           C
HETATM10372    C LG1 B 773      15.002  -2.326  -8.731  1.00  0.21           C
HETATM10373    C LG1 B 773      14.440  -1.534  -9.734  1.00  0.22           C
HETATM10374    C LG1 B 773      11.993  -2.104 -10.247  1.00  0.22           C
HETATM10375    C LG1 B 773      10.855  -1.852  -9.617  1.00  0.23           C
HETATM10376    C LG1 B 773      10.082  -2.272  -8.557  1.00  0.23           C
HETATM10377    C LG1 B 773      10.139  -1.879  -7.203  1.00  0.22           C
HETATM10378    C LG1 B 773       8.948  -1.460  -6.379  1.00  0.23           C
HETATM10379    C LG1 B 773       8.968   0.068  -4.405  1.00  0.22           C
HETATM10380    C LG1 B 773       9.996   0.917  -3.816  1.00  0.22           C
HETATM10381    C LG1 B 773      11.167   0.897  -3.102  1.00  0.22           C
HETATM10382    C LG1 B 773      11.416   0.363  -1.761  1.00  0.21           C
HETATM10383    C LG1 B 773      12.474  -1.727  -1.350  1.00  0.22           C
HETATM10384    C LG1 B 773      11.926  -2.754  -2.318  1.00  0.23           C
HETATM10385    N LG1 B 773      14.365  -4.196  -5.788  1.00  0.24           N
HETATM10386    N LG1 B 773      13.183  -1.668 -10.276  1.00  0.22           N
HETATM10387    N LG1 B 773       9.422  -0.518  -5.536  1.00  0.22           N

Is anyone aware of a solution to this issue with SDF file parsing with OpenBabel? It seems that the solution for now will be to only load ligands as SMILES strings. However, I believe this will cause issues for covalent modification predictions since that pipeline relies on SDF file inputs.

(Side note: It seems there is a documented history of issues with having OpenBabel correctly parse molecular graphs from coordinate input files: e.g., https://github.com/openbabel/openbabel/issues/1244)

baker-laboratory / RoseTTAFold-All-Atom

Output PDB file formatting potentially invalid for proper ligand parsing #54