Retrieving bond information from PDB files

JacobAnter commented 2 years ago

Hello,

in some cases, the PDBFile seems to have difficulties retrieving bond information from PDB files. To be more precise, this occurred when trying to read a PDB file returned from the ClusPro web server for protein-protein docking (https://cluspro.org/login.php). The PDB file is attached at the end of the issue.

To be more precise, I tried to execute the following code snippet:

from biotite.structure.io.pdb import PDBFile

suspicious_pdb_file = PDBFile.read("docking_pose_GR_51_balanced.01.pdb")
suspicious_atom_array = suspicious_pdb_file.get_structure(
    model=1,
    include_bonds=True
)

This, however, resulted in an IndexError: IndexError: index 2522 is out of bounds for axis 0 with size 2522

Surprisingly, however, it is possible to load this PDB file into PyMol via the GUI and to retain the bond information; alpha-helices and beta-sheets are correctly depicted. This might indicate that the respective PDB file belongs to some edge case Biotite does not cover yet. The assumption that the provided PDB file differs from "conventional" ones is also due to a comparison to PDB files obtained from the Protein Data Bank: In the file for structure 7krj, for example, representing a ternary complex, the three proteins comprised in this complex (Hsp90, P23 and Glucocorticoid receptor) are not separated into different structures, or, in other words, the keyword "END" only appears once, at the end. In case of the PDB file returned by the ClusPro web server, however, receptor and ligand indeed are separated into to different structures, as indicated by the fact that the keyword "END" occurs twice.

If this indeed is the case, it would be great if you could expand the code so that Biotite is capable of retrieving bond information from this edge case of PDB files.

Thanks in advance.

Regards

docking_pose_GR_51_balanced.01.docx

claudejrogers commented 2 years ago

PDB files have different "record" types indicated by the first 6 characters in each line (e.g., ATOM, HETATM, CONECT). Bond information is only necessary for HETATM records in the PDB with corresponding CONECT entries. The CONECT entries describe the bonds between HETATM records, but aren't necessary for ATOM records since it is known how amino acids are bonded together. If your structure only has protein atom records (lines starting with ATOM), you can set include_bonds=False. Different proteins are differentiated by the chain ID and would be separated by a TER record, which are ignored when parsing the file (as is END).

The problem with your file is that it restarts the atom_id numbers between the two structures, which biotite doesn't expect.

padix-key commented 2 years ago

I agree with @claudejrogers: Biotite and the PDB standard expects that the atom IDs are unique, which is not the case for your file. However, I think an InvalidFileError would be more descriptive here. Hence, I will keep this issue open.

To make your example work. you can calculate the bonds afterwards. Since the file does not contain any CONECT records anyway and PDBFile uses connect_via_residue_names() under the hood, the result should be the same to include_bonds=True (if it worked in your case).

from biotite.structure.io.pdb import PDBFile
import biotite.structure as struc

suspicious_pdb_file = PDBFile.read("/home/kunzmann/downloads/test.pdb")
atom_array = suspicious_pdb_file.get_structure(model=1)
atom_array.bonds = struc.connect_via_residue_names(atom_array)

claudejrogers commented 2 years ago

@jacob-no-cop if your docking program produces several ligand poses for a fixed receptor, it might be worthwhile to split the entries. Code for that would look like:

from pathlib import Path
from contextlib import ExitStack, suppress

def split_cluspro_pdb(path):
    p = Path(path)
    parent = p.parent
    stem = p.stem
    suffix = p.suffix
    outfiles = [
        str(parent / f"{stem}.{mol}{suffix}") for mol in ("receptor", "ligand")
    ]
    with open(path) as f, ExitStack() as stack:
        files = (stack.enter_context(open(path, "w")) for path in outfiles)
        out = next(files)
        for line in f:
            out.write(line)
            if line.startswith("END"):
                with suppress(StopIteration):
                    # ignore StopIteration after ligand
                    out = next(files

This would read a file at some path, like "docking_pose_GR_51_balanced.01.pdb" and save "docking_pose_GR_51_balanced.receptor.01.pdb" and "docking_pose_GR_51_balanced.01.ligand.pdb" in the same directory.

JacobAnter commented 2 years ago

Thanks to both of you, @padix-key and @claudejrogers. So, if I understood you correctly, the atom numbering in my PDB file not conforming to the PDB standard only raises a problem when trying to retrieve the CONECT records, but not when trying to add amino acid bonds via connect.via.residue_names(), isn't it?

With regard to your suggestion of splitting the PDB file into individual files for the receptor and ligand, @claudejrogers, I already pursued a less sophisticated approach to do so. To be more precise, I created the separate file via the following code snippet:

from biotite.structure.io.pdb import  PDBFile
import ammolite

cluspro_pdb_file = PDBFile.read("docking_pose_GR_51_balanced.01.pdb")

with open("docking_pose_GR_51_balanced.01.pdb"", "r") as f:
    line_list = f.readlines()
    idx_end_rec = line_list.index("END\n")
    list_rec = line_list[:idx_end_rec + 1]
    list_lig = line_list[idx_end_rec + 1:]

# The immunophilin, i. e. FKBP is defined as receptor,
# whereas the GR is defined as ligand
with open("docking_pose_51_rec.pdb", "w") as f:
    f.write("".join(list_rec))

with open("docking_pose_51_lig.pdb", "w") as f:
    f.write("".join(list_lig))

This, however, has not proven to be an efficient workaround as in a subsequent step, I loaded the two separate structures in PyMol via Ammolite (another package developed by @padix-key built upon biotite) and merged them into one object:

# Load the selected docking poses into PyMol and take a picture
pdb_file_docking_51_rec = PDBFile.read(
    "docking_pose_51_rec.pdb"
)
atom_array_docking_51_rec = pdb_file_docking_51_rec.get_structure(
    model=1,
    include_bonds=True
)
pymol_docking_pose_51_rec = ammolite.PyMOLObject.from_structure(
    atom_array_docking_51_rec,
    name="docking_pose_51_rec"
)

pdb_file_docking_51_lig = PDBFile.read(
    "docking_pose_51_lig.pdb"
)
atom_array_docking_51_lig = pdb_file_docking_51_lig.get_structure(
    model=1,
    include_bonds=True
)
pymol_docking_pose_51_lig = ammolite.PyMOLObject.from_structure(
    atom_array_docking_51_lig,
    name="docking_pose_51_lig"
)

# Merge the GR and FKBP51 and delete the excess objects
# afterwards in order to enable joint rotation in the 
# course of alignment
ammolite.cmd.create(
    name="docking_pose_GR_with_51",
    selection="docking_pose_51_rec | docking_pose_51_lig"
)
pymol_docking_pose_51 = ammolite.PyMOLObject(
    name="docking_pose_GR_with_51"
)

ammolite.cmd.delete(
    name="docking_pose_51_lig"
)
ammolite.cmd.delete(
    name="docking_pose_51_rec"
)

ammolite.cmd.zoom()
# The ``ammolite.show()`` command can only be executed in Jupyter Notebooks
ammolite.show()

Doing this, the resulting merged object is not displayed properly any more. Again, using the PyMol GUI does not lead to this problem.

JacobAnter commented 2 years ago

A have an update regarding the approach of splitting the entries: This approach works only if bonds are added afterwards as suggested by @padix-key. If the two structures are then loaded into PyMol via ammolite and merged into one object, it is displayed properly.

JacobAnter commented 2 years ago

It's again me; pursuing the approach of splitting the entries and adding the bonds via connect_via_residue_names does not work either.

padix-key commented 2 years ago

@jacob-no-cop, I tested your code and also saw that one of the chains looks very fragmented in PyMOL. However, this seems to be a problem of PyMOL's create command. When you use ammolite and you want both chains as joint object, it is easier to create a joint AtomArray in Biotite and to transfer this one to PyMOL. In my test the following snippet worked:

from biotite.structure.io.pdb import  PDBFile
import ammolite

with open("/home/kunzmann/downloads/test.pdb", "r") as f:
    line_list = f.readlines()
    idx_end_rec = line_list.index("END\n")
    list_rec = line_list[:idx_end_rec + 1]
    list_lig = line_list[idx_end_rec + 1:]

# The immunophilin, i. e. FKBP is defined as receptor,
# whereas the GR is defined as ligand
with open("docking_pose_51_rec.pdb", "w") as f:
    f.write("".join(list_rec))

with open("docking_pose_51_lig.pdb", "w") as f:
    f.write("".join(list_lig))

# Load the selected docking poses into PyMol and take a picture
pdb_file_docking_51_rec = PDBFile.read(
    "docking_pose_51_rec.pdb"
)
atom_array_docking_51_rec = pdb_file_docking_51_rec.get_structure(
    model=1,
    include_bonds=True
)

pdb_file_docking_51_lig = PDBFile.read(
    "docking_pose_51_lig.pdb"
)
atom_array_docking_51_lig = pdb_file_docking_51_lig.get_structure(
    model=1,
    include_bonds=True
)

atoms = atom_array_docking_51_rec + atom_array_docking_51_lig

ammolite.launch_interactive_pymol()
pymol_object = ammolite.PyMOLObject.from_structure(atoms)
pymol_object.color("green", atoms.chain_id == "C")
pymol_object.color("blue", atoms.chain_id == "D")
pymol_object.zoom()

JacobAnter commented 2 years ago

Thanks @padix-key , I finally solved the issue by loading the PDB file comprising both structures, adding bonds in an additional step via connect_via_residue_names and transferring this AtomArray to PyMol via Ammolite.

biotite-dev / biotite

Retrieving bond information from PDB files #379