Closed JacobAnter closed 2 years ago
PDB files have different "record" types indicated by the first 6 characters in each line (e.g., ATOM
, HETATM
, CONECT
). Bond information is only necessary for HETATM records in the PDB with corresponding CONECT entries. The CONECT entries describe the bonds between HETATM records, but aren't necessary for ATOM records since it is known how amino acids are bonded together. If your structure only has protein atom records (lines starting with ATOM), you can set include_bonds=False
. Different proteins are differentiated by the chain ID and would be separated by a TER record, which are ignored when parsing the file (as is END).
The problem with your file is that it restarts the atom_id numbers between the two structures, which biotite doesn't expect.
I agree with @claudejrogers: Biotite and the PDB standard expects that the atom IDs are unique, which is not the case for your file. However, I think an InvalidFileError
would be more descriptive here. Hence, I will keep this issue open.
To make your example work. you can calculate the bonds afterwards. Since the file does not contain any CONECT
records anyway and PDBFile
uses connect_via_residue_names()
under the hood, the result should be the same to include_bonds=True
(if it worked in your case).
from biotite.structure.io.pdb import PDBFile
import biotite.structure as struc
suspicious_pdb_file = PDBFile.read("/home/kunzmann/downloads/test.pdb")
atom_array = suspicious_pdb_file.get_structure(model=1)
atom_array.bonds = struc.connect_via_residue_names(atom_array)
@jacob-no-cop if your docking program produces several ligand poses for a fixed receptor, it might be worthwhile to split the entries. Code for that would look like:
from pathlib import Path
from contextlib import ExitStack, suppress
def split_cluspro_pdb(path):
p = Path(path)
parent = p.parent
stem = p.stem
suffix = p.suffix
outfiles = [
str(parent / f"{stem}.{mol}{suffix}") for mol in ("receptor", "ligand")
]
with open(path) as f, ExitStack() as stack:
files = (stack.enter_context(open(path, "w")) for path in outfiles)
out = next(files)
for line in f:
out.write(line)
if line.startswith("END"):
with suppress(StopIteration):
# ignore StopIteration after ligand
out = next(files
This would read a file at some path, like "docking_pose_GR_51_balanced.01.pdb"
and save "docking_pose_GR_51_balanced.receptor.01.pdb"
and "docking_pose_GR_51_balanced.01.ligand.pdb"
in the same directory.
Thanks to both of you, @padix-key and @claudejrogers. So, if I understood you correctly, the atom numbering in my PDB file not conforming to the PDB standard only raises a problem when trying to retrieve the CONECT
records, but not when trying to add amino acid bonds via connect.via.residue_names()
, isn't it?
With regard to your suggestion of splitting the PDB file into individual files for the receptor and ligand, @claudejrogers, I already pursued a less sophisticated approach to do so. To be more precise, I created the separate file via the following code snippet:
from biotite.structure.io.pdb import PDBFile
import ammolite
cluspro_pdb_file = PDBFile.read("docking_pose_GR_51_balanced.01.pdb")
with open("docking_pose_GR_51_balanced.01.pdb"", "r") as f:
line_list = f.readlines()
idx_end_rec = line_list.index("END\n")
list_rec = line_list[:idx_end_rec + 1]
list_lig = line_list[idx_end_rec + 1:]
# The immunophilin, i. e. FKBP is defined as receptor,
# whereas the GR is defined as ligand
with open("docking_pose_51_rec.pdb", "w") as f:
f.write("".join(list_rec))
with open("docking_pose_51_lig.pdb", "w") as f:
f.write("".join(list_lig))
This, however, has not proven to be an efficient workaround as in a subsequent step, I loaded the two separate structures in PyMol via Ammolite (another package developed by @padix-key built upon biotite) and merged them into one object:
# Load the selected docking poses into PyMol and take a picture
pdb_file_docking_51_rec = PDBFile.read(
"docking_pose_51_rec.pdb"
)
atom_array_docking_51_rec = pdb_file_docking_51_rec.get_structure(
model=1,
include_bonds=True
)
pymol_docking_pose_51_rec = ammolite.PyMOLObject.from_structure(
atom_array_docking_51_rec,
name="docking_pose_51_rec"
)
pdb_file_docking_51_lig = PDBFile.read(
"docking_pose_51_lig.pdb"
)
atom_array_docking_51_lig = pdb_file_docking_51_lig.get_structure(
model=1,
include_bonds=True
)
pymol_docking_pose_51_lig = ammolite.PyMOLObject.from_structure(
atom_array_docking_51_lig,
name="docking_pose_51_lig"
)
# Merge the GR and FKBP51 and delete the excess objects
# afterwards in order to enable joint rotation in the
# course of alignment
ammolite.cmd.create(
name="docking_pose_GR_with_51",
selection="docking_pose_51_rec | docking_pose_51_lig"
)
pymol_docking_pose_51 = ammolite.PyMOLObject(
name="docking_pose_GR_with_51"
)
ammolite.cmd.delete(
name="docking_pose_51_lig"
)
ammolite.cmd.delete(
name="docking_pose_51_rec"
)
ammolite.cmd.zoom()
# The ``ammolite.show()`` command can only be executed in Jupyter Notebooks
ammolite.show()
Doing this, the resulting merged object is not displayed properly any more. Again, using the PyMol GUI does not lead to this problem.
A have an update regarding the approach of splitting the entries: This approach works only if bonds are added afterwards as suggested by @padix-key. If the two structures are then loaded into PyMol via ammolite and merged into one object, it is displayed properly.
It's again me; pursuing the approach of splitting the entries and adding the bonds via connect_via_residue_names
does not work either.
@jacob-no-cop, I tested your code and also saw that one of the chains looks very fragmented in PyMOL. However, this seems to be a problem of PyMOL's create
command. When you use ammolite and you want both chains as joint object, it is easier to create a joint AtomArray
in Biotite and to transfer this one to PyMOL. In my test the following snippet worked:
from biotite.structure.io.pdb import PDBFile
import ammolite
with open("/home/kunzmann/downloads/test.pdb", "r") as f:
line_list = f.readlines()
idx_end_rec = line_list.index("END\n")
list_rec = line_list[:idx_end_rec + 1]
list_lig = line_list[idx_end_rec + 1:]
# The immunophilin, i. e. FKBP is defined as receptor,
# whereas the GR is defined as ligand
with open("docking_pose_51_rec.pdb", "w") as f:
f.write("".join(list_rec))
with open("docking_pose_51_lig.pdb", "w") as f:
f.write("".join(list_lig))
# Load the selected docking poses into PyMol and take a picture
pdb_file_docking_51_rec = PDBFile.read(
"docking_pose_51_rec.pdb"
)
atom_array_docking_51_rec = pdb_file_docking_51_rec.get_structure(
model=1,
include_bonds=True
)
pdb_file_docking_51_lig = PDBFile.read(
"docking_pose_51_lig.pdb"
)
atom_array_docking_51_lig = pdb_file_docking_51_lig.get_structure(
model=1,
include_bonds=True
)
atoms = atom_array_docking_51_rec + atom_array_docking_51_lig
ammolite.launch_interactive_pymol()
pymol_object = ammolite.PyMOLObject.from_structure(atoms)
pymol_object.color("green", atoms.chain_id == "C")
pymol_object.color("blue", atoms.chain_id == "D")
pymol_object.zoom()
Thanks @padix-key , I finally solved the issue by loading the PDB file comprising both structures, adding bonds in an additional step via connect_via_residue_names
and transferring this AtomArray
to PyMol via Ammolite.
Hello,
in some cases, the
PDBFile
seems to have difficulties retrieving bond information from PDB files. To be more precise, this occurred when trying to read a PDB file returned from the ClusPro web server for protein-protein docking (https://cluspro.org/login.php). The PDB file is attached at the end of the issue.To be more precise, I tried to execute the following code snippet:
This, however, resulted in an IndexError:
IndexError: index 2522 is out of bounds for axis 0 with size 2522
Surprisingly, however, it is possible to load this PDB file into PyMol via the GUI and to retain the bond information; alpha-helices and beta-sheets are correctly depicted. This might indicate that the respective PDB file belongs to some edge case Biotite does not cover yet. The assumption that the provided PDB file differs from "conventional" ones is also due to a comparison to PDB files obtained from the Protein Data Bank: In the file for structure 7krj, for example, representing a ternary complex, the three proteins comprised in this complex (Hsp90, P23 and Glucocorticoid receptor) are not separated into different structures, or, in other words, the keyword "END" only appears once, at the end. In case of the PDB file returned by the ClusPro web server, however, receptor and ligand indeed are separated into to different structures, as indicated by the fact that the keyword "END" occurs twice.
If this indeed is the case, it would be great if you could expand the code so that Biotite is capable of retrieving bond information from this edge case of PDB files.
Thanks in advance.
Regards
docking_pose_GR_51_balanced.01.docx