Open amorehead opened 3 months ago
In case anyone else needs a solution to this issue, one can (for the time being) use the following BioPandas function to extract all hetero-atoms uniquely for each ligand "residue".
import numpy as np
from biopandas.pdb import PandasPdb
def insert_unique_ligand_atom_name(input_pdb_file: str, output_pdb_file: str):
"""
Add unique identifier to ligand atoms with the same residue ID and chain ID.
:param input_pdb_file: Path to the input PDB file.
:param output_pdb_file: Path to the output PDB file.
"""
pdb = PandasPdb().read_pdb(input_pdb_file)
ligand_atoms = pdb.df["HETATM"]
ligand_atoms["atom_name"] = ligand_atoms["atom_name"] + np.arange(1, len(ligand_atoms) + 1).astype(str)
pdb.df["HETATM"] = ligand_atoms
pdb.to_pdb(output_pdb_file)
Related to this issue, I am having difficulty parsing the predicted ligand structures back into the original (e.g., RDKit) conformer for each input SMILES molecule. It seems the inference code uses OpenBabel to read each input SMILES string/SDF file, and the SMILES strings parsed from each pybel.Molecule(obmol)
object (via pybel.Molecule(obmol).write("smi")
) differ from RDKit's parsed SMILES string from the corresponding Chem.Mol
. Besides the SMILES strings being different between inputs and outputs, it also appears the ordering of atoms in the output molecules is different from the inputs as well. I am not aware of a quick workaround for this besides perhaps replacing OpenBabel with RDKit for input molecule loading during inference.
I believe I have identified the issue. When having RFAA load a ligand chain as an SDF file, OpenBabel makes some unexpected modifications to the resulting molecular graph (e.g., adds hydrogens to the SMILES, permutes atoms, etc). However, when having RFAA load a ligand chain strictly as a SMILES string, then OpenBabel successfully maintains the integrity of the input molecular graph based on my experience with the following test case.
Input SMILES: OC1CCC(O)N(O)CCCCCNC(O)CCC(O)N(O)CCCCCN1
OpenBabel-parsed SMILES (from SMILES directly): OC1CCC(O)N(O)CCCCCNC(O)CCC(O)N(O)CCCCCN1\t\n
OpenBabel-parsed SMILES (from SDF file): C1CCN[C@@H](CC[C@@H](N(CCCCCN[C@@H](CC[C@@H](N(CC1)O)O)O)O)O)O\t\n
Consequently, I am even noticing different predicted complex structures (protein and ligand) depending on whether the input ligand was parsed as a SMILES string or rather as an SDF file, as shown as follows.
Excerpt of ligand atom predictions for parsed SMILES input:
HETATM10367 C LG1 B 773 12.559 -1.503 -4.176 1.00 0.22 C
HETATM10368 C LG1 B 773 12.359 -2.663 -3.325 1.00 0.22 C
HETATM10369 C LG1 B 773 12.827 -4.047 -3.546 1.00 0.24 C
HETATM10370 C LG1 B 773 14.533 -4.666 -5.213 1.00 0.22 C
HETATM10371 C LG1 B 773 14.383 -4.224 -6.600 1.00 0.22 C
HETATM10372 C LG1 B 773 13.810 -3.500 -7.489 1.00 0.22 C
HETATM10373 C LG1 B 773 12.965 -2.668 -8.347 1.00 0.22 C
HETATM10374 C LG1 B 773 10.588 -1.855 -7.441 1.00 0.22 C
HETATM10375 C LG1 B 773 10.148 -1.023 -6.340 1.00 0.22 C
HETATM10376 C LG1 B 773 9.526 -1.371 -5.020 1.00 0.22 C
HETATM10377 C LG1 B 773 9.368 -1.205 -3.716 1.00 0.22 C
HETATM10378 C LG1 B 773 9.557 -0.845 -2.245 1.00 0.23 C
HETATM10379 C LG1 B 773 10.840 0.901 -0.730 1.00 0.21 C
HETATM10380 C LG1 B 773 11.847 1.916 -1.136 1.00 0.21 C
HETATM10381 C LG1 B 773 13.221 1.921 -1.399 1.00 0.21 C
HETATM10382 C LG1 B 773 13.949 2.528 -2.431 1.00 0.21 C
HETATM10383 C LG1 B 773 14.025 0.717 -4.362 1.00 0.22 C
HETATM10384 C LG1 B 773 13.230 -0.299 -3.774 1.00 0.21 C
HETATM10385 N LG1 B 773 13.445 -4.566 -4.470 1.00 0.23 N
HETATM10386 N LG1 B 773 11.790 -1.893 -8.199 1.00 0.22 N
HETATM10387 N LG1 B 773 10.199 0.233 -1.751 1.00 0.22 N
Excerpt of ligand atom predictions for parsed SDF input:
HETATM10367 C LG1 B 773 12.833 -3.713 -2.842 1.00 0.22 C
HETATM10368 C LG1 B 773 12.523 -4.387 -4.119 1.00 0.23 C
HETATM10369 C LG1 B 773 13.938 -4.789 -4.713 1.00 0.23 C
HETATM10370 C LG1 B 773 15.470 -3.908 -6.408 1.00 0.22 C
HETATM10371 C LG1 B 773 15.426 -3.196 -7.591 1.00 0.22 C
HETATM10372 C LG1 B 773 15.002 -2.326 -8.731 1.00 0.21 C
HETATM10373 C LG1 B 773 14.440 -1.534 -9.734 1.00 0.22 C
HETATM10374 C LG1 B 773 11.993 -2.104 -10.247 1.00 0.22 C
HETATM10375 C LG1 B 773 10.855 -1.852 -9.617 1.00 0.23 C
HETATM10376 C LG1 B 773 10.082 -2.272 -8.557 1.00 0.23 C
HETATM10377 C LG1 B 773 10.139 -1.879 -7.203 1.00 0.22 C
HETATM10378 C LG1 B 773 8.948 -1.460 -6.379 1.00 0.23 C
HETATM10379 C LG1 B 773 8.968 0.068 -4.405 1.00 0.22 C
HETATM10380 C LG1 B 773 9.996 0.917 -3.816 1.00 0.22 C
HETATM10381 C LG1 B 773 11.167 0.897 -3.102 1.00 0.22 C
HETATM10382 C LG1 B 773 11.416 0.363 -1.761 1.00 0.21 C
HETATM10383 C LG1 B 773 12.474 -1.727 -1.350 1.00 0.22 C
HETATM10384 C LG1 B 773 11.926 -2.754 -2.318 1.00 0.23 C
HETATM10385 N LG1 B 773 14.365 -4.196 -5.788 1.00 0.24 N
HETATM10386 N LG1 B 773 13.183 -1.668 -10.276 1.00 0.22 N
HETATM10387 N LG1 B 773 9.422 -0.518 -5.536 1.00 0.22 N
Is anyone aware of a solution to this issue with SDF file parsing with OpenBabel? It seems that the solution for now will be to only load ligands as SMILES strings. However, I believe this will cause issues for covalent modification predictions since that pipeline relies on SDF file inputs.
(Side note: It seems there is a documented history of issues with having OpenBabel correctly parse molecular graphs from coordinate input files: e.g., https://github.com/openbabel/openbabel/issues/1244)
Hello.
After inspecting the formatting of RFAA's structure prediction PDB files for (multi-ligand) protein-ligand complexes, it seems that the formatting for ligand (HETATM) atoms associated with each ligand "residue" (in PDB verbiage) is not compatible with well-known Python PDB file parsing libraries such as BioPython. In particular, it looks like this is because RFAA's output PDB files (for ligand predictions) do not follow PDB file formatting standards (as discussed in https://stackoverflow.com/questions/64032005/is-there-a-way-to-obtain-a-full-list-of-atoms-from-a-ligand-residue).
Is there by chance a simple modification one can make to the RFAA source code to assign unique (alt loc) IDs to each ligand atom (with respect to a given residue) so that the output PDB files can be properly parsed?
Example PDB file output showing "duplicated" ligand atoms for a particular ligand "residue" (e.g., LG1, Chain B):