A bug in the preparation.protonate._fix_pybel_output and its potential cause

Dear colleagues,

Recently I'm faced with a disastrous bug when protonating the protein-ligand complex, where the protonated ligand area will become a mess in the protonated complex.

After testing, it's found that if the original ligand is without H, this bug will not be caused; however, if the original ligand is with H, this bug will probably happen, where the output PDB file is misaligned in its mapping between atom names and atom types.

COMPND    .../bin/EnzyHTP/test/preparation/data/ligand_test_4WI.pdb    
AUTHOR    GENERATED BY OPEN BABEL 3.1.0                                         
ATOM      1  C5  4WI B 307      37.258  63.990  48.016  1.00  0.00           C  
ATOM      2  H1  4WI B 307      37.779  64.177  46.592  1.00  0.00           C  
ATOM      3  H2  4WI B 307      38.256  62.831  46.314  1.00  0.00           N  
ATOM      4  C6  4WI B 307      38.594  62.128  47.436  1.00  0.00           C  
ATOM      5  H3  4WI B 307      39.053  61.008  47.452  1.00  0.00           O  
ATOM      6  H4  4WI B 307      38.170  62.938  48.648  1.00  0.00           C  
ATOM      7  N1  4WI B 307      37.442  62.096  49.719  1.00  0.00           C  
ATOM      8  H31 4WI B 307      37.057  62.954  50.940  1.00  0.00           C  
...

I looked into the preparation.protonate.pybel_protonate_pdb_ligand function (which is to protonate a ligand with pybel), noticing that when preparation.protonate._fix_pybel_output is called by preparation.protonate.pybel_protonate_pdb_ligand, the path of original ligand file is passed via its formal parameter ref_name_path.

# PYBEL interface
def pybel_protonate_pdb_ligand(in_path: str, out_path: str, ph: float = 7.0) -> str:
    """
    This is a wrapper of ...

    Args:
        in_path: path of input ligand pdb for protonation
        out_path: path of output result protonated ligand pdb
        ph: target pH
    Return:
        (write file to out_path)
        return the {out_path} from input
    """
    int_path = fs.get_valid_temp_name(out_path.removesuffix(".pdb") + "_badname.pdb")

    pybel.ob.obErrorLog.SetOutputLevel(0)
    mol = next(pybel.readfile("pdb", in_path))
    mol.OBMol.AddHydrogens(False, True, ph)
    mol.write("pdb", int_path, overwrite=True)
    # fix atom label and residue name
    _fix_pybel_output(int_path, out_path, in_path)

    fs.clean_temp_file_n_dir([int_path])
    return out_path

When pybel is performing the protonation in the step mol.OBMol.AddHydrogens(False, True, ph) on a ligand containing hydrogen atoms, all the existing hydrogen atoms will be removed first, and then re-add hydrogen atoms, written after the heavy atoms in the PDB file, without deduplicating hydrogen atom names (or all atom names in some earlier versions), where a _badname.pdb file will be generated as an intermediate file.

COMPND    .../bin/EnzyHTP/test/preparation/data/ligand_test_4WI.pdb 
AUTHOR    GENERATED BY OPEN BABEL 3.1.0
ATOM      1  C5  4WI B 307      37.258  63.990  48.016  1.00  0.00           C  
ATOM      2  C6  4WI B 307      37.779  64.177  46.592  1.00  0.00           C  
ATOM      3  N1  4WI B 307      38.256  62.831  46.314  1.00  0.00           N  
ATOM      4  C   4WI B 307      38.594  62.128  47.436  1.00  0.00           C  
ATOM      5  O   4WI B 307      39.053  61.008  47.452  1.00  0.00           O  
...
ATOM     33  F1  4WI B 307      47.032  65.486  51.289  1.00  0.00           F  
ATOM     34  F2  4WI B 307      46.806  65.611  49.127  1.00  0.00           F  
ATOM     35  F   4WI B 307      45.771  64.061  50.247  1.00  0.00           F  
ATOM     36  H   4WI B 307      36.245  63.647  48.002  1.00  0.00           H  
ATOM     37  H   4WI B 307      37.268  64.907  48.567  1.00  0.00           H  
ATOM     38  H   4WI B 307      36.999  64.468  45.920  1.00  0.00           H  
ATOM     39  H   4WI B 307      38.515  64.946  46.480  1.00  0.00           H  
ATOM     40  H   4WI B 307      38.328  62.462  45.404  1.00  0.00           H  
...

Thus, the _fix_pybel_output function is introduced to assign unique name to each atom. When None is passed to its formal parameter ref_name_path, the atoms will be assigned in the default order of names; otherwise, the atom names in the PDB file are assigned to the newly generated atoms based on a mapping with the atom names in the original ligand file and the atom names in the new file (_badname.pdb).

Two candidate solutions are proposed here:

The new version of pybel (openbabel=3.1.1=py39h421517d_8 in conda) does not change the atomic name of the heavy atom when protonating the ligand, so I tested calling the _fix_pybel_output function with the formal parameter ref_name_path set to None, the protonated output is not abnormal, regardless of whether the original ligand contains hydrogen or not. Therefore, it is possible to consider not using the original ligand file as a template for fixing the atom name when installing this Pybel version as a dependency.

An alternative strategy is also desirable if compatibility issues and the unpredictability of Pybel's behavior in future versions are taken into account:

Make it mandatory for ligands containing hydrogen atoms to undergo a dehydrogenation step before being protonated by Pybel, thus allowing us to bypass the mismatch problem that occurs when the _fix_pybel_output function uses the original ligand file containing the hydrogen atoms as a template for fixing the atom name.

ChemBioHTP / EnzyHTP

A bug in the preparation.protonate._fix_pybel_output and its potential cause #156