gcorso / DiffDock

Implementation of DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
https://arxiv.org/abs/2210.01776
MIT License

DiffDock v1.1 cannot read Amber residue names for the histidine amino acid residue #190

Open polo9719 opened 4 months ago

polo9719 commented 4 months ago

Amber assigns different residue names to histidine depending on which protons are present: HID, HIE, or HIP instead of HIS.

This causes a problem when featurizing the protein in DiffDock, because those residues are mapped to the one-letter code X instead of H.

https://github.com/gcorso/DiffDock/blob/d3791a885e504ea7d7c3587951e259e338e4808b/datasets/constants.py#L3
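
To illustrate the failure mode, here is a minimal sketch of a three-letter to one-letter lookup that falls back to X for unknown names. The `THREE_TO_ONE` table and the `to_one_letter` helper are hypothetical stand-ins written for this example; they are not the actual contents of `datasets/constants.py`, which may be organized differently.

```python
# Hypothetical lookup table for the 20 standard amino acids.
# The real table used by DiffDock may differ.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(resname: str) -> str:
    """Return the one-letter code, or 'X' when the name is not in the table."""
    return THREE_TO_ONE.get(resname.strip().upper(), "X")


# Amber histidine variants are absent from the table, so they fall back to X.
for name in ("HIS", "HID", "HIE", "HIP"):
    print(name, "->", to_one_letter(name))
```

With a lookup of this shape, any residue name missing from the table is silently treated as unknown rather than raising an error, which would explain why the run completes but the histidines are featurized as X.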

It can easily be fixed by renaming every HID, HIE, and HIP to HIS. Is that a good way to fix it? If so, maybe it could be done automatically in the inference code. Otherwise, is there a way to read the PDB file that takes these amino acid variants into account?
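
For reference, the renaming can be sketched as a plain-text pass over the PDB records, with no parsing library involved. It relies on the fixed-column PDB layout, where the residue name occupies columns 18 to 20. The names `AMBER_TO_STANDARD`, `normalize_pdb_line`, and `normalize_pdb_text` are hypothetical helpers made up for this example, and the mapping covers only the three histidine variants mentioned above.

```python
# Assumed mapping for the histidine variants only; extend as needed.
AMBER_TO_STANDARD = {"HID": "HIS", "HIE": "HIS", "HIP": "HIS"}

# Record types that carry a residue name in columns 18-20.
RECORDS_WITH_RESNAME = ("ATOM", "HETATM", "ANISOU", "TER")


def normalize_pdb_line(line: str) -> str:
    """Replace an Amber residue name in columns 18-20 with the standard one."""
    if not line.startswith(RECORDS_WITH_RESNAME):
        return line
    standard = AMBER_TO_STANDARD.get(line[17:20])
    if standard is None:
        return line
    # Both names are three characters, so the column layout is preserved.
    return line[:17] + standard + line[20:]


def normalize_pdb_text(text: str) -> str:
    """Apply normalize_pdb_line to every line, keeping line endings intact."""
    return "".join(
        normalize_pdb_line(line) for line in text.splitlines(keepends=True)
    )


example = (
    "ATOM      1  N   HID A   1      11.104   6.134  -6.504  1.00  0.00           N\n"
    "ATOM      2  CA  ALA A   2      12.560   6.100  -6.300  1.00  0.00           C\n"
)
print(normalize_pdb_text(example), end="")
```

Note that this discards the protonation-state information carried by the Amber names, which is the same trade-off as renaming the residues by hand.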

PS-1: When running DiffDock v1 on the same protein, everything runs fine. That is why I suspect that the mapping of these modified histidines to X comes from the newly introduced ProDy package.

PS-2: I had this issue specifically with histidine, but maybe it also happens with other amino acids?
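
One way to check whether other residues are affected is to list every residue name in the ATOM records that is not one of the 20 standard amino acids. The sketch below does this on the raw PDB text; `STANDARD_RESIDUES` and `nonstandard_residue_names` are hypothetical names chosen for this example, and the function counts residues rather than atoms by tracking the chain, residue number, and insertion code.

```python
from collections import Counter

STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def nonstandard_residue_names(pdb_text: str) -> Counter:
    """Count residues in ATOM records whose name is not a standard amino acid."""
    seen = set()
    counts = Counter()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        resname = line[17:20].strip()
        if resname in STANDARD_RESIDUES:
            continue
        # Chain ID (column 22) plus residue number and insertion code
        # (columns 23-27) identify one residue across its atoms.
        key = (line[21:22], line[22:27], resname)
        if key not in seen:
            seen.add(key)
            counts[resname] += 1
    return counts


# Usage: report the non-standard names found in a file.
# with open("protein.pdb") as handle:
#     print(nonstandard_residue_names(handle.read()))
```

Any name reported here that is not handled by the renaming step would presumably be featurized as X as well.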

polo9719 commented 3 months ago

FYI, I added this pre-processing script to fix the issue:

import argparse
from Bio.PDB import PDBParser, PDBIO

# Map Amber-specific residue names to their standard three-letter names
residue_renaming_map = {
    'HID': 'HIS',
    'HIE': 'HIS',
    'HIP': 'HIS',
    'GLH': 'GLU',
    'ASH': 'ASP',
    'CYM': 'CYS',
    'CYX': 'CYS',
    'LYN': 'LYS',
}

def rename_residues(input_filename, output_filename):
    parser = PDBParser()
    structure = parser.get_structure("structure", input_filename)

    for model in structure:
        for chain in model:
            for residue in chain:
                name = residue.get_resname().strip()
                # Strip a terminal-residue marker, but only from four-letter
                # names, so that three-letter residues such as NAG or SEC
                # are left untouched
                if len(name) == 4 and name.startswith("N"):
                    name = name[1:]
                elif len(name) == 4 and name.endswith("C"):
                    name = name[:-1]
                # Map Amber variants to the standard residue name;
                # names that are not in the map are kept as they are
                residue.resname = residue_renaming_map.get(name, name)

    io = PDBIO()
    io.set_structure(structure)
    io.save(output_filename)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("input_file", type=str)
    parser.add_argument("output_file", type=str)

    args = parser.parse_args()

    rename_residues(
        args.input_file,
        args.output_file
    )