LPDI-EPFL / masif

MaSIF- Molecular surface interaction fingerprints. Geometric deep learning to decipher patterns in molecular surfaces.
Apache License 2.0
582 stars 154 forks source link

Problems when extracting chains from PDB files #12

Closed jomimc closed 4 years ago

jomimc commented 4 years ago

Hi,

I was using "input_output.extractPDB" to extract PDB chains for my work and two problems popped up. Essentially the code can end up missing a few residues depending on the following:

1:

class NotDisordered(Select):
    def accept_atom(self, atom):
        return not atom.is_disordered() or atom.get_altloc() == "A" 

According to the comment on this class, it is supposed to exclude disordered atoms. However, this class actually is used to save disordered atoms. It appears to be a result of an error in the validation of the X-ray-derived structure (link), whereby two different configurations are given for a set of atoms. What this code is supposed to do is to choose one out of those two configurations ().

Typically the two configurations are labelled 'A' and 'B', however I have noticed that they can be labelled as '1' or '2'. When they are labelled '1' or '2', the class returns False -- the disordered residues are ignored. For my own work, I have used the following function to ensure that the residues are not ignored, even if they are labelled '1' or '2'.

class NotDisordered(Select):
    def accept_atom(self, atom):
        return not atom.is_disordered() or atom.get_altloc() == "A" or atom.get_altloc() == "1" 

2:

Modified / non-canonical amino acids are designated HETATM in the PDB, even though they are clearly amino acids. I added a function which reads the names of modified / non-canonical amino acids from the PDB SEQRES section, and notes the non-standard codes.

from Bio.SeqUtils import IUPACData
PROTEIN_LETTERS = [x.upper() for x in IUPACData.protein_letters_3to1.keys()]

def find_modified_amino_acids(path):
    res_set = set()
    for line in open(path, 'r'):
        if line[:6] == 'SEQRES':
            for res in line.split()[4:]:
                res_set.add(res)
    for res in list(res_set):
        if res in PROTEIN_LETTERS:
            res_set.remove(res)
    return res_set

def extractPDB(
    infilename, outfilename, chain_ids=None, includeWaters=False, invert=False
):
    # extract the chain_ids from infilename and save in outfilename. 
    # includeWaters: deprecated parameter, include the crystallographic waters (should not be used). 
    # invert: Select all chains EXCEPT those in chain_ids.
    parser = PDBParser(QUIET=True)
    struct = parser.get_structure(infilename, infilename)
    model = Selection.unfold_entities(struct, "M")[0]
    chains = Selection.unfold_entities(struct, "C")

    # Select residues to extract and build new structure
    structBuild = StructureBuilder.StructureBuilder()
    structBuild.init_structure("output")
    structBuild.init_seg(" ")
    structBuild.init_model(0)
    outputStruct = structBuild.get_structure()

    # Load a list of non-standard amino acid names -- these are
    # typically listed under HETATM, so they would be typically
    # ignored by the orginal algorithm
    modified_amino_acids = find_modified_amino_acids(infilename)

    for chain in model:
        if (
            chain_ids == None
            or (chain.get_id() in chain_ids and not invert)
            or invert == True
        ):
            structBuild.init_chain(chain.get_id())
            for residue in chain:
                het = residue.get_id()
                if not invert:
                    if het[0] == " " or (het[0] == "W" and includeWaters):
                        outputStruct[0][chain.get_id()].add(residue)
                    elif het[0][-3:] in modified_amino_acids:
                        print(het[0])
                        outputStruct[0][chain.get_id()].add(residue)
                else:
                    if (het[0] == "W" and includeWaters) or (
                        chain.get_id() not in chain_ids
                    ):
                        outputStruct[0][chain.get_id()].add(residue)
                    elif het[0][-3:] in modified_amino_acids:
                        outputStruct[0][chain.get_id()].add(residue)
                                                                                                                                                                                                 63,0-1        

    # Output the selected residues
    pdbio = PDBIO()
    pdbio.set_structure(outputStruct)
    pdbio.save(outfilename, select=NotDisordered())
pablogainza commented 4 years ago

Thank you jomimc!!! These are very useful fixes! I have uploaded them with proper credit. I'm excited to use this.