I was using "input_output.extractPDB" to extract PDB chains for my work and two problems popped up. Essentially the code can end up missing a few residues depending on the following:
1:
class NotDisordered(Select):
def accept_atom(self, atom):
return not atom.is_disordered() or atom.get_altloc() == "A"
According to the comment on this class, it is supposed to exclude disordered atoms. However, this class actually is used to save disordered atoms. It appears to be a result of an error in the validation of the X-ray-derived structure (link), whereby two different configurations are given for a set of atoms. What this code is supposed to do is to choose one out of those two configurations ().
Typically the two configurations are labelled 'A' and 'B', however I have noticed that they can be labelled as '1' or '2'. When they are labelled '1' or '2', the class returns False -- the disordered residues are ignored. For my own work, I have used the following function to ensure that the residues are not ignored, even if they are labelled '1' or '2'.
class NotDisordered(Select):
def accept_atom(self, atom):
return not atom.is_disordered() or atom.get_altloc() == "A" or atom.get_altloc() == "1"
2:
Modified / non-canonical amino acids are designated HETATM in the PDB, even though they are clearly amino acids. I added a function which reads the names of modified / non-canonical amino acids from the PDB SEQRES section, and notes the non-standard codes.
from Bio.SeqUtils import IUPACData
PROTEIN_LETTERS = [x.upper() for x in IUPACData.protein_letters_3to1.keys()]
def find_modified_amino_acids(path):
res_set = set()
for line in open(path, 'r'):
if line[:6] == 'SEQRES':
for res in line.split()[4:]:
res_set.add(res)
for res in list(res_set):
if res in PROTEIN_LETTERS:
res_set.remove(res)
return res_set
def extractPDB(
infilename, outfilename, chain_ids=None, includeWaters=False, invert=False
):
# extract the chain_ids from infilename and save in outfilename.
# includeWaters: deprecated parameter, include the crystallographic waters (should not be used).
# invert: Select all chains EXCEPT those in chain_ids.
parser = PDBParser(QUIET=True)
struct = parser.get_structure(infilename, infilename)
model = Selection.unfold_entities(struct, "M")[0]
chains = Selection.unfold_entities(struct, "C")
# Select residues to extract and build new structure
structBuild = StructureBuilder.StructureBuilder()
structBuild.init_structure("output")
structBuild.init_seg(" ")
structBuild.init_model(0)
outputStruct = structBuild.get_structure()
# Load a list of non-standard amino acid names -- these are
# typically listed under HETATM, so they would be typically
# ignored by the orginal algorithm
modified_amino_acids = find_modified_amino_acids(infilename)
for chain in model:
if (
chain_ids == None
or (chain.get_id() in chain_ids and not invert)
or invert == True
):
structBuild.init_chain(chain.get_id())
for residue in chain:
het = residue.get_id()
if not invert:
if het[0] == " " or (het[0] == "W" and includeWaters):
outputStruct[0][chain.get_id()].add(residue)
elif het[0][-3:] in modified_amino_acids:
print(het[0])
outputStruct[0][chain.get_id()].add(residue)
else:
if (het[0] == "W" and includeWaters) or (
chain.get_id() not in chain_ids
):
outputStruct[0][chain.get_id()].add(residue)
elif het[0][-3:] in modified_amino_acids:
outputStruct[0][chain.get_id()].add(residue)
63,0-1
# Output the selected residues
pdbio = PDBIO()
pdbio.set_structure(outputStruct)
pdbio.save(outfilename, select=NotDisordered())
Hi,
I was using "input_output.extractPDB" to extract PDB chains for my work and two problems popped up. Essentially the code can end up missing a few residues depending on the following:
1:
According to the comment on this class, it is supposed to exclude disordered atoms. However, this class actually is used to save disordered atoms. It appears to be a result of an error in the validation of the X-ray-derived structure (link), whereby two different configurations are given for a set of atoms. What this code is supposed to do is to choose one out of those two configurations ().
Typically the two configurations are labelled 'A' and 'B', however I have noticed that they can be labelled as '1' or '2'. When they are labelled '1' or '2', the class returns False -- the disordered residues are ignored. For my own work, I have used the following function to ensure that the residues are not ignored, even if they are labelled '1' or '2'.
2:
Modified / non-canonical amino acids are designated HETATM in the PDB, even though they are clearly amino acids. I added a function which reads the names of modified / non-canonical amino acids from the PDB SEQRES section, and notes the non-standard codes.