Issues concerning PDB input

KleinesMesser commented 3 years ago

In general, PDB2PQR, which is embedded in the get_protonation() function, can process raw PDB files: it removes redundant information, adds missing atoms, and determines the protonation state in one pass. However, some cases cause problems.

Case #1: Biological assembly is not equivalent to the asymmetric unit. (To find out what "biological assembly" and "asymmetric unit" are, go to http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies.) For example, 1PKX is supposed to have two chains whereas the file has four chains. Chain A and B are one biological assembly and Chain C and D are another biological assembly. Proposed solution: Download biological assemblies instead of the original PDB file as the initial input file.
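One quick way to spot Case #1 before running get_protonation() is to list the chain IDs in the file and compare them with the expected stoichiometry. This is only a minimal sketch, assuming standard fixed-column PDB records (chain ID in column 22); chains_in_pdb is a hypothetical helper, not an existing EnzyHTP function.

```python
def chains_in_pdb(pdb_text):
    """Return the chain IDs of ATOM/HETATM records in order of first appearance."""
    chains = []
    for line in pdb_text.splitlines():
        # Column 22 (index 21) holds the chain ID in fixed-column PDB records.
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chain_id = line[21]
            if chain_id not in chains:
                chains.append(chain_id)
    return chains
```

For a file like 1PKX this would be expected to report four chains, which is the cue to fetch the biological assembly instead.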

Case #2: PDB2PQR sometimes deletes the terminal residues (1Q17 as an example). This matters most at the N-terminus: if the original terminal residue is deleted, PDB2PQR does not convert the next residue into a new N-terminal residue. The N-terminal -NH3 hydrogens are named H1, H2, and H3, whereas the backbone amide hydrogen (O=C-N-H) of a mid-chain residue is simply named H, so LEaP fails with a fatal error (in this case the terminal residue is a GLY): "FATAL: Atom .R<NGLY 239>.A<H 10> does not have a type." To resolve this issue, one should first figure out why PDB2PQR deletes the terminal residues. (Maybe it is related to the residue numbering? The 1Q17 chains start numbering residues from negative values. But PDB2PQR does not delete all residues with non-positive numbering.)

Case #3: Multiple residues at one residue site. For example, residue 240 in 1E25 is LYS/ALA/GLY/LYS. PDB2PQR merges them into one "big residue", which causes problems later in LEaP.

Case #4: Some original PDB files cause get_protonation() to crash outright. I haven't had a chance to look into the exact reasons. Examples: 1K20, 1NI9, 1WN1. For example, 1k20.pdb gives this error message:

    Traceback (most recent call last):
      File "0_MAIN.py", line 21, in <module>
        PDB1.get_protonation() #PDB2PQR has different atom order as leap
      File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_PDB.py", line 478, in get_protonation
        self._protonation_Fix(out_path, ph=ph)
      File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_PDB.py", line 522, in _protonation_Fix
        new_stru.protonation_metal_fix(Fix = 1)
      File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 540, in protonation_metal_fix
        metal.get_donor_residue(method = 'INC')
      File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 1598, in get_donor_residue
        self.get_donor_atom(method=method)
      File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 1588, in get_donor_atom
        if dist <= (R_d + R_m):
    TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'

shaoqx commented 3 years ago

Case 1: Wow, I never knew that the PDB lets you download just the biological assembly. Great finding! We have always had that stoichiometry problem waiting to be solved, and this seems to be a good way to do it. (It is important, since this is the only way to get the author-assigned pairing info.)

Case 2: This error does not seem to happen consistently.

[ ] I'll make a function removing negative residues. (as an option)
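A minimal sketch of what such an optional filter could look like, assuming fixed-column PDB records with the residue number in columns 23-26. The name drop_low_numbered_residues and the min_resseq parameter are hypothetical. Only ATOM records are filtered so that ligands and waters stay untouched, and the sketch does not address why PDB2PQR drops terminal residues in the first place.

```python
def drop_low_numbered_residues(pdb_text, min_resseq=1):
    """Remove ATOM records whose residue number is below `min_resseq`.

    With the default min_resseq=1, residues numbered 0 or negative are dropped.
    All other records (HETATM, TER, headers, ...) are passed through unchanged.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            try:
                resseq = int(line[22:26])  # columns 23-26: residue sequence number
            except ValueError:
                kept.append(line)  # malformed record: leave it for later inspection
                continue
            if resseq < min_resseq:
                continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

Passing min_resseq=0 would remove only strictly negative residue numbers.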

Case 3: For conformers, we keep the first one. (Update: PDB2PQR can take care of this.) For insertions, we just ignore them for now.

[ ] Add debug info.

Case 4: This is the error you get when the structure contains a metal that is not included in the current radius map. We should record this metal and add it to the map. (I'll look this PDB up and do that in the next update.)
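Until the map is extended, the lookup could fail with a clearer message instead of the TypeError above. A rough sketch, assuming the radii live in a plain dict keyed by element symbol; lookup_radius and within_bonding_distance are hypothetical names, and the real structure of the radius map in Class_Structure.py may differ.

```python
def lookup_radius(radius_map, element):
    """Return the radius for `element`, failing loudly if it is not in the map."""
    radius = radius_map.get(element)
    if radius is None:
        raise KeyError(
            f"No radius for element {element!r}; add it to the radius map."
        )
    return radius


def within_bonding_distance(dist, donor_radius, metal_radius):
    """Same comparison as the `dist <= (R_d + R_m)` line in the traceback."""
    return dist <= (donor_radius + metal_radius)
```

With this guard, a new metal would be reported by name at lookup time rather than surfacing later as an arithmetic error on None.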

KleinesMesser commented 3 years ago

Thanks, Qianzhen. For case 3, I think the conformers can be detected by pdb2pqr and the first one is retained. And in the pdb2pqr log file, there will be warnings like this (if it is a single-atom multiple occupancy):

2021-02-23 11:48:23,944 WARNING:main.py:414:setupmolecule:Multiple occupancies found: CA in HIS B 140.

And later if all of the multiple occupancies in one residue have been detected:

2021-02-23 11:48:23,945 WARNING:main.py:420:setupmolecule:Multiple occupancies found in HIS B 140. At least one of the instances is being ignored.

By default, only the first instance for each atom will be retained. So basically for this issue, we do not need to do anything unless we want the non-first occupancy.

I will look into the insertion today and update what I find ASAP.

We have now collected quite a few special cases for original PDB files. Although pdb2pqr is able to cope with most of these issues, I think we should still build a function to clean the raw PDB files and provide better input for pdb2pqr.

shaoqx commented 3 years ago

Cool! Looking forward to new results. Currently, if the conformer is good, then no modification is needed before going into PDB2PQR, right?

KleinesMesser commented 3 years ago

Yes. For multiple conformers (AKA multiple occupancies) we do not need to do anything. But maybe when we read in the original PDB files, it is better to output information concerning these situations: multiple occupancy (column 17 of PDB); insertion (col. 27); and additional terminal residues (non-positive residue numbering). This list may become longer later.
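A sketch of such a report, assuming fixed-column PDB records (altLoc in column 17, residue number in columns 23-26, insertion code in column 27). The function name scan_pdb and the shape of the returned dict are hypothetical.

```python
def scan_pdb(pdb_text):
    """Collect residues with alternate locations, insertion codes,
    or non-positive residue numbers.

    Returns a dict of sorted lists of (chain_id, resseq, icode, res_name).
    """
    report = {"alt_loc": set(), "insertion": set(), "non_positive": set()}
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 27:
            continue
        try:
            resseq = int(line[22:26])  # columns 23-26
        except ValueError:
            continue  # skip records with an unreadable residue number
        alt_loc = line[16]  # column 17
        icode = line[26]    # column 27
        residue = (line[21], resseq, icode.strip(), line[17:20].strip())
        if alt_loc != " ":
            report["alt_loc"].add(residue)
        if icode != " ":
            report["insertion"].add(residue)
        if resseq <= 0:
            report["non_positive"].add(residue)
    return {key: sorted(value) for key, value in report.items()}
```

The three keys mirror the three situations listed above, so new checks could be added as further keys as the list grows.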

As for insertions, what do you mean by "ignore"? Just delete the inserted residues? I am considering renumbering these inserted residues so that PDB2PQR will treat them as normal residues. But I agree with John's idea: it is better to find enzymes without insertions. The insertions can be seen as derivatives of the "mother enzyme". For example, 1E25 belongs to the class A β-lactamases. So my thoughts on insertions are: 1. Avoid them and find alternatives. 2. If we have to deal with them, renumber the inserted residues and all the following residues. Give out warnings when dealing with enzymes with insertions.
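The renumbering in option 2 could be sketched as follows. This is only an illustration under stated assumptions: fixed-column PDB records, insertion codes that sort after the residue they extend (240, 240A, 240B, ...), and a hypothetical function name renumber_insertions. It shifts each residue by the number of inserted residues at or before it in the same chain and clears the insertion code.

```python
def renumber_insertions(pdb_text):
    """Renumber inserted residues and every residue after them, per chain."""
    records = ("ATOM", "HETATM", "ANISOU", "TER")
    lines = pdb_text.splitlines()

    def residue_key(line):
        # (chain ID, residue number, insertion code), or None if not applicable.
        if not line.startswith(records) or len(line) < 27:
            return None
        try:
            return line[21], int(line[22:26]), line[26]
        except ValueError:
            return None

    # First pass: collect (resseq, icode) of every inserted residue per chain.
    inserted = {}
    for line in lines:
        key = residue_key(line)
        if key and key[2] != " ":
            inserted.setdefault(key[0], set()).add(key[1:])

    # Second pass: shift numbers and blank out the insertion code.
    out = []
    for line in lines:
        key = residue_key(line)
        if key and key[0] in inserted:
            chain_id, resseq, icode = key
            offset = sum(1 for ins in inserted[chain_id] if ins <= (resseq, icode))
            new_number = resseq + offset
            if new_number > 9999:
                raise ValueError("residue number no longer fits in 4 columns")
            line = line[:22] + "%4d " % new_number + line[27:]
        out.append(line)
    return "\n".join(out) + "\n"
```

Chains without insertion codes are left untouched, and a warning could be emitted whenever the inserted dict is non-empty.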

KleinesMesser commented 3 years ago

Some notes on the effect of "TER" lines, chain IDs, and numbering in get_protonation():
1. Using the same number for different residues in one chain causes problems, whether or not these residues are adjacent in the PDB file.
2. Deleting only the "TER" lines may cause the ligand records to be lost from the output file of get_protonation(). Water molecules and the protein itself are not affected.
3. Deleting only the chain IDs does not cause a problem. (It only changes the order of the water molecules.)
4. If both the "TER" lines and the chain IDs are deleted, PDB2PQR cannot recognize the chain separation.

Take-home message: "TER" lines are more important than chain IDs in get_protonation().

ChemBioHTP / EnzyHTP

Issues concerning PDB input #10