Failing to parse 5MLU - GitHub Issues

sobolevnrm commented 3 years ago

A user alerted me to this issue via email. PDB2PQR is unable to parse 5MLU despite the fact that this appears to be a high-resolution structure with most atoms in place.

PDB2PQR may be encountering two problems. The first (non-fatal) issue seems to be related to incorrect parsing of the ANISOU records in the PDB file, which leads PDB2PQR to report extra atoms:

2021-09-06 15:18:23,157 WARNING:biomolecule.py:1001:repair_heavy:Extra atom OP1 in DA I -72! - 
2021-09-06 15:18:23,157 WARNING:biomolecule.py:1003:repair_heavy:Deleted this atom.
2021-09-06 15:18:23,157 WARNING:biomolecule.py:1001:repair_heavy:Extra atom OP2 in DA I -72! - 
2021-09-06 15:18:23,157 WARNING:biomolecule.py:1003:repair_heavy:Deleted this atom.
2021-09-06 15:18:23,159 WARNING:biomolecule.py:1001:repair_heavy:Extra atom OP1 in DA J -72! - 
2021-09-06 15:18:23,159 WARNING:biomolecule.py:1003:repair_heavy:Deleted this atom.
2021-09-06 15:18:23,159 WARNING:biomolecule.py:1001:repair_heavy:Extra atom OP2 in DA J -72! - 
2021-09-06 15:18:23,159 WARNING:biomolecule.py:1003:repair_heavy:Deleted this atom.

The second (fatal) issue is related to missing backbone atoms in the structure:

2021-09-06 15:18:23,161 CRITICAL:main.py:782:main_driver:Too few atoms present to reconstruct or cap residue GLY M 551 in structure! This error is generally caused by missing backbone atoms in this biomolecule; you must use an external program to complete gaps in the biomolecule backbone. Heavy atoms missing from GLY M 551:  C O OXT CA
2021-09-06 15:18:23,161 CRITICAL:main.py:783:main_driver:Giving up.

However, this failure is expected: the program should stop here because the PDB file itself documents a gap in the backbone:

REMARK 470 MISSING ATOM                                                         
REMARK 470 THE FOLLOWING RESIDUES HAVE MISSING ATOMS (M=MODEL NUMBER;           
REMARK 470 RES=RESIDUE NAME; C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;          
REMARK 470 I=INSERTION CODE):                                                   
REMARK 470   M RES CSSEQI  ATOMS                                                
REMARK 470     GLY M 551    CA   C    O
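
As a rough illustration of how this gap could be detected up front, here is a minimal sketch that reads REMARK 470 data lines by fixed columns and flags residues missing protein backbone atoms. The names `parse_remark_470` and `BACKBONE` are hypothetical and are not part of PDB2PQR; the column offsets are taken from the `M RES CSSEQI  ATOMS` template line above.

```python
# Hypothetical helper, not PDB2PQR code: find residues with missing backbone atoms.
BACKBONE = {"N", "CA", "C", "O"}  # protein backbone heavy atoms


def parse_remark_470(lines):
    """Yield (res_name, chain, res_seq, missing_atoms) for REMARK 470 data lines."""
    for line in lines:
        if not line.startswith("REMARK 470"):
            continue
        try:
            # SSEQ occupies string indices 20-23 in the template line.
            res_seq = int(line[20:24])
        except ValueError:
            continue  # explanatory/header line, not a data line
        yield line[15:18].strip(), line[19], res_seq, line[25:].split()


remark_lines = [
    "REMARK 470 MISSING ATOM",
    "REMARK 470   M RES CSSEQI  ATOMS",
    "REMARK 470     GLY M 551    CA   C    O",
]

records = list(parse_remark_470(remark_lines))
for res_name, chain, res_seq, atoms in records:
    missing_backbone = BACKBONE & set(atoms)
    if missing_backbone:
        print(f"{res_name} {chain} {res_seq} lacks backbone atoms: {sorted(missing_backbone)}")
```

A check like this would only report what the depositors recorded in the file; it does not repair the gap.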

I've documented this so the user can review and let me know if I've captured the problem correctly.

intendo commented 3 years ago

Is the first error caused by 5mlu.pdb containing pairs of lines like these:

ATOM   6172  OP1  DT I -71     -53.961 -54.284  76.914  1.00239.30           O  
ANISOU 6172  OP1  DT I -71    28185  33880  28859  -5452   1734   6887       O

Do the atom names have to be unique across ATOM, ANISOU, and other record classes? Also, can the residue sequence number be negative?

In general, I am asking: what are the options for fixing this PDB file?

sobolevnrm commented 3 years ago

The PDB file isn't broken, our code is. ANISOU is additional information that accompanies the ATOM entry -- PDB2PQR doesn't use it but it looks like it is processing it anyway. I'm not sure why.
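
To make the ATOM/ANISOU relationship concrete, here is a minimal sketch that slices the two records quoted above by the fixed columns of the PDB format. `parse_atom_fields` is a hypothetical helper for illustration, not a PDB2PQR function. It shows that both records carry the same serial number and atom identity, and that the residue number field holds a signed integer.

```python
# Hypothetical helper, not PDB2PQR code: fixed-column slicing of PDB records.
def parse_atom_fields(line):
    """Return the identity fields shared by ATOM, HETATM and ANISOU records."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),  # signed integer; negative values are legal
    }


atom_line = "ATOM   6172  OP1  DT I -71     -53.961 -54.284  76.914  1.00239.30           O  "
anisou_line = "ANISOU 6172  OP1  DT I -71    28185  33880  28859  -5452   1734   6887       O"

atom = parse_atom_fields(atom_line)
anisou = parse_atom_fields(anisou_line)

# The ANISOU record describes the same atom as the preceding ATOM record.
identity_keys = ("serial", "name", "res_name", "chain", "res_seq")
same_atom = all(atom[key] == anisou[key] for key in identity_keys)

# Occupancy and B-factor are adjacent fixed-width fields ("1.00239.30"),
# so splitting on whitespace would misread them.
occupancy = float(atom_line[54:60])
b_factor = float(atom_line[60:66])

print(atom["record"], anisou["record"], same_atom, atom["res_seq"], occupancy, b_factor)
```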

sobolevnrm commented 3 years ago

The sign on the residue number likely represents the "sense" of the DNA strand. If a file is in the Protein DataBank and PDB2PQR fails to parse it, we can be ~99% sure it is a problem with our code rather than the PDB entry.

intendo commented 3 years ago

Sorry, I asked the question incorrectly. I should have asked: What is the correct action the code should take to handle the two lines where the ATOM and ANISOU have the same atom name and residue sequence number?

Since ATOM and ANISOU are separate classes, each parsed when the @register_line_parser decorator is used, there does not seem to be a code path for an ANISOU instance to find its matching ATOM instance. Should the ANISOU class inherit from the ATOM class? Should we simply ignore/skip ANISOU records?

sobolevnrm commented 3 years ago

I think we should just ignore/skip the ANISOU records; however, I can't figure out why the code is using them at all. Can you tell where in the code that record is being used (rather than just parsed)? I looked quickly and was unable to find it.
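
For reference, skipping could be as simple as filtering on the record name before any atoms are built. This is only a sketch under that assumption; `coordinate_records` and `COORDINATE_RECORDS` are hypothetical names and do not reflect how PDB2PQR's parser is actually organized.

```python
# Hypothetical sketch, not PDB2PQR code: keep only coordinate records.
COORDINATE_RECORDS = ("ATOM  ", "HETATM")  # record names occupy the first six columns


def coordinate_records(lines):
    """Yield only ATOM/HETATM lines, dropping ANISOU and every other record type."""
    for line in lines:
        if line[0:6] in COORDINATE_RECORDS:
            yield line


pdb_lines = [
    "ATOM   6172  OP1  DT I -71     -53.961 -54.284  76.914  1.00239.30           O  ",
    "ANISOU 6172  OP1  DT I -71    28185  33880  28859  -5452   1734   6887       O",
]

kept = list(coordinate_records(pdb_lines))
print(len(kept), "of", len(pdb_lines), "records kept")
```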

intendo commented 2 years ago

Commenting out the ANISOU record parsing did not change anything. I modified line 1001 of pdb2pqr/biomolecule.py to also output the ATOM record:

      _LOGGER.warning(f"Extra atom {atomname} in {residue}! - ({residue.get_atom(atomname)})")

This produced the output showing the ATOM records in question:

WARNING:Extra atom OP1 in DA I -72! - (ATOM   6151  OP1 DA    -72     -45.846 -52.479  76.652  0.0000 0.0000)
WARNING:Deleted this atom.
WARNING:Extra atom OP2 in DA I -72! - (ATOM   6152  OP2 DA    -72     -46.305 -50.401  75.222  0.0000 0.0000)
WARNING:Deleted this atom.
WARNING:Extra atom OP1 in DA J -72! - (ATOM   9142  OP1 DA    -72      -3.967  -0.901  92.640  0.0000 0.0000)
WARNING:Deleted this atom.
WARNING:Extra atom OP2 in DA J -72! - (ATOM   9143  OP2 DA    -72      -2.926  -1.671  90.426  0.0000 0.0000)
WARNING:Deleted this atom.

I am no closer to finding the problem, but I am hoping this new information helps one of you spot something obvious that I am missing.

sobolevnrm commented 2 years ago

The new version of PDB2PQR (currently in master, release coming soon) fixes the problem with the nucleic acid.

Electrostatics / pdb2pqr

Failing to parse 5MLU #221