xyzpdb invalid spacing on large xyz files

BJWiley233 commented 2 years ago

Hi Dr. Ponder,

I noticed while trying to visualize PDB files of solvated systems that the column spacing is off after running xyzpdb: pdbfixer reports misaligned fields and Chimera won't parse the file correctly. For instance, my system has 32332 water molecules plus ions and the protein, which puts it over 100,000 atoms.

$ awk '$6==349' Ribociclib_Bind_CompE0_V0_R1_compboxproddyn.xyz | wc -l 
32332
$ wc -l Ribociclib_Bind_CompE0_V0_R1_compboxproddyn.xyz
102404 Ribociclib_Bind_CompE0_V0_R1_compboxproddyn.xyz

# with key and seq files
xyzpdb Ribociclib_Bind_CompE0_V0_R1_compboxproddyn.xyz

# pdbfixer error
$ pdbfixer Ribociclib_Bind_CompE0_V0_R1_compboxproddyn.pdb
Traceback (most recent call last):
  File "/Users/brian/anaconda3/lib/python3.9/site-packages/openmm/app/internal/pdbstructure.py", line 749, in __init__
    raise ValueError('Misaligned residue name: %s' % pdb_line)
ValueError: Misaligned residue name: ATOM       1  N   MET A    1      44.327  17.928 -33.776

Below is the first line of the PDB file output by xyzpdb, followed by the correctly spaced line I get if I remove all waters from the xyz file first. With the waters removed, pdbfixer works and I can also view the file in Chimera.

ATOM       1  N   MET A    1      44.327  17.928 -33.776 # incorrect alignment because of large xyz file
ATOM      1  N   MET A   1      44.327  17.928 -33.776 # correctly spaced

After replacing the first line above with the second, correctly formatted line, the same error appears at the next PDB line:

Traceback (most recent call last):
  File "/Users/brian/anaconda3/lib/python3.9/site-packages/openmm/app/internal/pdbstructure.py", line 749, in __init__
    raise ValueError('Misaligned residue name: %s' % pdb_line)
ValueError: Misaligned residue name: ATOM       2  CA  MET A    1      43.735  17.998 -32.461

BJWiley233 commented 2 years ago

pdbfixer runs fine on files with over 10,000 atoms, so it looks like the problem is limited to files formatted with over 100,000 atoms. The only remaining issue is that pdbfixer doesn't seem to like the H atom names for the waters: it keeps them in place in addition to "fixing" the waters by adding H1 and H2 hydrogens.

This:

HETATM 5279  O   HOH   328      -5.901 -34.831  -7.173
HETATM 5280  H   HOH   328      -6.511 -35.535  -7.025
HETATM 5281  H   HOH   328      -5.629 -34.960  -8.066

goes to this after pdbfixer, which might be something I can ask Peter about at OpenMM. Alternatively, having xyzpdb name these two water hydrogens H1 and H2 would also work :smile: but maybe you don't want to change xyzpdb, because converting back might then need to change too:

HETATM 5280  O   HOH   328      -5.901 -34.831  -7.173  1.00  0.00           O  
HETATM 5281  H1  HOH   328      -5.124 -34.838  -6.240  1.00  0.00           H  
HETATM 5282  H2  HOH   328      -6.058 -33.632  -7.074  1.00  0.00           H  
HETATM 5283  H   HOH   328      -6.511 -35.535  -7.025  1.00  0.00           H  
HETATM 5284  H   HOH   328      -5.629 -34.960  -8.066  1.00  0.00           H
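
One possible workaround, without changing xyzpdb, is to rename the water hydrogens before handing the file to pdbfixer. The sketch below is only an illustration: `rename_water_hydrogens` is a hypothetical helper, it assumes standard PDB column positions (atom name in columns 13-16, residue name in columns 18-20), and it assumes each water is written as O followed by its two H atoms, as in the records above. Whether pdbfixer then accepts the file has not been checked here.

```python
def rename_water_hydrogens(lines):
    """Rename the plain 'H' atoms of each HOH residue to 'H1' and 'H2'.

    Assumes standard PDB columns and that every water is written as
    O, H, H in that order (an assumption based on the records shown above).
    """
    out = []
    h_count = 0  # hydrogens seen since the last water oxygen
    for line in lines:
        is_atom = line.startswith(("ATOM  ", "HETATM"))
        if is_atom and line[17:20] == "HOH":
            name = line[12:16].strip()
            if name == "O":
                h_count = 0  # a new water molecule starts here
            elif name == "H":
                h_count += 1
                # Rewrite only the 4-character atom name field.
                line = line[:12] + f" H{h_count:<2d}" + line[16:]
        out.append(line)
    return out


water = [
    "HETATM 5279  O   HOH   328      -5.901 -34.831  -7.173",
    "HETATM 5280  H   HOH   328      -6.511 -35.535  -7.025",
    "HETATM 5281  H   HOH   328      -5.629 -34.960  -8.066",
]
for fixed in rename_water_hydrogens(water):
    print(fixed)
```

Only the atom name field is touched, so coordinates and residue numbering stay exactly as xyzpdb wrote them.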

jayponder commented 1 year ago

Hi Brian,

Sorry for not replying to this sooner. Official PDB files, as per the RCSB PDB website, are only allowed to have 99999 atoms or fewer. The PDB format standard only allows five fixed columns (columns 7-11) for the serial atom number.
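
As a rough illustration of this limit, the Python sketch below checks whether a serial number fits the five-column field. The helper names are hypothetical, the header line is a made-up example, and reading the atom count from the first token of the first line is an assumption about the Tinker xyz layout.

```python
MAX_PDB_SERIAL = 99999  # largest integer that fits in columns 7-11 (five digits)


def fits_pdb_serial(serial: int) -> bool:
    """Return True if `serial` fits the five-column serial field."""
    return 1 <= serial <= MAX_PDB_SERIAL


def atom_count_from_xyz_header(first_line: str) -> int:
    """Assumption: the first token of the first xyz line is the atom count."""
    return int(first_line.split()[0])


header = "100000  example system"  # hypothetical header, not from a real file
n_atoms = atom_count_from_xyz_header(header)
if not fits_pdb_serial(n_atoms):
    print(f"{n_atoms} atoms will not fit standard PDB serial numbers")
```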

So anything that any program, such as XYZPDB, OpenMM "pdbfixer", Chimera, etc., does for PDB files with over 99999 atoms is by definition "nonstandard". The main "problem" occurs more with the HETATM records than with the ATOM records. In HETATM records, a serial atom number between 10000 and 99999 will run up against the HETATM tag, as in "HETATM99999..". This makes it harder to parse the file using some kind of free format parsing. And when the number of atoms is 100000 or greater, then something has to give! Tinker's choice is to actually move the columns to the right to make room for larger numbers. Apparently other programs make other choices, but again, there is no "standard".
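
To make the column shift concrete, here is a minimal Python sketch that repacks a widened line into the standard fixed columns. The widened layout it assumes (serial field one column wider, residue number one column wider, everything else unchanged) is inferred only from the two example lines in the first post, not from the Tinker source. Wrapping oversized numbers with a modulo is an arbitrary choice for illustration, not what Tinker, pdbfixer, or Chimera do, and `wide_to_standard` is a hypothetical name.

```python
def wide_to_standard(line: str) -> str:
    """Repack a widened ATOM/HETATM line into standard PDB columns.

    Assumed widened layout (0-based slices, inferred from the example lines):
      [0:6] record, [6:12] serial, [12:23] name/altLoc/resName/chain,
      [23:28] residue number, [28:] insertion code, coordinates, and the rest.
    """
    record = line[0:6]
    if record not in ("ATOM  ", "HETATM"):
        return line  # leave other records untouched
    serial = int(line[6:12])
    middle = line[12:23]
    resseq = int(line[23:28])
    tail = line[28:]
    # Wrap values that cannot fit; this loses uniqueness above the limits.
    return f"{record}{serial % 100000:5d}{middle}{resseq % 10000:4d}{tail}"


wide = "ATOM       1  N   MET A    1      44.327  17.928 -33.776"
print(wide_to_standard(wide))
```

Slicing by fixed positions rather than splitting on whitespace also sidesteps the "HETATM99999" case, where the record name and serial number run together. Records that refer to serial numbers, such as CONECT, would need separate handling once serials wrap.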

Have you been able to figure out what Chimera, for example, wants for the formatting of these large (illegal!) PDB files? I'm willing to change Tinker to make it more compatible with some of the viewing programs, but I'm not sure there is a perfect, general solution.

BJWiley233 commented 1 year ago

I realized that Chimera is awful for large systems over 25,000 molecules, so I don't even look at solvated proteins in Chimera any more. I just use VMD when I'm actually interested in the solvated protein. I will have to go back and review the details of my initial post to see if I run into issues looking at it in VMD.

TinkerTools / tinker

xyzpdb invalid spacing on large xyz files #118