Incorrect residue numbering for large DCD + PSF universe

dstelter92 commented 5 years ago

Expected behavior

len(Universe.residues) should provide the correct number of residues for the system.

Actual behavior

len(U.residues) is incorrect and inconsistent with the supplied PSF file for large (>10000 residues) systems only.

I have a large homogenous molecular system with 17,600 molecules (12 atoms/molecule) that I am treating as residues for analysis. I expect that the len(U.residues) should give me that, but instead it gives 10760, which is incorrect. I have verified my PSF and there are indeed 17,600 residues (confirmed by loading into VMD). It seems that excess atoms are being assigned to residues with index > 10000 leading to the incorrect count. The total number of atoms in the universe is correct.

I have an identical system setup with 5,100 molecules that has no issues, which is why I think this bug depends on the number of residues.

It's also worth mentioning that this leads to incorrect behavior when identifying the resid for a particular atom, ie: Universe.atoms[-1].resid

I'm more than happy to provide my inputs for testing.

EDIT: I should add, the DCD is from a LAMMPS trajectory.

Code to reproduce the behavior

I don't think this will work with any of the test data as I think the bug requires > 9999 residues.

import MDAnalysis as mda

u = mda.Universe(PSF, DCD)
print(len(u.atoms.residues)
for i in range(len(u.atoms.residues)):
  natom = len(U.residues[i].atoms)
  if natom != 12 # all molecules in my system have 12 atoms
    print(natom, u.residues[i]) # should print nothing for correct behavior

When I run this code, I find that every residue from 1000 -> 1759 has 10x the atoms that it should (120 atoms/residue). Any help here would be greatly appreciated, thanks for your time.

Currently version of MDAnalysis

Which version are you using? 0.18.0
Which version of Python (python -V)? 2.7.12
Which operating system? Ubuntu 16.04

richardjgowers commented 5 years ago

Hi David

Thanks for the detailed issue, really helps to solve these things quickly.

If you could provide an example PSF that does this it would be handy!

Richard On Aug 30, 2018, 12:24 PM -0500, David Stelter notifications@github.com, wrote:

Expected behavior len(Universe.residues) should provide the correct number of residues for the system. Actual behavior len(U.residues) is incorrect and inconsistent with the supplied PSF file for large (>1000 residues) systems only. I have a large homogenous molecular system with 17,600 molecules (12 atoms/molecule) that I am treating as residues for analysis. I expect that the len(U.residues) should give me that, but instead it gives 10760, which is incorrect. I have verified my PSF and there are indeed 17,600 residues (confirmed by loading into VMD). It seems that excess atoms are being assigned to residues with index > 10000 leading to the incorrect count. The total number of atoms in the universe is correct. I have an identical system setup with 5,100 molecules that has no issues, which is why I think this bug depends on the number of residues. It's also worth mentioning that this leads to incorrect behavior when identifying the resid for a particular atom, ie: Universe.atoms[-1].resid I'm more than happy to provide my inputs for testing,. Code to reproduce the behavior I don't think this will work with any of the test data as I think the bug requires > 999 residues. import MDAnalysis as mda

u = mda.Universe(PSF, DCD) print(len(u.atoms.residues) for i in range(len(u.atoms.residues)): natom = len(U.residues[i].atoms) if natom != 12 # all molecules in my system have 12 atoms print(natom, u.residues[i]) # should print nothing for correct behavior When I run this code, I find that every residue from 1000 -> 1759 has 10x the atoms that it should (120 atoms/residue). Any help here would be greatly appreciated, thanks for your time. Currently version of MDAnalysis

• Which version are you using? 0.18.0 • Which version of Python (python -V)? 2.7.12 • Which operating system? Ubuntu 16.04

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

dstelter92 commented 5 years ago

Hi Richard,

Sure. Please see the following google drive link, both the DCD (not 100% sure that will work) and PSF are there. Thanks for the quick response! https://drive.google.com/drive/folders/1Hk8s8h9pag2rFY8XQKlV9DDQXlxWpyK2?usp=sharing

David

dstelter92 commented 5 years ago

So I did some investigating, and the PSF becomes shifted at residue 10000 (line 119995). Removing this shift leads to 'correct-er' behaviour for MDAnalysis (at the expense of no longer loading in VMD). I think this is likely due to the hard-coded indicies in PSFParser._parseatoms, specifically the atom_parser dictionary.

orbeckst commented 5 years ago

The transition at

  119986      9999      4    4      0.000000       72.0000           0
  119987      9999      4    4      0.000000       72.0000           0
  119988      9999      4    4      0.000000       72.0000           0
  119989      10000      1    1      1.000000       72.0000           0
  119990      10000      2    2     -1.000000       72.0000           0
  119991      10000      3    3      0.000000       72.0000           0

is suspicious but I think the problem lies elsewhere, namely that the file does not make clear in which format it is in – see below.

How was the PSF written? The comments say "VMD-generated NAMD/X-Plor PSF structure file". My understanding is that VMD ignores the official CHARMM PSF format (which is fixed column) and instead does everything space-separated. It should insert a flag "NAMD" in the header. See comments near https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/package/MDAnalysis/topology/PSFParser.py#L215

However, because there is no such flag, MDAnalysis parses the PSF file according to the rules of "STANDARD" PSF format.

It should be parseable as a "NAMD" style PSF. However, when I added "NAMD" to the top, it does not work:

p = mda.topology.PSFParser.PSFParser("system_namd.psf")
topol = p.parse()

it fails with an IndexError

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-26-7e0e731a2f37> in <module>()
----> 1 topol = p.parse()
[...]
/Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/mdanalysis/package/MDAnalysis/topology/PSFParser.py in <lambda>(x)
    250         # once partitioned, assigned each component the correct type
    251         set_type = lambda x: (int(x[0]) - 1, x[1] or "SYSTEM", int(x[2]), x[3],
--> 252                               x[4], x[5], float(x[6]), float(x[7]))
    253

IndexError: list index out of range

This is because the PSF file contains too few entries:

ipdb> p atom_parser(line)
['1', '1', '1', '1', '1.000000', '72.0000', '0']
ipdb> p set_type(atom_parser(line))
*** IndexError: list index out of range

In set_type https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/package/MDAnalysis/topology/PSFParser.py#L251 we want to see at least 8 columns https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/package/MDAnalysis/topology/PSFParser.py#L206 (we ignore IMOVE) but there are only 7 in this PSF file. I think you are missing LSEGID in your PSF file.

For a start, can you tell us more about the PSF file format that you are using? What do the columns mean?

orbeckst commented 5 years ago

Also, any documents that describe the file format specifications would be welcome.

dstelter92 commented 5 years ago

How was the PSF written?

It was written in VMD from a LAMMPS data file read in via topotools. Something like topotools readlammpsdata system.data animate write psf system.psf $sel

Also, any documents that describe the file format specifications would be welcome.

Sure, TopoTools is here. Animate is a standard utility of VMD, not sure where to find info on how animate writes psf files.

What do the columns mean?

I think the columns are nonstandard as they are based on what topotools was able to load into VMD from the default data file. LAMMPS data files depend on the atom_style, mine was "full". The specific columns on this PSF are: AtomID ResID AtomName AtomType Charge Mass Unused0

I think LAMMPS does not define SegmentID or ResNames, so they are missing.

orbeckst commented 5 years ago

If there are missing columns then a fixed column format really seems to be the only sensible approach. Otherwise I don't know how the parser should know what your columns mean.

At the moment, the lines up to the shift at residue 10000 are read with the column-based parser. At this line, the column-based parser breaks and we try the NAMD "split on whitespace" parser as the last resort. This one now fails because there are too few columns.

I am not quite sure yet what we should actually be doing. Suggestions welcome.

dstelter92 commented 5 years ago

At the moment, the lines up to the shift at residue 10000 are read with the column-based parser. At this line, the column-based parser breaks and we try the NAMD "split on whitespace" parser as the last resort. This one now fails because there are too few columns.

I tried a workaround to just put dummy LSEGID and LRES columns in where they should be (without with "NAMD PSF" header, can't load any PSF with that option). I needed to do some manual (thank god for visual blocks) column editing to get this to work. This passes my test example above, but will need to look into it with more detail.

I am not quite sure yet what we should actually be doing. Suggestions welcome.

I do think anyone coming from LAMMPS will have this issue, BUT... Perhaps the easiest solution is to just NOT use a PSF, and rather read in with a LAMMPS topology parser. Probably worth exploring...

For a long term solution, I think it would be useful to look into how VMD's PSF parser handles this file, as loading this file into VMD yields correct behavior.

richardjgowers commented 5 years ago

Yeah I had a quick look at this issue before, whitespace separated with missing columns is a read headache to parse (esp. without killing performance).

@dstelter92 Just a quick note, the parsers/readers are defined on the fly in MDAnalysis, so you can easily monkeypatch without having to have a custom installation of MDAnalysis. eg:

# in your analysis script, or imported **after** MDAnalysis

class MyPSFParser():
    # setting this is what makes MDAnalysis recognise this as class for PSF files
    format = 'PSF'

    # copy paste existing code
    # modify to parse these weird files

u = mda.Universe('something.psf', ...)  # will now use the class above not the default

MDAnalysis / mdanalysis

Incorrect residue numbering for large DCD + PSF universe #2061