Open dstelter92 opened 5 years ago
Hi David
Thanks for the detailed issue, really helps to solve these things quickly.
If you could provide an example PSF that does this it would be handy!
Richard On Aug 30, 2018, 12:24 PM -0500, David Stelter notifications@github.com, wrote:
Expected behavior len(Universe.residues) should provide the correct number of residues for the system. Actual behavior len(U.residues) is incorrect and inconsistent with the supplied PSF file for large (>1000 residues) systems only. I have a large homogenous molecular system with 17,600 molecules (12 atoms/molecule) that I am treating as residues for analysis. I expect that the len(U.residues) should give me that, but instead it gives 10760, which is incorrect. I have verified my PSF and there are indeed 17,600 residues (confirmed by loading into VMD). It seems that excess atoms are being assigned to residues with index > 10000 leading to the incorrect count. The total number of atoms in the universe is correct. I have an identical system setup with 5,100 molecules that has no issues, which is why I think this bug depends on the number of residues. It's also worth mentioning that this leads to incorrect behavior when identifying the resid for a particular atom, ie: Universe.atoms[-1].resid I'm more than happy to provide my inputs for testing,. Code to reproduce the behavior I don't think this will work with any of the test data as I think the bug requires > 999 residues. import MDAnalysis as mda
u = mda.Universe(PSF, DCD) print(len(u.atoms.residues) for i in range(len(u.atoms.residues)): natom = len(U.residues[i].atoms) if natom != 12 # all molecules in my system have 12 atoms print(natom, u.residues[i]) # should print nothing for correct behavior When I run this code, I find that every residue from 1000 -> 1759 has 10x the atoms that it should (120 atoms/residue). Any help here would be greatly appreciated, thanks for your time. Currently version of MDAnalysis
• Which version are you using? 0.18.0 • Which version of Python (python -V)? 2.7.12 • Which operating system? Ubuntu 16.04
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
Hi Richard,
Sure. Please see the following google drive link, both the DCD (not 100% sure that will work) and PSF are there. Thanks for the quick response! https://drive.google.com/drive/folders/1Hk8s8h9pag2rFY8XQKlV9DDQXlxWpyK2?usp=sharing
David
So I did some investigating, and the PSF becomes shifted at residue 10000 (line 119995). Removing this shift leads to 'correct-er' behaviour for MDAnalysis (at the expense of no longer loading in VMD). I think this is likely due to the hard-coded indicies in PSFParser._parseatoms, specifically the atom_parser dictionary.
The transition at
119986 9999 4 4 0.000000 72.0000 0
119987 9999 4 4 0.000000 72.0000 0
119988 9999 4 4 0.000000 72.0000 0
119989 10000 1 1 1.000000 72.0000 0
119990 10000 2 2 -1.000000 72.0000 0
119991 10000 3 3 0.000000 72.0000 0
is suspicious but I think the problem lies elsewhere, namely that the file does not make clear in which format it is in – see below.
How was the PSF written? The comments say "VMD-generated NAMD/X-Plor PSF structure file". My understanding is that VMD ignores the official CHARMM PSF format (which is fixed column) and instead does everything space-separated. It should insert a flag "NAMD" in the header. See comments near https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/package/MDAnalysis/topology/PSFParser.py#L215
However, because there is no such flag, MDAnalysis parses the PSF file according to the rules of "STANDARD" PSF format.
It should be parseable as a "NAMD" style PSF. However, when I added "NAMD" to the top, it does not work:
p = mda.topology.PSFParser.PSFParser("system_namd.psf")
topol = p.parse()
it fails with an IndexError
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-26-7e0e731a2f37> in <module>()
----> 1 topol = p.parse()
[...]
/Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/mdanalysis/package/MDAnalysis/topology/PSFParser.py in <lambda>(x)
250 # once partitioned, assigned each component the correct type
251 set_type = lambda x: (int(x[0]) - 1, x[1] or "SYSTEM", int(x[2]), x[3],
--> 252 x[4], x[5], float(x[6]), float(x[7]))
253
IndexError: list index out of range
This is because the PSF file contains too few entries:
ipdb> p atom_parser(line)
['1', '1', '1', '1', '1.000000', '72.0000', '0']
ipdb> p set_type(atom_parser(line))
*** IndexError: list index out of range
In set_type
https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/package/MDAnalysis/topology/PSFParser.py#L251 we want to see at least 8 columns
https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/package/MDAnalysis/topology/PSFParser.py#L206 (we ignore IMOVE) but there are only 7 in this PSF file. I think you are missing LSEGID in your PSF file.
For a start, can you tell us more about the PSF file format that you are using? What do the columns mean?
Also, any documents that describe the file format specifications would be welcome.
How was the PSF written?
It was written in VMD from a LAMMPS data file read in via topotools. Something like
topotools readlammpsdata system.data
animate write psf system.psf $sel
Also, any documents that describe the file format specifications would be welcome.
Sure, TopoTools is here. Animate is a standard utility of VMD, not sure where to find info on how animate writes psf files.
What do the columns mean?
I think the columns are nonstandard as they are based on what topotools was able to load into VMD from the default data file. LAMMPS data files depend on the atom_style, mine was "full". The specific columns on this PSF are: AtomID ResID AtomName AtomType Charge Mass Unused0
I think LAMMPS does not define SegmentID or ResNames, so they are missing.
If there are missing columns then a fixed column format really seems to be the only sensible approach. Otherwise I don't know how the parser should know what your columns mean.
At the moment, the lines up to the shift at residue 10000 are read with the column-based parser. At this line, the column-based parser breaks and we try the NAMD "split on whitespace" parser as the last resort. This one now fails because there are too few columns.
I am not quite sure yet what we should actually be doing. Suggestions welcome.
At the moment, the lines up to the shift at residue 10000 are read with the column-based parser. At this line, the column-based parser breaks and we try the NAMD "split on whitespace" parser as the last resort. This one now fails because there are too few columns.
I tried a workaround to just put dummy LSEGID and LRES columns in where they should be (without with "NAMD PSF" header, can't load any PSF with that option). I needed to do some manual (thank god for visual blocks) column editing to get this to work. This passes my test example above, but will need to look into it with more detail.
I am not quite sure yet what we should actually be doing. Suggestions welcome.
I do think anyone coming from LAMMPS will have this issue, BUT... Perhaps the easiest solution is to just NOT use a PSF, and rather read in with a LAMMPS topology parser. Probably worth exploring...
For a long term solution, I think it would be useful to look into how VMD's PSF parser handles this file, as loading this file into VMD yields correct behavior.
Yeah I had a quick look at this issue before, whitespace separated with missing columns is a read headache to parse (esp. without killing performance).
@dstelter92 Just a quick note, the parsers/readers are defined on the fly in MDAnalysis, so you can easily monkeypatch without having to have a custom installation of MDAnalysis. eg:
# in your analysis script, or imported **after** MDAnalysis
class MyPSFParser():
# setting this is what makes MDAnalysis recognise this as class for PSF files
format = 'PSF'
# copy paste existing code
# modify to parse these weird files
u = mda.Universe('something.psf', ...) # will now use the class above not the default
Expected behavior
len(Universe.residues) should provide the correct number of residues for the system.
Actual behavior
len(U.residues) is incorrect and inconsistent with the supplied PSF file for large (>10000 residues) systems only.
I have a large homogenous molecular system with 17,600 molecules (12 atoms/molecule) that I am treating as residues for analysis. I expect that the len(U.residues) should give me that, but instead it gives 10760, which is incorrect. I have verified my PSF and there are indeed 17,600 residues (confirmed by loading into VMD). It seems that excess atoms are being assigned to residues with index > 10000 leading to the incorrect count. The total number of atoms in the universe is correct.
I have an identical system setup with 5,100 molecules that has no issues, which is why I think this bug depends on the number of residues.
It's also worth mentioning that this leads to incorrect behavior when identifying the resid for a particular atom, ie: Universe.atoms[-1].resid
I'm more than happy to provide my inputs for testing.
EDIT: I should add, the DCD is from a LAMMPS trajectory.
Code to reproduce the behavior
I don't think this will work with any of the test data as I think the bug requires > 9999 residues.
When I run this code, I find that every residue from 1000 -> 1759 has 10x the atoms that it should (120 atoms/residue). Any help here would be greatly appreciated, thanks for your time.
Currently version of MDAnalysis
python -V
)? 2.7.12