MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.32k stars 651 forks source link

Cannot access frames or slice trajectories for large or combined dcds #4777

Open vishlashkari opened 2 weeks ago

vishlashkari commented 2 weeks ago

Expected behavior

  1. Slice trajectories as described in the userguide. Eventual goal is to concatenate / combine a bunch of .dcd trajectories (load all at once) and do analysis on every 10th frame, so basically exactly #3075

  2. Use existing workaround to slice individual trajectories into new dcds and then concatenate those.

Actual behavior

  1. When iterating over a dcd (by any more than 1 frame at a time) or when attempting to access frames 939+ (have tried many dcds):
    
    OSError                                   Traceback (most recent call last)
    Cell In[3], line 1
    ----> 1 for ts in u.trajectory[::10]:
      2     print(ts.frame)

File ~\AppData\Local\anaconda3\envs\mdanalysis\Lib\site-packages\MDAnalysis\coordinates\base.py:211, in FrameIteratorSliced.iter(self) 209 def iter(self): 210 for i in range(self.start, self.stop, self.step): --> 211 yield self.trajectory[i] 212 self.trajectory.rewind()

File ~\AppData\Local\anaconda3\envs\mdanalysis\Lib\site-packages\MDAnalysis\coordinates\base.py:833, in ProtoReader.getitem(self, frame) 831 if isinstance(frame, numbers.Integral): 832 frame = self._apply_limits(frame) --> 833 return self._read_frame_withaux(frame) 834 elif isinstance(frame, (list, np.ndarray)): 835 if len(frame) != 0 and isinstance(frame[0], (bool, np.bool)): 836 # Avoid having list of bools

File ~\AppData\Local\anaconda3\envs\mdanalysis\Lib\site-packages\MDAnalysis\coordinates\base.py:865, in ProtoReader._read_frame_with_aux(self, frame) 863 def _read_frame_with_aux(self, frame): 864 """Move to frame, updating ts with trajectory, transformations and auxiliary data.""" --> 865 ts = self._read_frame(frame) # pylint: disable=assignment-from-no-return 866 for aux in self.aux_list: 867 ts = self._auxs[aux].update_ts(ts)

File ~\AppData\Local\anaconda3\envs\mdanalysis\Lib\site-packages\MDAnalysis\coordinates\DCD.py:198, in DCDReader._read_frame(self, i) 196 """read frame i""" 197 self._frame = i - 1 --> 198 self._file.seek(i) 199 return self._read_next_timestep()

File ~\AppData\Local\anaconda3\envs\mdanalysis\Lib\site-packages\MDAnalysis\lib\formats\libdcd.pyx:400, in MDAnalysis.lib.formats.libdcd.DCDFile.seek()

OSError: DCD seek failed with DCD error=Normal EOF


2. When using the workaround and only writing every 10th frame manually with i % 10 == 0 (no slice), I can iterate to the end of the dcd with no issue and write every 10th frame. However, I can only do this if 1 dcd is loaded. If I try to load 2 dcds, then I get the same error. So I can do this on all individual dcds in my overall simulation. Then I can load all of the shortened trajectories at once and avoid the above problem by simply using every frame (the stride/slice has already been applied now). However, the issue is that the ts.time values of the new trajectory are not correct and the original ts.time information seems to have been lost. The frames of the new trajectory are all separated by ~1 picosecond. The time also resets to 0 at the frame where two shortened dcds got concatenated. (e.g. times = 1,2,...,999,1000,0,1,2,...) So I cannot graph my metrics against the original time scale as I intended. A workaround would be to use my step size and frame save rate, but I am wondering if this behavior is expected. This time discrepancy in written dcds is also present on Linux.

In short, the second problem is that when writing a dcd from a loaded dcd, the time of each frame is incorrect/lost.

## Code to reproduce the behavior ##
1. The test files are not long enough for this issue. My error begins consistently at frame 939. When I put 10x of the test DCD to get 940+ frames, I don't get any of these errors. But in that case I am using the test PSF because the test PDB gives an error with the test DCD. When I run this with my exact same files on Linux, I also don't get this first problem.
<!-- Show us how to reproduce the failure. If you can, use trajectory files from the test data. Use the code snipped below as a starting point. -->

Iterating over every frame with no skips produces no errors.

for ts in u.trajectory: print(ts.frame)

Directly from the userguide:

u = mda.Universe('test.pdb', 'test.dcd') fiter = u.trajectory[10::10] frames = [ts.frame for ts in fiter] print(frames, u.trajectory.frame)

Out: ... OSError: DCD seek failed with DCD error=Normal EOF

And similarly, this also fails if the 10 is replaced with anything other than 1:

u = mda.Universe('test.pdb', 'test.dcd') for ts in u.trajectory[::10]: print(ts.frame)

Out: ... OSError: DCD seek failed with DCD error=Normal EOF

Accessing any frame 938 or before is fine.

In: u.trajectory[938] Out: < Timestep 938 with unit cell dimensions [132.12038 120.366974 119.66685 90. 90. 90. ] >

But for all frames 939+:

In: u.trajectory[939] Out: ... OSError: DCD seek failed with DCD error=Normal EOF


2. Workaround with 1 dcd loaded works, but the times of the frames of the written dcd do not match that of the original:

u = mda.Universe("test.pdb", "test.dcd") ag = u.select_atoms('all') with mda.Writer('out.dcd', ag.n_atoms) as w: for ts in u.trajectory: if ts.frame % 10 == 0: # Write every 10th frame w.write(ag.atoms) print(ts.frame) print('Done')


With multiple dcds loaded, prints Done with no errors but the last frame recorded is 930, and frame times are still incorrect:

u = mda.Universe("test.pdb", "test.dcd", "test.dcd") ag = u.select_atoms('all') with mda.Writer('out.dcd', ag.n_atoms) as w: for ts in u.trajectory: if ts.frame % 10 == 0: # Write every 10th frame w.write(ag.atoms) print(ts.frame) print('Done')



## Current version of MDAnalysis ##

- Which version are you using? 2.7.0
- Which version of Python (`python -V`)? Python 3.12.0 (using JupyterLab or Spyder)
- Which operating system? Windows 11 (first issue does not occur on Linux with same pdb and dcd files, frame times still an issue)
orbeckst commented 1 day ago
  1. How did you produce your DCD files? Does your program properly record the number of frames in it?
  2. How big (in GiB) are your DCD files? How many atoms per frame?
  3. You can try the DCD from https://www.mdanalysis.org/MDAnalysisData/adk_equilibrium.html which contains >4000 frames (install MDAnalysisData with e.g. mamba install mdanalysisdata). This works for me:
    
    In [1]: from MDAnalysisData import datasets

In [2]: adk = datasets.fetch_adk_equilibrium()

In [3]: import MDAnalysis as mda

In [4]: u = mda.Universe(adk.topology, adk.trajectory) ~/anaconda3/envs/mda311/lib/python3.11/site-packages/MDAnalysis/coordinates/DCD.py:165: DeprecationWarning: DCDReader currently makes independent timesteps by copying self.ts while other readers update self.ts inplace. This behavior will be changed in 3.0 to be the same as other readers. Read more at https://github.com/MDAnalysis/mdanalysis/issues/3889 to learn if this change in behavior might affect you. warnings.warn("DCDReader currently makes independent timesteps"

In [5]: u.trajectory Out[5]: <DCDReader ~/MDAnalysis_data/adk_equilibrium/1ake_007-nowater-core-dt240ps.dcd with 4187 frames of 3341 atoms>

In [6]: frames = [ts.frame for ts in u.trajectory[10::10]]

In [7]: print(frames, u.trajectory.frame) [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000, 1010, 1020, 1030, 1040, 1050, 1060, 1070, 1080, 1090, 1100, 1110, 1120, 1130, 1140, 1150, 1160, 1170, 1180, 1190, 1200, 1210, 1220, 1230, 1240, 1250, 1260, 1270, 1280, 1290, 1300, 1310, 1320, 1330, 1340, 1350, 1360, 1370, 1380, 1390, 1400, 1410, 1420, 1430, 1440, 1450, 1460, 1470, 1480, 1490, 1500, 1510, 1520, 1530, 1540, 1550, 1560, 1570, 1580, 1590, 1600, 1610, 1620, 1630, 1640, 1650, 1660, 1670, 1680, 1690, 1700, 1710, 1720, 1730, 1740, 1750, 1760, 1770, 1780, 1790, 1800, 1810, 1820, 1830, 1840, 1850, 1860, 1870, 1880, 1890, 1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020, 2030, 2040, 2050, 2060, 2070, 2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190, 2200, 2210, 2220, 2230, 2240, 2250, 2260, 2270, 2280, 2290, 2300, 2310, 2320, 2330, 2340, 2350, 2360, 2370, 2380, 2390, 2400, 2410, 2420, 2430, 2440, 2450, 2460, 2470, 2480, 2490, 2500, 2510, 2520, 2530, 2540, 2550, 2560, 2570, 2580, 2590, 2600, 2610, 2620, 2630, 2640, 2650, 2660, 2670, 2680, 2690, 2700, 2710, 2720, 2730, 2740, 2750, 2760, 2770, 2780, 2790, 2800, 2810, 2820, 2830, 2840, 2850, 2860, 2870, 2880, 2890, 2900, 2910, 2920, 2930, 2940, 2950, 2960, 2970, 2980, 2990, 3000, 3010, 3020, 3030, 3040, 3050, 3060, 3070, 3080, 3090, 3100, 3110, 3120, 3130, 3140, 3150, 3160, 3170, 3180, 3190, 3200, 3210, 3220, 3230, 3240, 3250, 3260, 3270, 3280, 3290, 3300, 3310, 3320, 3330, 3340, 3350, 3360, 3370, 3380, 3390, 3400, 3410, 3420, 3430, 3440, 3450, 3460, 3470, 3480, 3490, 3500, 3510, 3520, 3530, 3540, 3550, 3560, 3570, 3580, 3590, 3600, 3610, 3620, 3630, 3640, 3650, 3660, 3670, 3680, 3690, 3700, 3710, 3720, 3730, 3740, 3750, 3760, 3770, 3780, 3790, 3800, 3810, 3820, 3830, 3840, 3850, 3860, 3870, 3880, 3890, 3900, 3910, 3920, 3930, 3940, 3950, 3960, 3970, 3980, 3990, 4000, 4010, 4020, 4030, 4040, 4050, 4060, 4070, 4080, 4090, 4100, 4110, 4120, 4130, 4140, 4150, 4160, 4170, 4180] 0