MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.26k stars 640 forks source link

Benchmark on Parallel In-situ Analysis #3231

Open yuxuanzhuang opened 3 years ago

yuxuanzhuang commented 3 years ago

This is an ongoing project I am working on which involves the analysis of ~150 trajectories. Still WIP but I am happy to share some insights and some possible limitations when I try to push the parallel analysis of MDA 2.0.0 to its limit. By in-situ analysis, I mean some preliminary analysis on the raw trajectory (unabridged, uncentered, unwrap-needed).

Trajectory System

Possible Issues

Line # Hits Time Per Hit % Time Line Contents

 1                                           def load_trajectory(trajectory):
 2                                                #   This is without reloading offset
 3         2    8488084.0 4244042.0     22.3      u = mda.Universe(trajectory + '/top.pdb',
 4         1          1.0      1.0      0.0                       trajectory + '/md.xtc')
 5                                               #   Need bond information for unwrapping, 
                                                 #   maybe guess_bond will work as well?
 6         1    8312620.0 8312620.0     21.8      u_bond = mda.Universe(trajectory + '/md.tpr')
 7         1    3627191.0 3627191.0      9.5      u.add_bonds(u_bond.bonds.to_indices())
 8                                           
 9         1       3489.0   3489.0      0.0      u_prot = u.select_atoms('protein')
10         1          1.0      1.0      0.0      prot_chain_list = []
11        11        335.0     30.5      0.0      for chain in u_prot.segments:
12        10        818.0     81.8      0.0          prot_chain_list.append(chain.atoms)
13         1      16044.0  16044.0      0.0      non_prot = u.select_atoms('not protein')
14                                           
15         1   10666896.0 10666896.0     28.0    unwrap_trans = trans.unwrap(u.atoms)
                                                 # for grouping protein subunits together
16         1         14.0     14.0      0.0      prot_group = GroupHug(*prot_chain_list)
17         1         25.0     25.0      0.0      center_in_box = trans.center_in_box(u_prot)
18         1          7.0      7.0      0.0      wrap = trans.wrap(non_prot)
19         1       3238.0   3238.0      0.0      rot_fit_trans = trans.fit_rot_trans(u.select_atoms('name CA'), u.select_atoms('name CA'))
20         1    6654341.0 6654341.0     17.5     u.trajectory.add_transformations(*[unwrap_trans, prot_group, center_in_box, wrap, rot_fit_trans])
21         1          5.0      5.0      0.0      del u_bond
22         1     326614.0 326614.0      0.9      gc.collect()
23         1          2.0      2.0      0.0      return u

load_job_list = [] for trajectory in trajectory_list: load_job_list.append(dask.delayed(load_trajectory)(trajectory)) trajectory_universe_files = dask.compute(load_job_list)[0]

One solution is to serialize the `Universe` into a tmp pickle file and pass it around only with the file name instead of directly in the memory.
- **Memory footprint** of the PCA analysis is quite large if I want to serialize the whole PCA object (for projecting the trajectories into the existing PC space).
- A better way to **organize the analysis**. So far I have been using dask dataframe for better partitioning analysis tasks. 

## Benchmark ##
- Offset Reloading
   -  TODO
- Universe/AtomGroup Serialization
```python
# With Transformations
u = load_trajectory(trajectory_list[0])
with open('universe.pkl', 'wb') as f:
    pickle.dump(u, f)
!du -sh ./universe.pkl
57M ./universe.pkl

# Without Transformations
u = mda.Universe(trajectory + '/top.pdb',
                               trajectory + '/md.xtc')
with open('universe.pkl', 'wb') as f:
    pickle.dump(u, f)
!du -sh ./universe.pkl
38M ./universe.pkl
orbeckst commented 3 years ago

@edisj might also be interested in following your progress. He is looking at the I/O side with MPI and h5md.