This is an ongoing project I am working on which involves the analysis of ~150 trajectories. It is still a WIP, but I am happy to share some insights and some possible limitations I have run into while trying to push the parallel analysis capabilities of MDAnalysis 2.0.0 to their limit. By in-situ analysis, I mean preliminary analysis on the raw trajectories (unabridged, uncentered, and in need of unwrapping).
## Trajectory System ##
- Multimeric membrane protein (~1500 residues)
- ~250,000 atoms
- PDB file ~20 MB
- 150 trajectories (still running; should be >1000 frames per trajectory)
- XTC format
## Possible Issues ##
- **Offset reloading.** Offsets need to be reloaded every time the trajectories are updated, and this takes quite some time. Possible solutions are hacking the code or switching to the TNG format. #3225 As a stopgap, the offsets can at least be rebuilt in parallel; see the sketch below.
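A minimal sketch of that stopgap: since simply opening a `Universe` forces MDAnalysis to (re)build and cache the XTC frame offsets, the rebuild can be farmed out to dask workers once per batch of updated trajectories (same `top.pdb`/`md.xtc` layout as used throughout this post):

```python
import dask
import MDAnalysis as mda

@dask.delayed
def rebuild_offsets(trajectory):
    # Opening the Universe makes MDAnalysis read or rebuild the XTC frame
    # offsets and cache them in a hidden .npz file next to the trajectory,
    # so later (serial) Universe creation does not pay this cost again.
    mda.Universe(trajectory + '/top.pdb', trajectory + '/md.xtc')
    return trajectory

dask.compute([rebuild_offsets(t) for t in trajectory_list])
```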
- **Memory footprint** (see benchmark below) of a `Universe` is quite large considering there are 150 of them, especially when `Transformations` are added. The following code crashes even with 64 GB of memory.
Profiling `load_trajectory` with `line_profiler` shows where the time goes:

```
Total time: 38.0997 s
File: <ipython-input-13-f2f3d3587ffa>
Function: load_trajectory at line 1

Line #  Hits        Time     Per Hit  % Time  Line Contents
============================================================
     1                                        def load_trajectory(trajectory):
     2                                            # This is without reloading offsets
     3      2   8488084.0   4244042.0   22.3     u = mda.Universe(trajectory + '/top.pdb',
     4      1         1.0         1.0    0.0                      trajectory + '/md.xtc')
     5                                            # Need bond information for unwrapping;
                                                  # maybe guess_bonds would work as well?
     6      1   8312620.0   8312620.0   21.8     u_bond = mda.Universe(trajectory + '/md.tpr')
     7      1   3627191.0   3627191.0    9.5     u.add_bonds(u_bond.bonds.to_indices())
     8
     9      1      3489.0      3489.0    0.0     u_prot = u.select_atoms('protein')
    10      1         1.0         1.0    0.0     prot_chain_list = []
    11     11       335.0        30.5    0.0     for chain in u_prot.segments:
    12     10       818.0        81.8    0.0         prot_chain_list.append(chain.atoms)
    13      1     16044.0     16044.0    0.0     non_prot = u.select_atoms('not protein')
    14
    15      1  10666896.0  10666896.0   28.0     unwrap_trans = trans.unwrap(u.atoms)
                                                  # for grouping protein subunits together
    16      1        14.0        14.0    0.0     prot_group = GroupHug(*prot_chain_list)
    17      1        25.0        25.0    0.0     center_in_box = trans.center_in_box(u_prot)
    18      1         7.0         7.0    0.0     wrap = trans.wrap(non_prot)
    19      1      3238.0      3238.0    0.0     rot_fit_trans = trans.fit_rot_trans(u.select_atoms('name CA'),
                                                                                     u.select_atoms('name CA'))
    20      1   6654341.0   6654341.0   17.5     u.trajectory.add_transformations(*[unwrap_trans, prot_group,
                                                                                    center_in_box, wrap, rot_fit_trans])
    21      1         5.0         5.0    0.0     del u_bond
    22      1    326614.0    326614.0    0.9     gc.collect()
    23      1         2.0         2.0    0.0     return u
```
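`GroupHug` is a custom transformation whose implementation is not shown in this post. Purely as an illustration of what it needs to achieve (this sketch is my guess, not the actual class), a per-frame callable that pulls every chain to the periodic image nearest the first chain could look like:

```python
import numpy as np

class GroupHug:
    """Hypothetical sketch of a chain-grouping transformation: translate
    each protein chain by whole box vectors so it sits in the periodic
    image closest to the first chain. The real GroupHug may differ."""

    def __init__(self, *groups):
        self.groups = groups

    def __call__(self, ts):
        # MDAnalysis transformations are callables acting on a Timestep.
        ref = self.groups[0].center_of_geometry()
        box = ts.dimensions[:3]  # assumes an orthorhombic box
        for ag in self.groups[1:]:
            shift = ref - ag.center_of_geometry()
            ag.positions += np.round(shift / box) * box
        return ts
```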
```python
load_job_list = []
for trajectory in trajectory_list:
    load_job_list.append(dask.delayed(load_trajectory)(trajectory))
trajectory_universe_files = dask.compute(load_job_list)[0]
```
One solution is to serialize each `Universe` into a temporary pickle file and pass around only the file name instead of the in-memory object.
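A minimal sketch of that idea (the `dump_universe`/`load_universe` helpers are names I made up for illustration):

```python
import pickle
import tempfile

def dump_universe(u):
    # Pickle the Universe to a temporary file and hand back only the path,
    # so dask ships a short string between workers instead of the object.
    with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as f:
        pickle.dump(u, f)
        return f.name

def load_universe(pkl_path):
    # Workers re-hydrate the Universe locally from the pickle file.
    with open(pkl_path, 'rb') as f:
        return pickle.load(f)

pkl_files = [dump_universe(load_trajectory(t)) for t in trajectory_list]
```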
- **Memory footprint** of the PCA analysis is also quite large if I serialize the whole PCA object (needed for projecting the other trajectories into the existing PC space); see the sketch after this list for a leaner alternative.
- A better way to **organize the analysis** is needed. So far I have been using Dask DataFrames to partition the analysis tasks.
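One way to shrink the PCA footprint (a sketch, assuming a fitted `MDAnalysis.analysis.pca.PCA` object `pca` with the 2.0.0 attribute names `mean` and `results.p_components`; `pca_space.npz` is a file name I made up) is to persist only the arrays the projection actually needs:

```python
import numpy as np

# Save just the mean structure and the eigenvectors instead of pickling
# the whole PCA object (which drags its attached Universe along).
np.savez('pca_space.npz',
         mean=np.asarray(pca.mean).ravel(),       # mean coordinates, length 3N
         p_components=pca.results.p_components)   # eigenvectors, shape (3N, n)

# Later, project a frame manually, mirroring what PCA.transform() does:
space = np.load('pca_space.npz')
xyz = u.select_atoms('name CA').positions.ravel()
proj = (xyz - space['mean']) @ space['p_components'][:, :10]
```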
## Benchmark ##
- Offset Reloading
  - TODO
- Universe/AtomGroup Serialization
```python
# With Transformations
u = load_trajectory(trajectory_list[0])
with open('universe.pkl', 'wb') as f:
    pickle.dump(u, f)

!du -sh ./universe.pkl
# 57M    ./universe.pkl

# Without Transformations
u = mda.Universe(trajectory + '/top.pdb',
                 trajectory + '/md.xtc')
with open('universe.pkl', 'wb') as f:
    pickle.dump(u, f)

!du -sh ./universe.pkl
# 38M    ./universe.pkl
```
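`AtomGroup`s are picklable in 2.0.0 as well; as far as I understand, pickling an `AtomGroup` also serializes its anchoring `Universe`, so I would expect a file in the same size range (not benchmarked yet):

```python
# Pickling an AtomGroup carries its whole anchoring Universe with it.
ag = u.select_atoms('protein')
with open('atomgroup.pkl', 'wb') as f:
    pickle.dump(ag, f)
```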