This is an ongoing project I am working on which involves the analysis of ~150 trajectories. Still WIP but I am happy to share some insights and some possible limitations when I try to push the parallel analysis of MDA 2.0.0 to its limit. By in-situ analysis, I mean some preliminary analysis on the raw trajectory (unabridged, uncentered, unwrap-needed).

Trajectory System

Multimeric membrane protein (~1500 residues)
~250,000 atoms
PDB file ~20 MB
150 trajectories
still running; should be >1000 frames per trajectory.
XTC file

Possible Issues

Offsets need to be reloaded every time the trajectories are updated and it is taking quite some time. The possible solution is hacking the code or go TNG. #3225
Memory footprint (see benchmark below) of a Universe is quite large considering there are 150 of them especially when Transformations are added. The following code will crash with 64 GB of memory.
```
Total time: 38.0997 s
File: <ipython-input-13-f2f3d3587ffa>
Function: load_trajectory at line 1
```

Line # Hits Time Per Hit % Time Line Contents

 1                                           def load_trajectory(trajectory):
 2                                                #   This is without reloading offset
 3         2    8488084.0 4244042.0     22.3      u = mda.Universe(trajectory + '/top.pdb',
 4         1          1.0      1.0      0.0                       trajectory + '/md.xtc')
 5                                               #   Need bond information for unwrapping, 
                                                 #   maybe guess_bond will work as well?
 6         1    8312620.0 8312620.0     21.8      u_bond = mda.Universe(trajectory + '/md.tpr')
 7         1    3627191.0 3627191.0      9.5      u.add_bonds(u_bond.bonds.to_indices())
 8                                           
 9         1       3489.0   3489.0      0.0      u_prot = u.select_atoms('protein')
10         1          1.0      1.0      0.0      prot_chain_list = []
11        11        335.0     30.5      0.0      for chain in u_prot.segments:
12        10        818.0     81.8      0.0          prot_chain_list.append(chain.atoms)
13         1      16044.0  16044.0      0.0      non_prot = u.select_atoms('not protein')
14                                           
15         1   10666896.0 10666896.0     28.0    unwrap_trans = trans.unwrap(u.atoms)
                                                 # for grouping protein subunits together
16         1         14.0     14.0      0.0      prot_group = GroupHug(*prot_chain_list)
17         1         25.0     25.0      0.0      center_in_box = trans.center_in_box(u_prot)
18         1          7.0      7.0      0.0      wrap = trans.wrap(non_prot)
19         1       3238.0   3238.0      0.0      rot_fit_trans = trans.fit_rot_trans(u.select_atoms('name CA'), u.select_atoms('name CA'))
20         1    6654341.0 6654341.0     17.5     u.trajectory.add_transformations(*[unwrap_trans, prot_group, center_in_box, wrap, rot_fit_trans])
21         1          5.0      5.0      0.0      del u_bond
22         1     326614.0 326614.0      0.9      gc.collect()
23         1          2.0      2.0      0.0      return u

load_job_list = [] for trajectory in trajectory_list: load_job_list.append(dask.delayed(load_trajectory)(trajectory)) trajectory_universe_files = dask.compute(load_job_list)[0]

One solution is to serialize the `Universe` into a tmp pickle file and pass it around only with the file name instead of directly in the memory.
- **Memory footprint** of the PCA analysis is quite large if I want to serialize the whole PCA object (for projecting the trajectories into the existing PC space).
- A better way to **organize the analysis**. So far I have been using dask dataframe for better partitioning analysis tasks. 

## Benchmark ##
- Offset Reloading
   -  TODO
- Universe/AtomGroup Serialization
```python
# With Transformations
u = load_trajectory(trajectory_list[0])
with open('universe.pkl', 'wb') as f:
    pickle.dump(u, f)
!du -sh ./universe.pkl
57M ./universe.pkl

# Without Transformations
u = mda.Universe(trajectory + '/top.pdb',
                               trajectory + '/md.xtc')
with open('universe.pkl', 'wb') as f:
    pickle.dump(u, f)
!du -sh ./universe.pkl
38M ./universe.pkl

Memory Footprint

%mprun -f load_trajectory u = load_trajectory(trajectory_list[0])
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
 6    157.9 MiB    157.9 MiB           1   def load_trajectory(trajectory):
 7                                             #    print(trajectory)
 8    234.9 MiB     77.0 MiB           2       u = mda.Universe(trajectory + '/top.pdb',
 9    157.9 MiB      0.0 MiB           1                        trajectory + '/md.xtc')
10                                         
11    450.7 MiB    215.9 MiB           1       u_bond = mda.Universe(trajectory + '/md.tpr')
12    487.1 MiB     36.4 MiB           1       u.add_bonds(u_bond.bonds.to_indices())
13                                         
14    487.1 MiB      0.0 MiB           1       u_prot = u.select_atoms('protein')
15    487.1 MiB      0.0 MiB           1       prot_chain_list = []
16    487.1 MiB      0.0 MiB          11       for chain in u_prot.segments:
17    487.1 MiB      0.0 MiB          10           prot_chain_list.append(chain.atoms)
18    496.9 MiB      9.8 MiB           1       non_prot = u.select_atoms('not protein')
19                                         
20    623.6 MiB    126.7 MiB           1       unwrap_trans = trans.unwrap(u.atoms)
21    623.6 MiB      0.0 MiB           1       prot_group = GroupHug(*prot_chain_list)
22    623.6 MiB      0.0 MiB           1       center_in_box = trans.center_in_box(u_prot)
23    623.6 MiB      0.0 MiB           1       wrap = trans.wrap(non_prot)
24    623.6 MiB      0.0 MiB           1       rot_fit_trans = trans.fit_rot_trans(u.select_atoms('name CA'), u.select_atoms('name CA'))
25    625.1 MiB      1.5 MiB           1       u.trajectory.add_transformations(*[unwrap_trans, prot_group, center_in_box, wrap, rot_fit_trans])
26    625.1 MiB      0.0 MiB           1       del u_bond
27    478.4 MiB   -146.8 MiB           1       gc.collect()
28    478.4 MiB      0.0 MiB           1       return u

Analysis Performance
- TODO

MDAnalysis / mdanalysis

Benchmark on Parallel In-situ Analysis #3231

Trajectory System

Possible Issues

Line # Hits Time Per Hit % Time Line Contents