MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

speed up GRO/PDB file loading #212

Closed. GoogleCodeExporter closed this issue 7 years ago.

GoogleCodeExporter commented 9 years ago

I don't think it is a big surprise if I say that loading a GRO file in MDAnalysis can be painfully slow. I think there is room for improvement, and it would be a good idea to have a fast and reliable solution in place for release 1.0.

Original issue reported on code.google.com by sebastie...@gmail.com on 5 Feb 2015 at 4:23

GoogleCodeExporter commented 9 years ago
I'm rejigging the topology readers into a class-based structure for 0.9.

Original comment by richardjgowers on 5 Feb 2015 at 4:32

GoogleCodeExporter commented 9 years ago
+1 for faster GRO (and PDB) readers (both as coordinate and topology readers). Cython would probably already provide a good compromise between ease of implementation and speed.

This would be a very useful improvement for users (and would also make interactive work much less painful).
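
To illustrate the kind of gain available even before reaching for Cython, here is a minimal NumPy-based sketch that exploits the fixed-width GRO layout (read_gro_coords is a hypothetical helper, not MDAnalysis's actual reader):

    import numpy as np

    # Hypothetical fast path for GRO coordinates: each atom line stores
    # x, y, z as three fixed-width %8.3f fields starting at character 20,
    # so the columns can be sliced directly instead of tokenized.
    def read_gro_coords(filename):
        with open(filename) as f:
            f.readline()                   # title line
            n_atoms = int(f.readline())    # atom count
            lines = [f.readline() for _ in range(n_atoms)]
        return np.array(
            [(line[20:28], line[28:36], line[36:44]) for line in lines],
            dtype=np.float64,
        )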

Original comment by orbeckst on 5 Feb 2015 at 5:28

orbeckst commented 9 years ago

@ianmkenney recently benchmarked loading times for various combinations of PDB/GRO/TPR/XTC (with MDA v0.11.0). The results are summarized in a report [1], and there is also a notebook. In particular, Fig 4a shows that loading coordinates from PDB files is noticeably slow, whereas the best performance came from a TPR topology with XTC coordinates (discounting the cost of building the XTC index). There is obviously more to be done (e.g., how do other trajectory formats compare?), but this should be a starting point for identifying the areas we need to improve in order to handle bigger systems.

  1. Kenney, Ian M.; Beckstein, Oliver (2015). Technical Report: SPIDAL Summer REU 2015: Biomolecular benchmark systems. figshare. doi:10.6084/m9.figshare.1588804
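
To reproduce numbers like these, a minimal timing harness along the following lines could be used (a sketch; the file names below are placeholders, not the actual benchmark files):

    import time
    import MDAnalysis as mda

    # Placeholder file names for the benchmark systems.
    combinations = [
        ("system.pdb",),             # PDB as both topology and coordinates
        ("system.gro",),             # GRO as both topology and coordinates
        ("system.tpr", "traj.xtc"),  # TPR topology + XTC coordinates
    ]

    for args in combinations:
        start = time.perf_counter()
        mda.Universe(*args)          # build topology and attach the reader
        print(args, "%.2f s" % (time.perf_counter() - start))
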
richardjgowers commented 9 years ago

pd.to_pickle("a_bunch_of_bullshit.df") :D

I think with these benchmarks you also need to drill down a little to see where the time is being spent. Universe creation has 3 components: parsing the topology (the Parser), attaching the coordinate reader (the Reader), and constructing the Universe object itself.

Using a system of 760k atoms of water (with 0.13.0 admittedly) I get:

In [3]: %timeit u = mda.Universe('out.gro')
1 loops, best of 3: 5.1 s per loop

In [7]: %timeit GROReader('out.gro')
1 loops, best of 3: 1.2 s per loop

In [8]: %timeit GROParser('out.gro').parse()
1 loops, best of 3: 2.81 s per loop

So the split between Parser/Reader/Universe is roughly 55/24/21%.

And then profiling the Parser tells you that half of the Parser time is spent guessing elements!

cProfile.run('GROParser("out.gro").parse()')
         6373034 function calls in 5.428 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   796624    2.063    0.000    2.109    0.000 core.py:159(guess_atom_element)

So I don't think it's really a format-specific problem. There are definitely still some easy performance gains to be made.
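
For illustration, one cheap optimization would be to memoize the guess per atom name (a hypothetical sketch, not what the parser currently does); a water box repeats only a handful of atom names, so nearly every call becomes a cache hit:

    from functools import lru_cache

    # In this era guess_atom_element lives in MDAnalysis.topology.core
    # (see core.py:159 in the profile above).
    from MDAnalysis.topology.core import guess_atom_element

    # Hypothetical sketch: cache guesses keyed by atom name. A 760k-atom
    # water box has only a few distinct names (e.g. OW, HW1, HW2), so the
    # expensive string parsing runs once per unique name.
    @lru_cache(maxsize=None)
    def cached_guess_atom_element(name):
        return guess_atom_element(name)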

kain88-de commented 9 years ago

@orbeckst would it be possible to publish the input files for GROMACS as well? Or maybe the data you produced with GROMACS, so that it's easier for us to run these tests ourselves?

For the big PDB/XTC files you could try out the new Git Large File Storage: https://github.com/blog/2069-git-large-file-storage-v1-0

richardjgowers commented 9 years ago

For generating large files to load, I usually just run genbox -box 20 20 20 (or gmx solvate -box 20 20 20 in newer GROMACS), which creates a 20 nm cubic box of water.

orbeckst commented 9 years ago

I'll check with @ianmkenney to get more data online.


orbeckst commented 8 years ago

Btw, I am being told that this issue might go away when #363 hits :-).

richardjgowers commented 7 years ago

I did some quick benchmarking with the 3.5M-atom system from here. It looks like we shaved a little more time off since the last release, so I'm going to close this, but there is always more performance to be gained.

format    0.15.0 release    current develop
GRO       30 s              21 s
PDB       38 s              24 s