MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

speed up GRO/PDB file loading #212

Closed. GoogleCodeExporter closed this issue 7 years ago.

GoogleCodeExporter commented 9 years ago

I don't think it is a big surprise if I say that loading a GRO file in MDAnalysis can be painfully slow. I think there is room for improvement, and it would be a good idea to have a fast and reliable solution in place for release 1.0.

Original issue reported on code.google.com by sebastie...@gmail.com on 5 Feb 2015 at 4:23

GoogleCodeExporter commented 9 years ago
I'm rejigging the topology readers into a class-based structure for 0.9.

Original comment by richardjgowers on 5 Feb 2015 at 4:32

GoogleCodeExporter commented 9 years ago
+1 for faster GRO (and PDB) readers (both as coordinate and topology readers). Cython would probably already provide a good compromise between ease of implementation and speed.

This would be a very useful improvement for users (and would also make interactive work much less painful).
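
To illustrate the kind of gain available even before reaching for Cython, here is a minimal NumPy-based sketch that exploits the fixed-width GRO layout (read_gro_coords is a hypothetical helper, not MDAnalysis's actual reader):

    import numpy as np

    # Hypothetical fast path for GRO coordinates: each atom line stores
    # x, y, z as three fixed-width %8.3f fields starting at character 20,
    # so the columns can be sliced directly instead of tokenized.
    def read_gro_coords(filename):
        with open(filename) as f:
            f.readline()                   # title line
            n_atoms = int(f.readline())    # atom count
            lines = [f.readline() for _ in range(n_atoms)]
        return np.array(
            [(line[20:28], line[28:36], line[36:44]) for line in lines],
            dtype=np.float64,
        )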

Original comment by orbeckst on 5 Feb 2015 at 5:28

orbeckst commented 9 years ago

@ianmkenney recently benchmarked loading times for various combinations of PDB/GRO/TPR/XTC (with MDA v0.11.0). The results are summarized in a report [1], and there is also a notebook. In particular, Fig 4a shows that loading coordinates from PDB files is noticeably slow, whereas the best performance came from a TPR topology with XTC coordinates (discounting the cost of building the XTC index). There is obviously more to be done (e.g., how do other trajectory formats compare?), but this should be a starting point for identifying the areas we need to improve in order to handle bigger systems.

  1. Kenney, Ian M.; Beckstein, Oliver (2015). Technical Report: SPIDAL Summer REU 2015: Biomolecular benchmark systems. figshare. doi:10.6084/m9.figshare.1588804
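
To reproduce numbers like these, a minimal timing harness along the following lines could be used (a sketch; the file names below are placeholders, not the actual benchmark files):

    import time
    import MDAnalysis as mda

    # Placeholder file names for the benchmark systems.
    combinations = [
        ("system.pdb",),             # PDB as both topology and coordinates
        ("system.gro",),             # GRO as both topology and coordinates
        ("system.tpr", "traj.xtc"),  # TPR topology + XTC coordinates
    ]

    for args in combinations:
        start = time.perf_counter()
        mda.Universe(*args)          # build topology and attach the reader
        print(args, "%.2f s" % (time.perf_counter() - start))
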
richardjgowers commented 9 years ago

pd.to_pickle("a_bunch_of_bullshit.df") :D

I think with these benchmarks you also need to drill down a little to see where the time is being spent. Universe creation has 3 components: parsing the topology (the Parser), attaching the coordinate reader (the Reader), and constructing the Universe object itself.

Using a system of 760k atoms of water (with 0.13.0 admittedly) I get:

In [3]: %timeit u = mda.Universe('out.gro')
1 loops, best of 3: 5.1 s per loop

In [7]: %timeit GROReader('out.gro')
1 loops, best of 3: 1.2 s per loop

In [8]: %timeit GROParser('out.gro').parse()
1 loops, best of 3: 2.81 s per loop

So the split between Parser/Reader/Universe is roughly 55/24/21%.

And then profiling the Parser tells you that half of the Parser time is spent guessing elements!

cProfile.run('GROParser("out.gro").parse()')
         6373034 function calls in 5.428 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   796624    2.063    0.000    2.109    0.000 core.py:159(guess_atom_element)

So I don't think it's really a format-specific problem. There are definitely still some easy performance gains to be made.
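
For illustration, one cheap optimization would be to memoize the guess per atom name (a hypothetical sketch, not what the parser currently does); a water box repeats only a handful of atom names, so nearly every call becomes a cache hit:

    from functools import lru_cache

    # In this era guess_atom_element lives in MDAnalysis.topology.core
    # (see core.py:159 in the profile above).
    from MDAnalysis.topology.core import guess_atom_element

    # Hypothetical sketch: cache guesses keyed by atom name. A 760k-atom
    # water box has only a few distinct names (e.g. OW, HW1, HW2), so the
    # expensive string parsing runs once per unique name.
    @lru_cache(maxsize=None)
    def cached_guess_atom_element(name):
        return guess_atom_element(name)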

kain88-de commented 9 years ago

@orbeckst would it be possible to publish the input files for GROMACS as well? Or maybe the data you produced with GROMACS, so that it's easier for us to run these tests ourselves?

For the big PDB/XTC files you could try out the new Git Large File Storage: https://github.com/blog/2069-git-large-file-storage-v1-0

richardjgowers commented 9 years ago

For generating large files to load, I usually just run genbox -box 20 20 20 (or gmx solvate -box 20 20 20 in newer GROMACS), which creates a 20 nm cubic box of water.

orbeckst commented 9 years ago

I'll check with @ianmkenney to get more data online.


orbeckst commented 8 years ago

Btw, I am being told that this issue might go away when #363 hits :-).

richardjgowers commented 7 years ago

I did some quick benchmarking with the 3.5M-atom system from here. It looks like we shaved a little more time off since the last release, so I'm going to close this, but there is always more performance to be gained.

format    0.15.0 release    current develop
GRO       30 s              21 s
PDB       38 s              24 s