I'm rejigging the topology readers into a class-based structure for 0.9.
Original comment by richardjgowers
on 5 Feb 2015 at 4:32
+1 for faster GRO (and PDB) readers (both as coordinates and topology readers).
Probably Cython would already provide a good compromise between ease of
implementation and speed.
This would be a very useful improvement for users (and also make interactive
work much less painful).
Original comment by orbeckst
on 5 Feb 2015 at 5:28
@ianmkenney recently benchmarked loading times with various combinations of PDB/GRO/TPR/XTC (with MDA v0.11.0). The results are summarized in a report [1] and there's also a notebook. In particular, when looking at Fig 4a one sees that loading coordinates from PDBs is pretty bad whereas the best performance came from TPR for topology and XTC for coordinates (discounting building the XTC index). There's obviously more to be done (e.g. what does this look like for other trajectory formats) but this should be a starting point for defining areas where we need to improve in order to handle bigger systems.
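For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of timing Universe construction for different topology/trajectory combinations (the file names are placeholders, not the actual benchmark inputs):

```python
import time
import MDAnalysis as mda

# (topology, trajectory) pairs to compare; None means the topology
# file also supplies the coordinates.
combinations = [
    ("system.pdb", None),
    ("system.gro", None),
    ("system.tpr", "system.xtc"),
]

for top, traj in combinations:
    t0 = time.perf_counter()
    u = mda.Universe(top) if traj is None else mda.Universe(top, traj)
    elapsed = time.perf_counter() - t0
    print(f"{top} + {traj}: {elapsed:.2f} s ({len(u.atoms)} atoms)")
```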
pd.to_pickle("a_bunch_of_bullshit.df")
:D
I think with these benchmarks you also need to drill down a little to see where the time is being spent. Universe creation has 3 components:

- parsing the topology (the Parser)
- reading the coordinates (the Reader)
- building the Universe itself from these
Using a system of 760k atoms of water (with 0.13.0 admittedly) I get:
```
In [3]: %timeit u = mda.Universe('out.gro')
1 loops, best of 3: 5.1 s per loop

In [7]: %timeit GROReader('out.gro')
1 loops, best of 3: 1.2 s per loop

In [8]: %timeit GROParser('out.gro').parse()
1 loops, best of 3: 2.81 s per loop
```
So the split between Parser/Reader/Universe is roughly 55/24/20% (2.81 s and 1.2 s of the 5.1 s total, with the remainder spent assembling the Universe).
And then profiling the Parser tells you that half of the Parser time is spent guessing elements!
```
cProfile.run('GROParser("out.gro").parse()')
         6373034 function calls in 5.428 seconds

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   796624    2.063    0.000    2.109    0.000  core.py:159(guess_atom_element)
```
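To drill in further, a sketch of how to sort the full profile with `pstats` (the `GROParser` import path is an assumption based on ~0.13-era MDAnalysis and may differ between versions):

```python
import cProfile
import pstats

# Assumed import path for GROParser as used in the session above;
# it may differ between MDAnalysis versions.
from MDAnalysis.topology.GROParser import GROParser

# Dump the profile to a file, then print the ten functions with the
# highest own ("tot") time -- guess_atom_element shows up near the top.
cProfile.run('GROParser("out.gro").parse()', "parser.prof")
pstats.Stats("parser.prof").sort_stats("tottime").print_stats(10)
```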
So I don't think it's really a format-specific problem. There are definitely still some easy performance gains to be made.
@orbeckst would it be possible to publish the input files for Gromacs as well? Or maybe the data you produced with Gromacs, so that it's easier for us to run these tests ourselves?

For the big PDB/XTC files you can try out the new large file storage: https://github.com/blog/2069-git-large-file-storage-v1-0
For just generating large files to load, I usually do `genbox -box 20 20 20` or `gmx solvate -box 20 20 20`, which creates a 20^3 nm box of water.
I'll check with @ianmkenney to get more data online.
Btw, I am being told that this issue might go away when #363 hits :-).
I did some quick benchmarking with the 3.5M atom system from here. It looks like we've shaved a little more time off since the last release. I'm going to close this, but further performance gains are always possible.
| format | 0.15.0 release | current develop |
|---|---|---|
| GRO | 30 s | 21 s |
| PDB | 38 s | 24 s |
I don't think it's a big surprise if I say that loading a GRO file in MDAnalysis can be painfully slow. I think there is room for improvement, and it would be a good idea to have a fast and reliable solution working for release 1.0.
Original issue reported on code.google.com by
sebastie...@gmail.com
on 5 Feb 2015 at 4:23