MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Trajectory reads by chunk #238

dotsdl opened this issue 9 years ago (status: Open)

dotsdl commented 9 years ago

I don't know if this is already the case for some trajectory readers, but it would surely bring better performance if the trajectory iterator read portions of a trajectory in chunks (more than just one frame) at a time. Is this feasible to implement?

richardjgowers commented 9 years ago

It should be quite possible to change the iterator to try to yield the next item from a list of Timestep objects and, if that fails, to call a readmany method to refill the list. In pseudocode:

try:
    yield self._buffer[pos]
except IndexError:
    self._buffer = self._readmany()
    yield self._buffer[0]

So a read is only done every n steps. The obvious problem is that it then uses much more memory, but maybe there is something to be gained here.

I think you'd also need to write a special readmany method which capitalises on the fact that you're reading a chunk. And if that's in C, you might need a special Timestep object so you can fill the arrays continuously...

Not sure if I'm thinking in the right direction here :D
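
Fleshing that pseudocode out, a minimal self-contained sketch of the buffered iteration could look like the following; read_frame, n_frames and chunksize are placeholders rather than MDAnalysis API:

def iter_chunked(read_frame, n_frames, chunksize=100):
    """Yield one frame at a time, refilling an internal buffer
    chunksize frames at a time so that disk reads happen less often."""
    buffer = []
    pos = 0
    frame = 0
    while frame < n_frames:
        if pos >= len(buffer):
            # buffer exhausted: do one bulk read of up to chunksize frames
            stop = min(frame + chunksize, n_frames)
            buffer = [read_frame(i) for i in range(frame, stop)]
            pos = 0
        yield buffer[pos]
        pos += 1
        frame += 1

A real implementation would fill the buffer with Timestep objects via a bulk readmany on the reader, but the control flow would be the same.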

mnmelo commented 9 years ago

I don't know if we'd be able to work around the GIL, but it would probably be a task for multithreading, where one of the threads carries on with the CPU-intensive work and another does the I/O precaching in the background.

Some semaphoring would need to be involved, I guess, to prevent reads from the main thread before the I/O one populates the buffer enough.

What do you think?
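
As an illustration, here is a minimal producer/consumer sketch of that idea in which a bounded queue.Queue provides the blocking ("semaphoring") behaviour; read_frame, analyse and n_frames are hypothetical placeholders:

import threading
import queue

def producer(buf, read_frame, n_frames):
    # pre-read frames in the background; put() blocks while the buffer is full
    for i in range(n_frames):
        buf.put(read_frame(i))
    buf.put(None)  # sentinel: no more frames

def consume(read_frame, analyse, n_frames, maxsize=100):
    buf = queue.Queue(maxsize=maxsize)
    reader = threading.Thread(target=producer, args=(buf, read_frame, n_frames))
    reader.start()
    while True:
        frame = buf.get()  # blocks until the producer has put a frame
        if frame is None:
            break
        analyse(frame)
    reader.join()

The GIL is released during the actual file reads, so the main thread can keep the CPU busy in the meantime; any Python-level parsing inside the reader would still contend for the GIL, though.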


seb-buch commented 9 years ago

Even though the idea is appealing, I fear the gain-to-hassle ratio may not be that great... I follow Manel on the GIL/CPU/I/O juggling and the need for semaphoring. The thing is that Cython does not support semaphores yet, so it would require writing everything in pure C to overcome the limitations, with all the portability issues and Python 2/3 support that are well handled by Cython and that we would need to handle directly! In my (admittedly limited) experience, I/O is rarely the bottleneck, and I feel that optimizing the CPU part would be a better approach and a better use of dev time.

dotsdl commented 9 years ago

I don't think either threading or use of a semaphore would be necessary to get a performance boost out of this. As it stands, performing an analysis on each frame of a trajectory requires reading one frame at a time from disk on every iteration of the trajectory reader in use. By chunking, we can avoid the cost of context switching for each and every frame, instead reading from disk only once every <chunksize> frames. The cost is higher memory usage depending on the desired chunksize, but the performance gain could be substantial.

@richardjgowers, your pseudocode example is basically all we need, in principle.

On the user side, I can see the following working for iteration:

u = Universe(TOP, TRAJ)
for ts in u.trajectory(chunksize=100):
    print("do stuff")

I'm almost certain that kind of syntax will work.

mnmelo commented 9 years ago

I think this might work, and certainly more so if the load can be spread between threads, because then you also gain back whatever performance is lost to I/O. Like Séb, though, I'm not sure how much there is to gain. Could we have some sort of benchmarking? Or just test whether that snippet actually speeds anything up?

OK, instead of hypothesizing I just ran a small, unfair test in which I compared getting ~50,000 atom coordinates from either an .xtc file or a list. Getting them from the file took about 40 ms; from the list, 0.85 µs. I tried the same with 5x more atoms: reading from the file scales almost linearly, whereas reading from the list scales slightly sub-linearly, which makes the relative slowdown a bit worse (slightly larger than 5x).

So yep, there might be room for significant improvement. Of course, all will depend on the application. I think a worst-case scenario is:

  - a very large number of frames
  - a very short analysis time per frame
  - a large number of atoms

In such a case the relative proportion of time spent doing I/O and context switching might be quite large.
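
For reference, a rough version of that comparison might look like the following; the topology and trajectory file names are placeholders and the absolute numbers will of course differ:

import timeit
import MDAnalysis as mda

u = mda.Universe("top.gro", "traj.xtc")  # placeholder files

# cache all coordinates in a plain Python list up front
cached = [ts.positions.copy() for ts in u.trajectory]

def from_file():
    u.trajectory[0]  # seek to and read frame 0 from disk
    return u.atoms.positions

def from_list():
    return cached[0]

print(timeit.timeit(from_file, number=100))
print(timeit.timeit(from_list, number=100))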

orbeckst commented 9 years ago

@mnmelo — in your "unfair benchmark", did you take into account the time to read the data into the list in the first place? My understanding of the proposal is that you want to compare "time to read N individual frames into N TimeSteps" to "time to read N = M*P individual frames into M ManyTimeSteps, each of which holds P individual frames".

Furthermore, @dotsdl, you could try implementing this (strike "implementing", I mean "hack together something for benchmarking") for one of the Python-based readers (XYZ or netcdf) and see if it makes a difference. I think the key question really is where one gets the most return for the amount of time spent working on it, i.e. have we profiled the problem?

On that note: It would be really useful if someone stepped up and

  1. identified parts of the code that are performance bottlenecks
  2. collected (or commissioned) reports on slow code.

We could then have a wiki page with a list of targets that warrant improvement. Having it all in one place would help with prioritizing. Using the Performance tag is already a starting point.

orbeckst commented 9 years ago

Btw, on second thought this sounds a bit like implementing the ChainReader on a virtual trajectory, where you buffer P steps in memory and view the whole trajectory as a long chain of M individual chunks.

richardjgowers commented 9 years ago

I put together a quick benchmark with a trajectory I had lying around (100 frames of 20k atoms).

The script computes a 4000x4000 distance array every timestep:

import MDAnalysis as mda
import cProfile
from MDAnalysis.core.distances import distance_array
import pstats

u = mda.Universe('s-md.psf','new.trz')

def analysis(u):
    ag = u.atoms[:4000]
    ag2 = u.atoms[4000:8000]

    for ts in u.trajectory:
        d = distance_array(ag.positions, ag2.positions, box=u.dimensions)

    return None

cProfile.run('analysis(u)', 'timing.out')

p = pstats.Stats('timing.out')
p.strip_dirs()
p.sort_stats('tottime')
p.print_stats(10)

The results:

Fri Apr 10 00:00:10 2015    timing.out

         7722 function calls in 35.808 seconds

   Ordered by: internal time
   List reduced from 92 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100   35.675    0.357   35.678    0.357 {core.distances.distance_array}
      200    0.042    0.000    0.042    0.000 base.py:168(__getitem__)
        1    0.038    0.038   35.808   35.808 base.py:8(analysis)
      103    0.014    0.000    0.014    0.000 {numpy.core.multiarray.fromfile}
      102    0.014    0.000    0.038    0.000 TRZ.py:288(_read_next_timestep)
      202    0.005    0.000    0.007    0.000 base.py:283(convert_pos_from_native)
      101    0.002    0.000    0.003    0.000 base.py:303(convert_velocities_from_native)
      900    0.002    0.000    0.003    0.000 core.py:271(_veclength)
      300    0.002    0.000    0.004    0.000 core.py:277(_angle)
      402    0.002    0.000    0.002    0.000 {numpy.core.multiarray.array}

So I think the only file-read time there is the fromfile call, and the bulk of the time is spent in distance_array... but I was expecting the file read to be much more substantial than 0.014 s? Maybe someone else could try the timing script?

dotsdl commented 9 years ago

@orbeckst I'll give it a shot with the Python-based readers as a zeroth-order test and report back what I get. Since this problem might also scale with the number of atoms (my systems are usually ~150,000), I'll also try out @richardjgowers' test script to see how it behaves.

mnmelo commented 9 years ago

@orbeckst My idea was to find the maximum possible gain if:

  - frame reading occurs separately from per-frame calculations;
  - frame iteration is limited by the CPU, not by frame reading.

This would mean that there would always be a buffer full of frame data to read, and the gain would come mostly from parallelizing the two procedures. I admit it's not a very informative benchmark, save for the indication that there are perhaps significant gains to be made if we can hand off the I/O to a parallel thread.

orbeckst commented 9 years ago

There seem to be a bunch of good ideas in this issue, and the initial benchmarks seem to indicate that some performance improvements might be possible. I'll try to summarize the discussion so far:

Implementation

Single threaded

The single-threaded version might already give an advantage by reading multiple frames into memory at the level of the reader (either C, Cython, or Python) and then having an implementation (similar to what @richardjgowers suggests) that serves the data from memory.

I assume that this would be transparent to the user and pretty much the only thing the user would have to do is to say how big the chunks ought to be (similar to what @dotsdl suggests). Or is the idea that iteration returns the chunks themselves and then user code has to deal with them?
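
The second variant would look something like this (hypothetical syntax, not existing API):

for chunk in u.trajectory.iter_chunks(chunksize=100):  # hypothetical method
    for ts in chunk:
        analyse(ts)  # user-supplied per-frame analysis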

Parallel

I find @mnmelo's idea intriguing, and it would be very neat to get a parallelized version going in which one (or more) threads (or processes, whatever the implementation will be) fill a buffer while the reader reads from the buffer.

User interface

Some things that are not clear to me:

dotsdl commented 9 years ago

Or is the idea that iteration returns the chunks themselves and then user code has to deal with them?

The idea is to make this transparent by default but to give the user some way of tuning the degree of chunked reads. The iterator would still work exactly as it does now (only a single frame active at a time from the reader), but reads from disk would grab more than one frame at a time. Exposing the chunksize would let a user who cares about every last scrap of performance set it to a value of their choice.

I'm playing with this style of implementation for the XYZ reader to see if we get any performance gains. However, I'd bet that any significant gains from this would come from readers that are already fast at per-frame reading, such as the xdr-based readers, since this change only cuts down the cost of context switching.
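
As a rough sketch of what a chunked read could look like for a plain-text format such as XYZ (this is not the actual XYZReader code, just an illustration of the parsing loop):

import numpy as np

def read_xyz_chunk(fh, chunksize):
    """Read up to chunksize XYZ frames from an open file handle."""
    frames = []
    for _ in range(chunksize):
        line = fh.readline()
        if not line:
            break  # end of file
        n_atoms = int(line)
        fh.readline()  # skip the comment line
        coords = np.empty((n_atoms, 3), dtype=np.float64)
        for i in range(n_atoms):
            fields = fh.readline().split()
            coords[i] = [float(x) for x in fields[1:4]]
        frames.append(coords)
    return frames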

hainm commented 9 years ago

In my experience with pytraj, the frame iterator is the best :D (maybe because I did not pay too much attention to the chunk iterator).

You can find chunk_iter in this notebook: http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/dev/speed_test_distance.ipynb

orbeckst commented 9 years ago

A real gain would come from readers implementing chunking at the C/Cython level. DCDs have the Timeseries/Correl implementation (see src/dcdtimeseries.pyx), which either slurps whole trajectories into memory or does simple analyses at the Cython level.
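
For example, the bulk extraction looks roughly like this; the file names are placeholders, and the exact keyword arguments of timeseries vary between MDAnalysis versions:

import MDAnalysis as mda

u = mda.Universe("system.psf", "traj.dcd")  # placeholder files
protein = u.select_atoms("protein")

# pull the coordinates of the selection for all frames into one array
# in a single call instead of iterating frame by frame
coords = u.trajectory.timeseries(protein)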

richardjgowers commented 7 years ago

Has this been done with the MemoryReader?

kain88-de commented 7 years ago

@richardjgowers what do you mean? The MemoryReader brings some performance gains, but at the cost of consuming vast amounts of RAM. A real chunk iterator would be best, as it would also allow quick analysis of 100 GB trajectories.

richardjgowers commented 7 years ago

The MemoryReader is sort of a single-chunk reader, but you're right that it won't work with very large files like that.
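
One way to approximate chunked analysis with what exists today is to load one slice of the trajectory at a time via transfer_to_memory(), at the cost of re-opening the files for every chunk; a sketch, with placeholder file names:

import MDAnalysis as mda

def iter_in_memory_chunks(top, traj, chunksize):
    n_frames = mda.Universe(top, traj).trajectory.n_frames
    for start in range(0, n_frames, chunksize):
        u = mda.Universe(top, traj)
        # load only frames [start, stop) into the MemoryReader
        stop = min(start + chunksize, n_frames)
        u.transfer_to_memory(start=start, stop=stop)
        for ts in u.trajectory:
            yield ts

for ts in iter_in_memory_chunks("top.psf", "traj.dcd", chunksize=1000):
    pass  # per-frame analysis goes here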

richardjgowers commented 6 years ago

https://github.com/hpcanalytics/paper-hpc-py-parallel-mdanalysis

This has a lot on I/O bottlenecks; maybe chunk reading would have helped here.

orbeckst commented 6 years ago

Maybe. If we ever get chunk reading to work, it would be worthwhile to benchmark. The scripts for the draft above will be in https://github.com/hpcanalytics/supplement-hpc-py-parallel-mdanalysis.

The things that really helped were (1) using an HDF5 format with parallel (MPI) reads of the HDF5 file, or (2) chopping the trajectory into multiple chunks and having each MPI rank read its own file. The ChainReader kind of works here but has the disadvantage that each rank's ChainReader still opens and closes all file chunks (it has to build its virtual frame map), at least that's my understanding, correct me if I am wrong. If you have 100 ranks and 100 chunks, that's 10,000 file-open operations more or less at the same time, and Lustre does not seem to like this.
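
The per-rank pattern looks roughly like this with mpi4py, assuming the trajectory has already been split into one file per rank; file names and the per-frame analysis are placeholders:

from mpi4py import MPI
import MDAnalysis as mda

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# each rank opens only its own pre-split chunk of the trajectory
u = mda.Universe("topol.tpr", "traj_chunk_{:03d}.xtc".format(rank))

local = sum(ts.positions.mean() for ts in u.trajectory)  # stand-in analysis

# combine the per-rank results on rank 0
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print(total)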

hmacdope commented 2 years ago

This is something @richardjgowers and I have been discussing a lot, with some ideas as to how this might be implemented and what it might look like, including the possible use of optional dependencies such as xarray or zarr. I will raise a new issue and link back to here.