IDSIA / brainstorm

Fast, flexible and fun neural networks.

Streams #31

Open · untom opened this issue 9 years ago

untom commented 9 years ago

Sooner or later, we should think about introducing CUDA streams for our GPU implementation. Side effect: looking at the profiling output across various examples, the most expensive call we make is usually the set_from_numpy call in the PyCudaHandler. We should be able to eliminate the cost of that call completely once we use streams, since the memory transfers can all be done asynchronously (and we could finally implement sensible double buffering on GPUs).
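
For what it's worth, a minimal PyCUDA sketch of such an asynchronous upload (buffer names are made up; the key detail is that the host array must be page-locked, otherwise the copy degrades to a synchronous one):

    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
    import pycuda.driver as drv

    copy_stream = drv.Stream()
    # page-locked (pinned) host memory is required for truly async copies
    host_buf = drv.pagelocked_empty((1024, 1024), dtype=np.float32)
    dev_buf = drv.mem_alloc(host_buf.nbytes)

    host_buf[:] = 1.0  # stand-in for the numpy data set_from_numpy uploads
    drv.memcpy_htod_async(dev_buf, host_buf, copy_stream)
    # ... kernels on other streams can run while the copy is in flight ...
    copy_stream.synchronize()  # block only once the data is actually needed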

I can think of two ways to add Streams:

  1. Specify a stream for each call: add an optional stream=None argument to all handler functions, so the caller can specify the stream on which to execute. When no stream is specified, we run on the default stream. We could pass either real CUDA streams or just stream IDs (integers). Calls would then maybe look like this:

       _h.dot_add_mm(dIa[t], x[t], dWi, transa=True, stream=_h.stream[1])
       _h.dot_add_mm(dFa[t], x[t], dWf, transa=True, stream=_h.stream[2])
       _h.dot_add_mm(dOa[t], x[t], dWo, transa=True, stream=_h.stream[3])
       _h.dot_add_mm(dZa[t], x[t], dWz, transa=True, stream=_h.stream[4])
       ...
       _h.synchronize_all_streams()
  2. Add a separate function for specifying streams:

       _h.set_stream(1)
       _h.dot_add_mm(dIa[t], x[t], dWi, transa=True)
       _h.set_stream(2)
       _h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
       _h.set_stream(3)
       _h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
       _h.set_stream(4)
       _h.dot_add_mm(dZa[t], x[t], dWz, transa=True)
       ...
       _h.synchronize_all_streams() 

In this short example, option 1 clearly looks better (IMO), but I can see option 2 working out nicely, too.
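
For illustration, a minimal handler-side sketch of option 1 (hypothetical names; PyCUDA's ElementwiseKernel already accepts a stream keyword, so the argument can be passed straight through):

    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
    import pycuda.driver as drv
    from pycuda.elementwise import ElementwiseKernel

    _add_kernel = ElementwiseKernel(
        "float *out, float *a, float *b", "out[i] = a[i] + b[i]", "add_tt")

    class PyCudaHandler(object):
        def __init__(self, num_streams=8):
            self.stream = [drv.Stream() for _ in range(num_streams)]

        def add_tt(self, a, b, out, stream=None):
            # stream=None runs on the CUDA default stream
            _add_kernel(out, a, b, stream=stream)

        def synchronize_all_streams(self):
            for s in self.stream:
                s.synchronize()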

Another thing to consider is that we might set up some rules about streams. For example, something like "outputs should always be computed on streams 0-4"... or maybe it even makes sense to have different streams for outputs, internals and parameters, so we know which ones we need to synchronize on before starting computations in a new layer (or not, IDK).

flukeskywalker commented 9 years ago

Some handlers might need multiple streams, so I guess it needs to be a list of stream IDs. _h.set_stream([]) can simply set the stream IDs and then return the handler. That way it will be:

    _h.set_stream(1).dot_add_mm(dIa[t], x[t], dWi, transa=True)
    _h.set_stream(2).dot_add_mm(dFa[t], x[t], dWf, transa=True)
    _h.set_stream(3).dot_add_mm(dOa[t], x[t], dWo, transa=True)
    _h.set_stream(4).dot_add_mm(dZa[t], x[t], dWz, transa=True)
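
A minimal sketch of that chaining (hypothetical names; the one essential detail is that set_stream returns self):

    class PyCudaHandler(object):
        def __init__(self, streams):
            self.streams = streams      # CUDA streams, indexed by id
            self.current_stream = None  # None = CUDA default stream

        def set_stream(self, stream_id):
            self.current_stream = self.streams[stream_id]
            return self  # returning self enables the chained call style
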
untom commented 9 years ago

Yeah, that looks nice!

Qwlouse commented 9 years ago

How about we (ab)use indexing notation for that:

    _h[1].dot_add_mm(dIa[t], x[t], dWi, transa=True)
    _h[2].dot_add_mm(dFa[t], x[t], dWf, transa=True)
    _h[3].dot_add_mm(dOa[t], x[t], dWo, transa=True)
    _h[4].dot_add_mm(dZa[t], x[t], dWz, transa=True)

If _h[0] returns a thin wrapper around the handler, you could even assign it to a name when several operations need to use the same stream:

    h1 = _h[1]
    h1.dot_add_mm(dFa[t], x[t], dW, transa=True)
    h1.dot_add_mm(dOa[t], x[t], dW, transa=True)
    h1.dot_add_mm(dZa[t], x[t], dW, transa=True)
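
A minimal sketch of what such an indexing wrapper might look like (hypothetical names; this assumes the handler ops accept a stream keyword as in option 1):

    class StreamView(object):
        """Thin wrapper that forwards handler ops with a fixed stream."""
        def __init__(self, handler, stream):
            self._handler = handler
            self._stream = stream

        def __getattr__(self, name):
            op = getattr(self._handler, name)
            # bind the wrapped stream into every forwarded call
            return lambda *args, **kw: op(*args, stream=self._stream, **kw)

    class PyCudaHandler(object):
        def __init__(self, streams):
            self.streams = streams  # e.g. a list of pycuda Stream objects

        def __getitem__(self, stream_id):
            return StreamView(self, self.streams[stream_id])
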
flukeskywalker commented 9 years ago

Another thing to keep in mind: it would be nice if streams could be specified for layers too. Then we could run layers in parallel.

Of course, just like one needs to know how many streams are used by an operation while writing a layer implementation, one would also need to know how many streams are used by a layer while building a network. This isn't too much to ask: the docs should take care of it ;)

untom commented 9 years ago

I don't like the abused indexing notation; it's a bit too unintuitive for someone who doesn't know the codebase well. I'd rather do something like

    h = _h.get_stream_handler(streamid=1)

where get_stream_handler() returns a subclass of PyCudaHandler that always operates on a specific stream.

Qwlouse commented 9 years ago

Ok, that's a fair point.

What I don't like about _h.set_stream(4).dot_add_mm(...) is that it actually sets the stream, i.e. it changes the state of the handler. So, for example, all of these would use stream 1:

    _h.set_stream(1).dot_add_mm(dIa[t], x[t], dWi, transa=True)
    _h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
    _h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
    _h.dot_add_mm(dZa[t], x[t], dWz, transa=True)

We could make some kind of with_stream function that returns a thin wrapper and use it like this:

    _h.with_stream(1).dot_add_mm(dIa[t], x[t], dWi, transa=True)
    _h.with_stream(2).dot_add_mm(dFa[t], x[t], dWf, transa=True)
    _h.with_stream(3).dot_add_mm(dOa[t], x[t], dWo, transa=True)
    _h.with_stream(4).dot_add_mm(dZa[t], x[t], dWz, transa=True)

But that of course implies some (small) overhead.

flukeskywalker commented 9 years ago

Alright, to summarize:

1 - We can add stream as an argument to all operations, but then we'd be adding it to other handlers which may not use streams, so it's a bit weird.

2 - We can use set_streams() without returning anything. Then we'd do

    _h.set_streams([1])
    _h.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
    _h.set_streams([2])
    _h.dot_mm(flat_dH, flat_input, out=dW, transa=True)
    _h.sum_t(flat_dH, axis=0, out=dbias)  # runs on stream 2

This option means that (2a) one still needs to know how many streams an operation may use internally, and (2b) the handler becomes stateful: every call after set_streams() runs on the last streams set (note how sum_t above ends up on stream 2).

3 - We can use _h.with_streams([...]) to return a wrapper which provides access to those streams. This option retains issue 2a but is better wrt issue 2b:

    _h.with_streams([1]).dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
    _h.with_streams([2]).dot_mm(flat_dH, flat_input, out=dW, transa=True)
    _h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream

We should pick one and start working on it.

Qwlouse commented 9 years ago

Option 4:

    with _h.streams(1):
        _h.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
    with _h.streams(2):
        _h.dot_mm(flat_dH, flat_input, out=dW, transa=True)
        _h.sum_t(flat_dH, axis=0, out=dbias)
    _h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream
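
A minimal sketch of option 4 (hypothetical names; the context manager temporarily redirects the handler's current stream and restores it on exit, even if an exception is raised):

    from contextlib import contextmanager

    class PyCudaHandler(object):
        def __init__(self, cuda_streams):
            self._streams = cuda_streams
            self.current_stream = None  # None = CUDA default stream

        @contextmanager
        def streams(self, stream_id):
            previous = self.current_stream
            self.current_stream = self._streams[stream_id]
            try:
                yield
            finally:
                self.current_stream = previous  # restore on exit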

Considering issue 2a, we could do the following: say the handler internally uses 15 streams (0-14), but we group them into five groups of 3 streams each: [(0, 1, 2), (3, 4, 5), ...]. So when you set a stream in the layer code, it is really a group of 3 streams. With these numbers, an operation could internally use up to 3 streams, and for implementing layers you could use 5 groups of streams, as in the sketch below.
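
A tiny sketch of that mapping, with the numbers from the example above (purely illustrative):

    NUM_GROUPS = 5
    STREAMS_PER_GROUP = 3

    def streams_in_group(group_id):
        """Map a layer-level group id to its internal stream ids."""
        base = group_id * STREAMS_PER_GROUP
        return tuple(range(base, base + STREAMS_PER_GROUP))

    # group 0 -> (0, 1, 2), group 1 -> (3, 4, 5), ..., group 4 -> (12, 13, 14)
    assert streams_in_group(4) == (12, 13, 14)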

flukeskywalker commented 8 years ago

@TimDettmers, this issue may be of interest.

TimDettmers commented 8 years ago

I will look into this and double buffering after I have taken a closer look at the codebase and the PyCUDA API. Double buffering is a bit more complicated, because even with streams, host -> GPU copies have synchronous parts (a copy is only truly asynchronous if the host memory is page-locked).
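
For reference, a double-buffering sketch in PyCUDA (hypothetical buffer names; assumes page-locked host buffers so the copies really are asynchronous, and uses events so each stream waits only for the work it depends on):

    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
    import pycuda.driver as drv

    copy_stream, compute_stream = drv.Stream(), drv.Stream()
    # two pinned host buffers and two device buffers, used alternately
    host = [drv.pagelocked_empty((1 << 20,), np.float32) for _ in range(2)]
    dev = [drv.mem_alloc(host[0].nbytes) for _ in range(2)]
    copied = [drv.Event() for _ in range(2)]
    done = [drv.Event() for _ in range(2)]

    for i in range(8):  # 8 chunks as a stand-in for minibatches
        b = i % 2
        copied[b].synchronize()   # host[b] is free again (copy finished)
        host[b][:] = i            # stand-in for filling the next minibatch
        copy_stream.wait_for_event(done[b])  # dev[b] is no longer being read
        drv.memcpy_htod_async(dev[b], host[b], copy_stream)
        copied[b].record(copy_stream)
        compute_stream.wait_for_event(copied[b])
        # ... enqueue kernels that read dev[b] on compute_stream here ...
        done[b].record(compute_stream)
    compute_stream.synchronize()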

flukeskywalker commented 8 years ago

Great! Let us know if you need any clarifications. There is some restructuring of layers going on in a branch right now, but this does not affect the overall architecture and philosophy.

untom commented 8 years ago

Coming back to this: I like option 3 best. The problem with option 4 is that it gets too wordy too quickly, especially considering that you'll often want to interleave ops on different streams. The initial example would become:

    with _h.streams(1):
        _h.dot_add_mm(dIa[t], x[t], dWi, transa=True)
    with _h.streams(2):
        _h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
    with _h.streams(3):
        _h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
    with _h.streams(4):
        _h.dot_add_mm(dZa[t], x[t], dWz, transa=True)

which doubles the line-count AND adds a lot of indentation.

flukeskywalker commented 8 years ago

I agree. I don't have much experience with streams, but @TimDettmers shared some thoughts recently which seemed to suggest that streams won't buy us much except in special cases, since the GPU already executes ops concurrently when it can. @TimDettmers: comments? EDIT: the above does not appear to be true based on a quick look around; perhaps I misunderstood what was said.

Qwlouse commented 8 years ago

I think this should be post-release. It is important, so it shouldn't be rushed. Let's set up a benchmarking suite first and do a little profiling.

WRT option 3 vs. option 4: those are actually not exclusive. If with_streams constructs a wrapper anyway, we could allow both:

    _h.with_streams([1]).dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
    _h.with_streams([2]).dot_mm(flat_dH, flat_input, out=dW, transa=True)
    _h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream

    with _h.with_streams([1]) as h1:
        h1.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
        h1.dot_mm(flat_dH, flat_input, out=dW, transa=True)
        h1.sum_t(flat_dH, axis=0, out=dbias)
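
A sketch of a wrapper that supports both styles (hypothetical names; assumes the underlying ops take a streams keyword):

    class StreamView(object):
        """Thin wrapper usable inline or as a context manager."""
        def __init__(self, handler, stream_ids):
            self._handler = handler
            self._stream_ids = stream_ids

        def __getattr__(self, name):
            op = getattr(self._handler, name)
            return lambda *args, **kw: op(*args, streams=self._stream_ids, **kw)

        def __enter__(self):
            return self   # enables `with _h.with_streams([1]) as h1:`

        def __exit__(self, exc_type, exc_value, traceback):
            return False  # don't suppress exceptions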