arbor-sim / arbor

The Arbor multi-compartment neural network simulation library.
https://arbor-sim.org
BSD 3-Clause "New" or "Revised" License

Early GPU Performance Analysis #87

Closed bcumming closed 7 years ago

bcumming commented 7 years ago

The GPU backend has been validated against the multicore backend for all of the validation tests and the miniapp. However, these tests take much longer to run on the GPU than on the CPU.

The main reason for this is the many small copies, typically of a single floating point value, between host and device memory whenever state stored on the device is read or written from the host.

These issues should be addressed before we start to benchmark and optimize the core GPU algorithms.

This task has two steps:

  1. make a detailed list of all locations where these transfers affect performance
  2. propose some designs for removing the memcopies

This task will not involve implementation of the design changes, which will be done in a later task after this analysis is complete.

bcumming commented 7 years ago

1: Taxonomy of host-device copies

Sampling

The sampling interface allows the user to record voltage and current values at a user-specified location at a user-specified frequency. The call to cell_.probe(...) below copies a single value from device to host memory each time it is called:

https://github.com/eth-cscs/nestmc-proto/blob/97e17b187e68a3253fde6c668918605dea2f5d3b/src/cell_group.hpp#L96
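For illustration, each sample amounts to roughly the following (a sketch, not the actual implementation; probe_voltage and voltage_dev are hypothetical names):

    #include <cstddef>
    #include <cuda_runtime.h>

    // sketch: sampling one probed value forces a one-word D2H copy,
    // and the cudaMemcpy also synchronizes with in-flight kernels
    double probe_voltage(const double* voltage_dev, std::size_t i) {
        double v;
        cudaMemcpy(&v, voltage_dev + i, sizeof(double), cudaMemcpyDeviceToHost);
        return v;
    }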

Events

The event system has two calls that use host-device copies:

Spike detection

The test() call copies a single voltage value to the host:

https://github.com/eth-cscs/nestmc-proto/blob/97e17b187e68a3253fde6c668918605dea2f5d3b/src/cell_group.hpp#L121

Event delivery

https://github.com/eth-cscs/nestmc-proto/blob/97e17b187e68a3253fde6c668918605dea2f5d3b/src/cell_group.hpp#L129

When an event is delivered, the net_receive method on the target synapse is called. Any access to the synapse state in net_receive involves a host-to-device (H2D) copy, a device-to-host (D2H) copy, or both, depending on whether the operation is a read, a write, or an update.
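For the update case, delivering a single event from the host requires a full round trip. A sketch (the real net_receive is generated from the mechanism description; deliver_on_host and g_dev are illustrative names):

    #include <cstddef>
    #include <cuda_runtime.h>

    // sketch: host-side delivery of one event to synapse instance i,
    // with the state g_dev living in device memory
    void deliver_on_host(double* g_dev, std::size_t i, double weight) {
        double g;
        cudaMemcpy(&g, g_dev + i, sizeof(double), cudaMemcpyDeviceToHost);  // read (D2H)
        g += weight;                                                        // net_receive body on host
        cudaMemcpy(g_dev + i, &g, sizeof(double), cudaMemcpyHostToDevice);  // write back (H2D)
    }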

Stimuli

We have one type of stimulus, equivalent to NEURON's IClamp. When a stimulus injects a nonzero current, two D2H transfers and one H2D transfer are required:

https://github.com/eth-cscs/nestmc-proto/blob/97e17b187e68a3253fde6c668918605dea2f5d3b/src/fvm_multicell.hpp#L667

Testing for physically realistic solution

This was a quick hack that tests whether the voltage at the soma of the first cell in each cell group is within a "reasonable" range after each internal time step of the cell_group. This can be disabled without much fuss (or replaced with a better sanity test).

https://github.com/eth-cscs/nestmc-proto/blob/97e17b187e68a3253fde6c668918605dea2f5d3b/src/cell_group.hpp#L93

bcumming commented 7 years ago

Proposed solution : Spike Detection

Currently the only source of spikes in our models is spike detectors that are triggered when a voltage threshold is exceeded. The solution should handle spike sources in general:

  1. Spike detectors
  2. Spike generators etc.
  3. net_event()

summary

For now we should focus on the first two (detectors and generators).

bcumming commented 7 years ago

Proposed solution: Stimuli

Implement them as point processes. They describe the injected current in nA, and hence can use the same scaling factor as synapses to convert from current to current density.

Before implementing this we should address bug #20, to first finalize the per-mechanism-instance scaling factor (which will also have to be implemented for density channels to address that bug).
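For concreteness, a multicore-flavoured sketch of a stimulus as a point process could look like this (all names and types are illustrative, not the code base's actual mechanism interface):

    #include <cstddef>
    #include <vector>

    // sketch: stimulus as a point process; current contribution is
    // computed on the backend, so no host-device copies are needed
    struct stimulus {
        std::vector<std::size_t> cv_;                       // compartment of each instance
        std::vector<double> delay_, duration_, amplitude_;  // amplitude in nA
        std::vector<double> weight_;                        // per-instance nA -> current-density scale

        void compute_currents(double t, std::vector<double>& current_density) const {
            for (std::size_t k = 0; k < cv_.size(); ++k) {
                if (t >= delay_[k] && t < delay_[k] + duration_[k]) {
                    // injected current is subtracted from the membrane
                    // current, with the same scaling as a synapse
                    current_density[cv_[k]] -= weight_[k]*amplitude_[k];
                }
            }
        }
    };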

bcumming commented 7 years ago

Proposed Solution: event delivery

This is the most difficult, because it requires a queue on the backend side, which may have to be processed in parallel (e.g. on the GPU).

The event delivery has to be moved into the back end. The implementation of delivery for the multicore backend will not change much; the GPU backend, however, will require a bit of creativity!

The backend has to provide an interface for pre-processing events that can be called asynchronously.
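A hypothetical shape for such an interface (the names and the event type are illustrative, not the actual API):

    #include <cstddef>
    #include <vector>

    struct event { std::size_t target; double time; float weight; };

    struct backend_events {
        // called once per integration period, off the critical path;
        // on the GPU backend this is where sorting, partitioning and
        // the single bulk H2D copy would happen
        virtual void enqueue(std::vector<event> staged) = 0;

        // called before each time step: deliver all pending events with
        // delivery time <= t, entirely on the backend's side
        virtual void deliver(double t) = 0;

        virtual ~backend_events() = default;
    };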

GPU proposed solution

sorting events on host

step 1

Sort all events using the following comparator (written as a C++-style pseudo-code lambda):

    [](event l, event r){ return l.target==r.target ? l.time<r.time : l.target<r.target; }

The events will be sorted by target index, then by delivery time within each target.

step 2

Partition the sorted list into one partition per target.

step 3

Sort the delivery times to generate a total ordering of event delivery.

step 4

Copy the partition, the sorted events, and the event delivery ordering to the GPU.
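Putting steps 1-4 together, a host-side sketch could look like the following (the event type and all names are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct event { std::size_t target; double time; float weight; };

    void stage_events(std::vector<event>& ev, std::size_t num_targets) {
        // step 1: sort by target index, then by delivery time within a target
        std::sort(ev.begin(), ev.end(),
            [](const event& l, const event& r) {
                return l.target==r.target ? l.time<r.time : l.target<r.target;
            });

        // step 2: one half-open range [part[t], part[t+1]) per target
        std::vector<std::size_t> part(num_targets+1, 0);
        for (const auto& e: ev) ++part[e.target+1];
        for (std::size_t t = 0; t < num_targets; ++t) part[t+1] += part[t];

        // step 3: sorted delivery times give the total ordering used to
        // compute next_delivery_interval on the device
        std::vector<double> delivery(ev.size());
        std::transform(ev.begin(), ev.end(), delivery.begin(),
                       [](const event& e) { return e.time; });
        std::sort(delivery.begin(), delivery.end());

        // step 4: copy ev, part and delivery to the GPU in three bulk
        // H2D transfers (e.g. cudaMemcpy), replacing per-event copies
    }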

on the GPU

Before each time step, process all targets in parallel, with one thread per target:

    e = next event for target
    if e.deliver_time <= t + next_delivery_interval
        target.net_receive(e)
    end

The value of next_delivery_interval is determined from the ordered list of event delivery times calculated in step 4 above.
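As a CUDA sketch of this step (the event type and buffer names follow the host-side sketch above; the scalar g standing in for the synapse state is likewise illustrative):

    #include <cstddef>

    struct event { std::size_t target; double time; float weight; };

    // one thread per target; ev/part are the buffers staged in step 4,
    // and next[t] tracks target t's position within its partition
    // (initialized to part[t] at the start of the integration period)
    __global__ void deliver_events(const event* ev, const std::size_t* part,
                                   std::size_t* next, std::size_t num_targets,
                                   double t, double next_delivery_interval,
                                   double* g) // stand-in for per-target synapse state
    {
        std::size_t tid = blockIdx.x*blockDim.x + threadIdx.x;
        if (tid >= num_targets) return;

        std::size_t i = next[tid];
        if (i < part[tid+1] && ev[i].time <= t + next_delivery_interval) {
            g[tid] += ev[i].weight; // stand-in for target.net_receive(e)
            next[tid] = i + 1;
        }
    }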

halfflat commented 7 years ago

Review

Taxonomy

Regarding the sampling memory transfers, it's not so much the call to cell_.probe(...) that is the problem, but the way that cell_.probe(...) is currently implemented. That is, there may be no need to change the interface between the cell group and lowered cell to address this problem.
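For example (a sketch, assuming the lowered cell can stage sampled values in a device-side buffer during integration), the interface can stay the same while the per-sample copy is replaced by one bulk transfer:

    #include <cstddef>
    #include <vector>
    #include <cuda_runtime.h>

    // sketch: device-side staging behind an unchanged probe interface
    struct sample_store {
        double* staged_dev = nullptr;   // filled by the integration kernels
        std::vector<double> host;       // refreshed once per integration period

        void refresh(std::size_t n) {
            host.resize(n);
            // one bulk D2H copy replaces n single-value copies
            cudaMemcpy(host.data(), staged_dev, n*sizeof(double),
                       cudaMemcpyDeviceToHost);
        }

        double probe(std::size_t i) const { return host[i]; }
    };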

The responsibility for the offending transfer for spike detection is shared between https://github.com/eth-cscs/nestmc-proto/blob/97e17b187e68a3253fde6c668918605dea2f5d3b/src/spike_source.hpp#L29 and the implementation of detector_voltage(...) on the lowered cell.

Spike generation

Note that we already have the interface for the front end (cell description and thence cell_group) to inform the lowered cell of the locations and thresholds for spike detection; the change can be limited to the process of polling for generated spikes.

We should be clear that a spike in this context is a pre-synaptic, delayed delivery, one-to-many event. There may also be a need for triggering events directly on point mechanisms, akin to post-synaptic spike deliveries or to mechanism state changes currently described by net_event and friends.

Event delivery

If on one hand we keep the cell group driving the lowered cell one dt at a time, the cell group can keep using the same deliver_event method on the lowered cell, and the lowered cell can arrange back end enqueuing when its advance() method is called. In this instance, every event visible to the target would be eligible for being processed with net_receive().

On the other hand, we can pass all upcoming events to the back end when the cell group's enqueue_events method is called, if we pass responsibility for integration over multiple time steps to the lowered cell, which in turn can implement a 'staggered' time for each cell in its domain. In this case, the target will need to consume only those events corresponding to its current integration time step, which in principle should align with the pending event times.
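The two options differ mainly in where the time loop lives. As an interface sketch (all names hypothetical):

    #include <cstddef>
    #include <vector>

    struct event { std::size_t target; double time; float weight; };

    // option 1: the cell group keeps driving the lowered cell one dt at
    // a time; deliver_event hands events to the backend, and advance()
    // delivers everything pending before stepping
    struct lowered_cell_per_dt {
        void deliver_event(const event& e);  // backend-side enqueue
        void advance(double dt);             // deliver pending events, then step
    };

    // option 2: the lowered cell integrates over many steps, with a
    // possibly staggered local time per cell, consuming only the events
    // whose delivery time falls within each cell's current step
    struct lowered_cell_integrator {
        void enqueue_events(std::vector<event> upcoming); // one bulk handoff
        void integrate(double t_final);                   // internal stepping
    };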

bcumming commented 7 years ago

Spike Generation

I agree with you entirely.

bcumming commented 7 years ago

Event Delivery

The main bottleneck at the moment is copying event information from host to device on each time step. Copying events before they are needed at the start of each "integration period" avoids this.

This could be used for either cell_group-driven or lowered_cell-driven time stepping. I would try to implement the first approach, because it is closest to what we currently have. But long term, I think that passing responsibility for time stepping to the lowered_cell is very promising.