The sampling interface allows the user to record voltage and current values at a user-specified location and frequency. The call to cell_.probe(...) below copies a single value from device to host memory each time it is called:
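Purely as an illustration of the cost pattern (a hypothetical sketch, not the actual cell_.probe implementation), such a per-sample read amounts to a latency-bound copy of a single double:

```c++
// Hypothetical sketch: reading one probed value per call forces an 8-byte
// device-to-host copy whose cost is dominated by transfer latency, not bandwidth.
#include <cstddef>
#include <cuda_runtime.h>

double read_probe(const double* device_values, std::size_t index) {
    double value = 0;
    cudaMemcpy(&value, device_values + index, sizeof(double),
               cudaMemcpyDeviceToHost);
    return value;
}
```

Buffering samples on the device and copying them in bulk once per integration period would amortize that latency.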
The event system has two calls that use host-device copies:
The test() call copies a single voltage value to host:
When an event is delivered, the net_receive method on the target synapse is called. Any access to the synapse state in net_receive involves an H2D copy, a D2H copy, or both, depending on whether the operation is a read, a write, or an update.
We have one type of stimulus, equivalent to NEURON's iclamp. When a stimulus injects a nonzero current, two D2H transfers and one H2D transfer are required:
This was a quick hack that tests whether the voltage at the soma of the first cell in each cell group is within a "reasonable" range after each internal time step of the cell_group. It can be disabled without much fuss, or replaced with a better sanity test.
Currently the only source of spikes in our models is spike detectors that are triggered when a voltage threshold is exceeded. The solution should handle spike sources in general:
- threshold detectors: the front end (cell description and thence cell_group) can describe which compartments/locations must be watched (with a threshold)
- event generators, e.g. sources that emit events via net_event()
- the use of net_send(), net_event(), and net_move() to modify events in the local event queue: this is a bit messy in NEURON, because it conflates local state machine transitions with external events. We should revisit how such state machines are implemented in NestMC.

For now we should focus on the first two (detectors and generators).
Implement them as point processes. They describe the injected current in nA, and hence use the same scaling factor to convert from current to current-density as synapses.
Before implementing this we should address bug #20, to first finalize the per-mechanism instance scaling factor (which will also have to be implemented for density channels to address the bug).
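As a unit check only (an illustrative sketch; the function and parameter names are assumptions, not the nestmc-proto code), the nA-to-current-density conversion a point-process stimulus would share with synapses is:

```c++
// Illustrative unit conversion: an injected current in nA contributes to the
// compartment current density in mA/cm^2.
//   1 nA / 1 um^2 = 1e-9 A / 1e-8 cm^2 = 0.1 A/cm^2 = 100 mA/cm^2
double current_density_contribution(double current_nA, double area_um2) {
    const double nA_per_um2_to_mA_per_cm2 = 100.0;
    return current_nA * nA_per_um2_to_mA_per_cm2 / area_um2;
}
```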
This is the most difficult, because it requires a queue on the backend side, which may have to be processed in parallel (e.g. on the GPU).
The event delivery has to be moved into the back end. The implementation of delivery for the multicore backend will not change much, unlike the GPU backend, which will require a bit of creativity!
The backend has to provide an interface for pre-processing events that can be called asynchronously.
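A minimal sketch of what such an interface could look like (hypothetical names and signatures, not the existing nestmc-proto API): events for the coming integration period are handed over once, pre-processing may run asynchronously, and per-step delivery is then driven entirely by the backend.

```c++
// Hypothetical backend event-queue interface (illustration only).
#include <cstdint>
#include <future>
#include <vector>

struct deliverable_event {
    double time;           // requested delivery time
    std::uint32_t target;  // index of the target mechanism instance
    float weight;          // connection weight
};

class backend_event_queue {
public:
    // Hand over all events for the coming integration period; the backend may
    // sort and partition them asynchronously, and copy them to the device once.
    std::future<void> enqueue(std::vector<deliverable_event> events);

    // Deliver every pending event with time <= t_until to its target.
    void deliver_until(double t_until);
};
```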
Proposed GPU solution
1. Sort all events with the following pseudo-code lambda:

       [](event l, event r){ return l.target==r.target ? l.time<r.time : l.target<r.target; }

   The events will then be ordered by target index, and by delivery time within each target.
2. Partition the sorted list into one partition per target.
3. Sort the delivery times to generate a total ordering of event delivery.
4. Copy the partition, the sorted events and the event delivery ordering to the GPU.
5. Before each time step, process all instances of a target in parallel, with one thread per target:

       e = next event for target
       if e.deliver_time <= t + next_delivery_interval
           target.net_receive(e)
       end

The value of next_delivery_interval is determined from the ordered list of event delivery times calculated in step 3 above.
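A sketch of how steps 1, 2 and 5 could fit together (all types and names below are assumptions for illustration, not the nestmc-proto backend; on the GPU the loop in deliver_for_step would run as one thread per target, over data copied to the device in step 4):

```c++
#include <algorithm>
#include <cstddef>
#include <vector>

struct event {
    std::size_t target;  // index of the target mechanism instance
    double time;         // requested delivery time
};

// Step 1: sort by target index, then by delivery time within each target.
void sort_events(std::vector<event>& events) {
    std::sort(events.begin(), events.end(), [](const event& l, const event& r) {
        return l.target == r.target ? l.time < r.time : l.target < r.target;
    });
}

// Step 2: partition boundaries; events for target t occupy [part[t], part[t+1]).
std::vector<std::size_t> partition_by_target(const std::vector<event>& events,
                                             std::size_t num_targets) {
    std::vector<std::size_t> part(num_targets + 1, events.size());
    for (std::size_t i = events.size(); i-- > 0;) {
        part[events[i].target] = i;                // first event index per target
    }
    for (std::size_t t = num_targets; t-- > 0;) {
        part[t] = std::min(part[t], part[t + 1]);  // empty range for idle targets
    }
    return part;
}

// Step 5: per time step, deliver the next pending event for each target if it
// falls inside the coming delivery interval. 'pos' is a per-target cursor,
// initialized to part[t] at the start of the integration period.
void deliver_for_step(const std::vector<event>& events,
                      const std::vector<std::size_t>& part,
                      std::vector<std::size_t>& pos,
                      double t, double next_delivery_interval) {
    for (std::size_t tgt = 0; tgt + 1 < part.size(); ++tgt) {  // parallel over targets
        std::size_t& p = pos[tgt];
        if (p < part[tgt + 1] && events[p].time <= t + next_delivery_interval) {
            // target.net_receive(events[p]) would be invoked here.
            ++p;
        }
    }
}
```

Because every target reads only its own partition and cursor, the delivery loop has no write conflicts between targets and maps naturally onto one GPU thread per target.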
Regarding the sampling memory transfers, it's not so much the call to cell_.probe(...)
that is the problem, but the way that cell_.probe(...)
is currently implemented. That is, there may be no need to change the interface between the cell group and lowered cell to address this problem.
The responsibility for the offending transfer for spike detection is shared between
https://github.com/eth-cscs/nestmc-proto/blob/97e17b187e68a3253fde6c668918605dea2f5d3b/src/spike_source.hpp#L29
and the implementation of detector_voltage(...)
on the lowered cell.
Note that we already have the interface for the front end (cell description and thence cell_group) to inform the lowered cell of the locations and thresholds for spike detection; the changes can be limited to the process of polling for generated spikes.
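One way to remove the per-step device-to-host voltage poll (a sketch under assumptions of my own, not the existing detector_voltage path) is to evaluate the thresholds where the voltages live and only return the crossings that actually occurred:

```c++
// Hypothetical sketch: per-detector threshold test done on the side that owns
// the voltages, recording (detector, time) pairs into a small buffer that can
// be copied to the host once per integration period.
#include <cstddef>
#include <vector>

struct crossing { std::size_t detector; double time; };

void test_thresholds(const std::vector<double>& voltage,      // per-compartment voltage
                     const std::vector<std::size_t>& cv,      // compartment watched by each detector
                     const std::vector<double>& threshold,    // per-detector threshold
                     std::vector<double>& previous,           // per-detector previous voltage
                     double t_prev, double t_now,
                     std::vector<crossing>& out)               // crossings found this step
{
    for (std::size_t i = 0; i < cv.size(); ++i) {              // one GPU thread per detector
        const double v_prev = previous[i];
        const double v_now  = voltage[cv[i]];
        if (v_prev < threshold[i] && v_now >= threshold[i]) {
            // Linear interpolation of the crossing time within the step.
            const double a = (threshold[i] - v_prev) / (v_now - v_prev);
            out.push_back({i, t_prev + a * (t_now - t_prev)});
        }
        previous[i] = v_now;
    }
}
```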
We should be clear that a spike in this context is a pre-synaptic, delayed delivery, one-to-many event. There may also be a need for triggering events directly on point mechanisms, akin to post-synaptic spike deliveries or to mechanism state changes currently described by net_event
and friends.
If, on the one hand, we keep the cell group driving the lowered cell one dt at a time, the cell group can keep using the same deliver_event method on the lowered cell, and the lowered cell can arrange back-end enqueuing when its advance() method is called. In this case, every event visible to the target would be eligible to be processed with net_receive().
On the other hand, we can pass all upcoming events to the back end when the cell group's enqueue_events
method is called, if we pass responsibility for integration over multiple time steps to the lowered cell, which in turn can implement a 'staggered' time for each cell in its domain. In this case, the target will need to consume only those events corresponding to its current integration time step, which in principle should align with the pending event times.
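To make the contrast concrete, hypothetical interfaces for the two schemes might look as follows (these signatures are for illustration only, not the current lowered cell API):

```c++
// Hypothetical contrast of the two schemes (illustrative signatures only).
#include <vector>

struct postsynaptic_event { double time; unsigned target; float weight; };

// Scheme 1: the cell group owns the time loop and drives the lowered cell dt by dt.
struct lowered_cell_stepped {
    void deliver_event(const postsynaptic_event& e);  // visible at the next step
    void advance(double dt);                          // integrate a single step
};

// Scheme 2: the lowered cell owns integration over a whole period; events are
// handed over up front, and each cell may advance with its own staggered time.
struct lowered_cell_integrated {
    void enqueue_events(std::vector<postsynaptic_event> events);
    void integrate_until(double t_final);
};
```

Moving the time loop into the lowered cell is what allows events to be staged on the device once per integration period rather than once per dt.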
I agree with you entirely.
The main bottleneck at the moment is copying event information from host to device on each time step. Copying events before they are needed at the start of each "integration period" avoids this.
This could be used for either cell_group-driven or lowered_cell-driven time stepping. I would try to implement the first approach, because it is closest to what we currently have, but in the long term I think that passing responsibility for time stepping to the lowered_cell is very promising.
The GPU backend has been validated against the multicore backend for all of the validation tests and the miniapp. However, these tests take much longer to run on the GPU than on the CPU.
The main reason for this is the many small copies, typically of a single floating point value, between host and device memory whenever samples are recorded, events are delivered, stimuli inject current, spikes are detected, or the sanity check runs, as described above.
These issues should be addressed before we start to benchmark and optimize the core GPU algorithms.
This task has two steps.
This task will not involve implementation of the design changes, which will be done in a later task after this analysis is complete.