INM-6 / hybridLFPy

Biophysics-based prediction of LFPs from point-neuron networks
GNU General Public License v3.0

Test out Arbor as simulation backend #67

Open espenhgn opened 3 years ago

espenhgn commented 3 years ago

Is your feature request related to a problem? Please describe. Replacing LFPy with Arbor+LFPykit for simulations of passive multicompartment neuron models with current-based synapses in Population.cellsim() ought to be relatively low-hanging fruit, and could speed up simulations substantially compared with the equivalent simulations now relying on LFPy+NEURON (in particular with GPU access).
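To make the idea concrete, below is a minimal, hedged sketch of the LFPykit forward-model step such a backend swap would rely on. The geometry, electrode positions, and transmembrane currents are made-up placeholders; in practice the current matrix would come from an Arbor simulation of the passive cell:

import numpy as np
import lfpykit

# toy geometry: a straight cable split into 4 segments (coordinates in um)
x = np.array([[0., 25.], [25., 50.], [50., 75.], [75., 100.]])
y = np.zeros_like(x)
z = np.zeros_like(x)
d = np.ones(4)  # segment diameters (um)
cell = lfpykit.CellGeometry(x=x, y=y, z=z, d=d)

# line-source forward model for a handful of extracellular electrode sites
lsp = lfpykit.LineSourcePotential(
    cell=cell,
    x=np.linspace(0., 100., 5),
    y=np.full(5, 20.),   # 20 um lateral offset from the cable
    z=np.zeros(5),
    sigma=0.3,           # extracellular conductivity (S/m)
)
M = lsp.get_transformation_matrix()  # shape (n_electrodes, n_segments)

# imem stands in for the (n_segments, n_timesteps) transmembrane currents (nA)
# that would be obtained from the Arbor simulation
imem = np.random.randn(x.shape[0], 100)
V_ex = M @ imem  # extracellular potentials (mV)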

Steps:

Notes/possible issues that may occur:

espenhgn commented 3 years ago

Back-conversion of morphologies to .swc format should be doable using https://github.com/JustasB/hoc2swc
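For reference, a minimal usage sketch, assuming the API documented in that repository (file names are placeholders):

from hoc2swc import hoc2swc

# convert a NEURON .hoc morphology to .swc (paths are placeholders)
hoc2swc("cell_template.hoc", "cell_template.swc")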

espenhgn commented 3 years ago

Added the file Example_Arbor_swc_hybridLFPy.ipynb (https://github.com/INM-6/hybridLFPy/blob/fix-67/examples/Example_Arbor_swc_hybridLFPy.ipynb) addressing point 2 above.

espenhgn commented 3 years ago

Single-node scaling test on jureca-dc comparing Arbor and the usual hybridLFPy implementation: [figure: single-node benchmark results] The "run" times correspond to the total time spent per task repeatedly calling arbor.simulation.run() and the equivalent LFPy.Cell.simulate(). The "pop" times include additional overhead such as setting up the cells, determining synapse locations and activation times, computing LFPs, etc. The corresponding files are at https://github.com/INM-6/hybridLFPy/tree/fix-67/examples/benchmarks, where example_brunel_arbor.py is the Arbor+LFPykit implementation of example_brunel.py using the "old" hybridLFPy. Output is comparable, and the numbers of compartments/CVs are comparable between the NEURON/LFPy and Arbor simulations since the max segment/CV length is set to the same value.
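For clarity, the per-task "run" time is accumulated roughly as in the following hypothetical sketch, where cellsim() stands in for a call to arbor.simulation.run() or LFPy.Cell.simulate():

import time

def cellsim(cellindex):
    # placeholder for one single-cell simulation (arbor.simulation.run()
    # or LFPy.Cell.simulate() plus the LFP forward model)
    time.sleep(0.01)

run_time = 0.0
for cellindex in range(16):  # hypothetical set of cells handled by this task
    t0 = time.time()
    cellsim(cellindex)
    run_time += time.time() - t0
print(f'total "run" time for this task: {run_time:.3f} s')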

espenhgn commented 3 years ago

Multi-node performance testing with increased network size and a longer simulation duration (2000 ms), two (virtual) CPUs per MPI task, and multithreaded execution with Arbor via

        context = arbor.context(2, None)  # 2 threads per MPI task, no GPU
        domains = arbor.partition_load_balance(recipe, context)  # map the recipe's cells onto the local hardware

Arbor (https://github.com/espenhgn/arbor/tree/isyn) config:

{'mpi': False,
 'mpi4py': False,
 'gpu': False,
 'vectorize': True,
 'version': '0.5.3-dev',
 'source': '2021-07-31T15:18:36+02:00 9503801c293b6933e2b186d8fbf7a1245366e876',
 'arch': 'znver1'}

[figure: multi-node benchmark results]

bcumming commented 3 years ago

Hello @espenhgn.

Am I correct that the following steps are performed for simulating a model with NCELLS cells:

Step 1 initialise MPI

Step 2 Create and run a NEST Brunel model with NCELLS cells to generate input spike information for detailed cell models

Step 3

Given the number of cells per MPI rank is NLOCAL_CELLS = NCELLS / NRANKS,
for each cable cell assigned to this rank perform a single-cell-model simulation with Arbor.

Then I have the following observations:

And I have the following questions:

espenhgn commented 3 years ago

Hi @bcumming

Hello @espenhgn.

Am I correct that the following steps are performed for simulating a model with NCELLS cells:

Step 1 initialise MPI

Yes, indeed.

Step 2 Create and run a Nest Brunel model with NCELLS cells to generate input spike information for detailed cell models

Yes, that's the general idea behind hybridLFPy: simulations of network activity (spikes) are performed before the detailed multicompartment-cell simulations needed for the extracellular signals. For the sake of simplicity, the arbor_reproducer.py script creates random synaptic activation times for the Arbor simulation part using scipy.stats, which are subsequently fed to the class Recipe instance around here: https://github.com/INM-6/hybridLFPy/blob/ba6956be902d1939f42966e43801494aeca1fbe8/examples/benchmarks/arbor_reproducer.py#L317 (leaving NEST out of the equation altogether). A rough sketch of that step is shown below.
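A hedged sketch of that step (not the actual arbor_reproducer.py code; the numbers and names are illustrative):

import numpy as np
import scipy.stats as st
import arbor

t_stop = 2000.     # simulation duration (ms)
n_synapses = 100   # hypothetical number of synapses on one cell
rate = 5.          # mean activation rate per synapse (spikes/s)

# Poisson-like activation times: cumulative sums of exponential intervals
activation_times = []
for i in range(n_synapses):
    isi = st.expon.rvs(scale=1000. / rate, size=50, random_state=i)
    times = np.cumsum(isi)
    activation_times.append(times[times < t_stop])

# each list of times can be wrapped in an explicit schedule that a
# recipe's event_generators() method may return to drive the synapses
schedules = [arbor.explicit_schedule(t.tolist()) for t in activation_times]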

Step 3

Given the number of cells per MPI rank is NLOCAL_CELLS = NCELLS / NRANKS, for each cable cell assigned to this rank perform a single-cell-model simulation with Arbor.

Yes, each rank runs its own subset of cells, distributed in a round-robin manner. If a rank is assigned more than one cell (i.e., NCELLS > NRANKS), the cell simulations on that rank are performed in succession. The ArborPopulation.run() method (https://github.com/INM-6/hybridLFPy/blob/ba6956be902d1939f42966e43801494aeca1fbe8/examples/benchmarks/arbor_reproducer.py#L343) simply iterates over the ArborPopulation.cellsim(cellindex) method (https://github.com/INM-6/hybridLFPy/blob/ba6956be902d1939f42966e43801494aeca1fbe8/examples/benchmarks/arbor_reproducer.py#L199); see the sketch below.
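In other words, something along these lines (function and variable names here are placeholders, not the actual hybridLFPy code):

from mpi4py import MPI

COMM = MPI.COMM_WORLD
RANK = COMM.Get_rank()
SIZE = COMM.Get_size()

NCELLS = 1024  # hypothetical population size

def cellsim(cellindex):
    # placeholder for one single-cell Arbor simulation + LFP forward model
    pass

def run(ncells):
    # round-robin distribution: rank r handles cells r, r + SIZE, r + 2*SIZE, ...
    for cellindex in range(RANK, ncells, SIZE):
        cellsim(cellindex)  # cells on a rank are simulated in succession

run(NCELLS)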

Then I have the following observations:

  • the arbor.simulation object only contains a model with 1 cell, so multithreading isn't required. The CPU back end of Arbor parallelises cells over threads, so a one-cell model will only run on one thread.

Ok, thanks for the confirmation. I was curious whether or not Arbor somehow distributes single-cell simulations across threads, analogous to NEURON's multisplit method.

  • The most efficient way to run would be one MPI rank per physical core, with arbor.context(1, None).
    • You are running on Juwels-DC, which has 2 sockets of 64 core AMD EPYC CPUs

I'm running on Jureca-DC as well as JUSUF, both in Jülich. Both use two 64-core AMD EPYC CPUs per node.

  • so you are running 128 MPI ranks per node, where each rank has 2 "hyperthreaded" (virtual) cores.

Yes, for the benchmark curves above at most 128 MPI ranks are created per node.

And I have the following questions:

  • what is the srun/sbatch parameters you use?

These are typically as follows (see run_arbor_reproducer.py: https://github.com/INM-6/hybridLFPy/blob/fix-67/examples/benchmarks/run_arbor_reproducer.py):

#!/bin/bash
##################################################################
#SBATCH --account $ACCOUNT
#SBATCH --job-name <MD5>
#SBATCH --time 0:04:01
#SBATCH -o logs/<MD5>.txt
#SBATCH -e logs/<MD5>_error.txt
#SBATCH --ntasks NTASKS
#SBATCH --cpus-per-task NTHREADS
##################################################################
# from here on we can run whatever command we want
# unset DISPLAY # DISPLAY somehow problematic with Slurm
srun --mpi=pmi2 python -u arbor_reproducer.py <MD5>

The NTHREADS value was also passed to arbor.context() in the simulation script (but, as you confirmed, this should rather remain at its default value of 1).

The <MD5> value is just a hash value I use to keep track of unique parameter combinations for each individual simulation in this case.
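For illustration, such an identifier could be generated along these lines (a hypothetical sketch, not the actual job-submission code):

import hashlib
import json

# hypothetical parameter combination for one simulation
params = dict(NTASKS=512, NTHREADS=2, t_sim=2000.0)

# unique, reproducible job identifier derived from the parameter values
md5 = hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()
print(md5)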

  • does NTASKS in the plots correspond to the number of MPI ranks?

Yes.

  • If my assumptions above are correct, NTASKS=2^9 would run on 4 nodes?

Yes, for the curves shown above in this issue (2^9 = 512 ranks at 128 ranks per node corresponds to 4 nodes).