AIND Ephys Pipeline with Kilosort2.5


Electrophysiology analysis pipeline using Kilosort2.5 via SpikeInterface.

The pipeline is based on Nextflow and it includes the following steps:

Each step is run in a container and can be deployed on several platforms. See the Local deployment and SLURM deployment sections for more details.


Currently, the pipeline supports the following input data types:

For more information on how to select the input mode and set additional parameters, see the Local deployment - Additional parameters section.


The output of the pipeline is saved to the RESULTS_PATH. Since the output is produced using SpikeInterface, it is recommended to go through its documentation to understand how to easily load and interact with the data.

The output includes the following files and folders:


This folder contains the output of preprocessing, including the preprocessed JSON files associated with each stream and the motion folders containing the estimated motion. The preprocessed JSON files can be used to re-instantiate the recordings, provided that the raw data folder is mapped to the same location as the input of the pipeline.

In this case, the preprocessed recording can be loaded as a spikeinterface.BaseRecording with:

import spikeinterface as si

recording_preprocessed = si.load_extractor("path-to-preprocessed.json", base_folder="path-to-raw-data-parent")

The motion folders can be loaded as:

import spikeinterface.preprocessing as spre

motion = spre.load_motion("path-to-motion-folder")

They include the motion, temporal_bins, and spatial_bins fields, which can be used to visualize the estimated motion.


This folder contains the raw spike sorting outputs from Kilosort2.5 for each stream.

It can be loaded as a spikeinterface.BaseSorting with:

import spikeinterface as si

sorting_raw = si.load_extractor("path-to-spikesorted-folder")


This folder contains the output of the post-processing for each stream. It can be loaded as a spikeinterface.WaveformExtractor with:

import spikeinterface as si

waveform_extractor = si.load_waveforms("path-to-postprocessed-folder", with_recording=False)

The waveform_extractor includes many computed extensions. This example shows how to load some of them:

# reuse the waveform_extractor object loaded above
unit_locations = waveform_extractor.load_extension("unit_locations").get_data()
# unit_locations is a np.array with the estimated locations

qm = waveform_extractor.load_extension("quality_metrics").get_data()
# qm is a pandas.DataFrame with the computed quality metrics


This folder contains the curated spike sorting outputs, after unit deduplication, quality-metric curation and automatic unit classification.

It can be loaded as a spikeinterface.BaseSorting with:

import spikeinterface as si

sorting_curated = si.load_extractor("path-to-curated-folder")

The sorting_curated object contains the following curation properties (which can be retrieved with sorting_curated.get_property(property_name)):


This folder contains the generated NWB files.


This JSON file contains the generated Figurl links for each stream, including a timeseries and a sorting_summary view.
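As an illustration, the links could be listed with a few lines of Python. This is a minimal sketch, not part of the pipeline: it assumes the file maps each stream name to a dictionary with timeseries and sorting_summary entries, which is inferred from the description above rather than from a documented schema. The helper name list_figurl_links, the stream name, and the URLs are placeholders.

```python
import json


def list_figurl_links(json_text):
    """Flatten {stream: {view: url}} into sorted (stream, view, url) tuples.

    The nested layout is an assumption about the file, not a documented schema.
    """
    links = json.loads(json_text)
    rows = []
    for stream_name, views in links.items():
        for view_name, url in views.items():
            rows.append((stream_name, view_name, url))
    return sorted(rows)


# Made-up content standing in for the real file (placeholder stream and URLs)
example = json.dumps(
    {
        "stream_0": {
            "timeseries": "https://example.org/timeseries",
            "sorting_summary": "https://example.org/sorting_summary",
        }
    }
)

for stream_name, view_name, url in list_figurl_links(example):
    print(f"{stream_name}\t{view_name}\t{url}")
```

To use it on real output, read the JSON file from the results folder and pass its text to the helper. If the actual layout differs, only the two loops need adjusting.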


This JSON file logs all the processing steps, parameters, and execution times.


All files generated by Nextflow are saved here.


Some steps of the pipeline accept additional parameters, which can be passed as follows:

--{step_name}_args "{args}"

The steps that accept additional arguments are:


  --concatenate         Whether to concatenate recordings (segments) or not. Default: False
  --input {aind,spikeglx,nwb}
                        Which 'loader' to use. Default 'aind'


  --debug               Whether to run in DEBUG mode
  --denoising {cmr,destripe}
                        Which denoising strategy to use. Can be 'cmr' or 'destripe'. Default 'cmr'
                        Whether to remove out channels
                        Whether to remove bad channels
  --max-bad-channel-fraction MAX_BAD_CHANNEL_FRACTION
                        Maximum fraction of bad channels to remove. If more than this fraction, processing is skipped
  --motion {skip,compute,apply}
                        How to deal with motion correction. Can be 'skip', 'compute', or 'apply'. Default 'compute'
  --motion-preset {nonrigid_accurate,kilosort_like,nonrigid_fast_and_accurate}
                        What motion preset to use. Can be 'nonrigid_accurate', 'kilosort_like', or 'nonrigid_fast_and_accurate'. Default "nonrigid_fast_and_accurate"
  --debug-duration DEBUG_DURATION
                        Duration of clipped recording in debug mode. Default is 30 seconds. Only used if debug is enabled


  --backend {hdf5,zarr}
                        NWB backend. It can be either 'hdf5' or 'zarr'. Default 'zarr'
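When the pipeline is launched from a script, the --{step_name}_args "{args}" pattern can be assembled programmatically. The following is a minimal sketch under stated assumptions: format_step_args is a hypothetical helper (it is not part of the pipeline), and the step names job_dispatch and preprocessing are the ones used in the example run command of this README.

```python
import shlex


def format_step_args(step_args):
    """Turn {"step": ["--flag", "value", ...]} into Nextflow CLI tokens.

    Each step becomes a pair of tokens: "--{step_name}_args" followed by
    one string holding all the arguments for that step.
    """
    tokens = []
    for step_name, args in step_args.items():
        if not args:
            continue  # nothing to pass for this step
        tokens.append(f"--{step_name}_args")
        tokens.append(" ".join(str(a) for a in args))
    return tokens


tokens = format_step_args(
    {
        "job_dispatch": ["--input", "spikeglx"],
        "preprocessing": ["--debug", "--debug-duration", 120],
    }
)

# shlex.join quotes each token so the result can be pasted into a shell
print(shlex.join(tokens))
```

Keeping each step's arguments in a single token matters: the whole string must reach the step as one quoted value, otherwise Nextflow would interpret the inner flags as its own parameters.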

In Nextflow, the -resume argument enables the caching mechanism.

Local deployment


To deploy locally, you need to install:

Please check out the Nextflow and Docker installation instructions.

To install and configure figurl, you need to follow these instructions to set up kachery-cloud:

  1. On your local machine, run pip install kachery-cloud
  2. Run kachery-cloud-init, open the printed URL link and log in with your GitHub account
  3. Go to and create a new Client:
    • Click on the Client tab on the left
    • Add a new client (you can choose any label)
  4. Set kachery-cloud credentials on your local machine:
    • Click on the newly created client
    • Set the KACHERY_CLOUD_CLIENT_ID environment variable to the Client ID content
    • Set the KACHERY_CLOUD_PRIVATE_KEY environment variable to the Private Key content
    • (optional) If using a custom Kachery zone, set KACHERY_ZONE environment variable to your zone

By default, kachery-cloud will use the default zone, which is hosted by the Flatiron Institute. If you plan to use this service extensively, it is recommended to create your own kachery zone.
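Before launching the pipeline, it can be useful to confirm that the credentials are visible in the current environment. The check below is a minimal sketch that only uses the variable names listed in the steps above. The helper missing_kachery_vars is hypothetical, and the check does not contact kachery-cloud or validate the credentials.

```python
import os

# Variables set in step 4; KACHERY_ZONE is only needed for a custom zone
REQUIRED_VARS = ("KACHERY_CLOUD_CLIENT_ID", "KACHERY_CLOUD_PRIVATE_KEY")


def missing_kachery_vars(environ=None):
    """Return the required kachery-cloud variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_kachery_vars()
    if missing:
        print("Missing kachery-cloud variables:", ", ".join(missing))
    else:
        zone = os.environ.get("KACHERY_ZONE", "default")
        print("kachery-cloud credentials found; zone:", zone)
```

Passing a dictionary instead of relying on os.environ makes the check easy to test and to reuse when building the environment for a job script.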


Clone this repo with git clone and go to the pipeline folder. The Nextflow script in that folder is accompanied by the nextflow_local.config file and can run on local workstations/machines.

To invoke the pipeline you can run the following command:

NXF_VER=22.10.8 DATA_PATH=$PWD/../data RESULTS_PATH=$PWD/../results \
    nextflow -C nextflow_local.config run \
    -log $RESULTS_PATH/nextflow/nextflow.log \
    --n_jobs 8 -resume

The DATA_PATH specifies the folder where the input files are located. The RESULTS_PATH points to the output folder, where the data will be saved. The --n_jobs argument specifies the number of parallel jobs to run.
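The same invocation can be prepared from Python, for example to loop over several datasets. This is a minimal sketch, not an official launcher: build_launch is a hypothetical helper, and the Nextflow script to run is a required argument because its name is not given here. The sketch places -C and -log before run, since both are top-level Nextflow options rather than options of the run command.

```python
import os
from pathlib import Path


def build_launch(script, data_path, results_path, n_jobs=8, resume=True,
                 nxf_ver="22.10.8", config="nextflow_local.config"):
    """Return (command, env) for a local run; nothing is executed here."""
    results_path = Path(results_path)

    # Copy the current environment and add the variables the pipeline reads
    env = dict(os.environ)
    env.update(
        NXF_VER=nxf_ver,
        DATA_PATH=str(data_path),
        RESULTS_PATH=str(results_path),
    )

    command = [
        "nextflow",
        "-C", config,
        "-log", str(results_path / "nextflow" / "nextflow.log"),
        "run", str(script),
        "--n_jobs", str(n_jobs),
    ]
    if resume:
        command.append("-resume")  # enable Nextflow caching
    return command, env


# "path/to/script.nf" is a placeholder for the Nextflow script in the pipeline folder
command, env = build_launch("path/to/script.nf", "path/to/data", "path/to/results")
print(" ".join(command))
```

To actually start the run, pass the result to subprocess.run(command, env=env, check=True) from within the pipeline folder, so that the relative config path resolves.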

Additional parameters can be passed as described in the Parameters section.

Example run command

As an example, here is how to run the pipeline on a SpikeGLX dataset in debug mode on a 120-second snippet of the recording with 16 jobs:

NXF_VER=22.10.8 DATA_PATH=path/to/data_spikeglx RESULTS_PATH=path/to/results_spikeglx \
    nextflow -C nextflow_local.config run --n_jobs 16 \
    --job_dispatch_args "--input spikeglx" --preprocessing_args "--debug --debug-duration 120"

Caveats of local deployment

While the pipeline can be deployed locally on a workstation or a server, it is recommended to deploy it on a cluster or on a batch processing system (e.g., AWS Batch). When deploying locally, the most resource-intensive processes (preprocessing, spike sorting, postprocessing) are not parallelized, to avoid overloading the system. This is achieved by setting the maxForks 1 directive in such processes.

SLURM deployment

To deploy on a SLURM cluster, you need to have access to a SLURM cluster and have Nextflow and Singularity/Apptainer installed. To use Figurl cloud visualizations, follow the same steps described in the Local deployment - Requirements section and set the KACHERY environment variables.

Then, you can submit the pipeline to the cluster similarly to the Local deployment, but wrapping the command into a script that can be launched with sbatch.

To avoid downloading the Docker images in the current location (usually the home folder), you can set the NXF_SINGULARITY_CACHEDIR environment variable to a different location.

You can use the script as a template to submit the pipeline to your cluster.

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4GB
#SBATCH --time=2:00:00
### change {your-partition} to the partition/queue on your cluster
#SBATCH --partition={your-partition}

# modify this section to make the nextflow command available to your environment
# e.g., using a conda environment with nextflow installed
conda activate env_nf


    -C $PIPELINE_PATH/pipeline/nextflow_slurm.config \
    -log $RESULTS_PATH/nextflow/nextflow.log \
    run $PIPELINE_PATH/pipeline/ \
    -work-dir $WORKDIR \
    --preprocessing_args "--debug --debug-duration 120"  # additional parameters

You should change the --partition parameter to match the partition you want to use on your cluster and point to the correct paths and parameters.

Then, you can submit the script to the cluster with:


Create a custom layer for data ingestion

The default job-dispatch step only supports loading data from AIND folders, SpikeGLX folders, and NWB files.

To ingest other types of data, you can create a similar repo and modify the way that the job list is created (see the job dispatch README for more details).

Then you can create a modified job_dispatch process to point to your custom job dispatch repo.