ModECI / MDF

This repository contains the source for the MDF specification and Python API
https://mdf.readthedocs.io
Apache License 2.0

Scalability questions #190

Open Helveg opened 2 years ago

Helveg commented 2 years ago

Hi there! I have some questions about the scalability of MDF. Connectomes between specific cells can usually be stored as some sort of sparse matrix holding the identifiers of the pre- and postsynaptic cells, plus the source and target locations on each cell. However, this leads to scalability issues when transferring that data to the simulator:

  1. Cells are distributed across compute nodes, and most formats require each node to iterate over the entire dataset to filter out the data about its own cells. That is O(N_syn) iteration time to extract 1/Nth of the dataset (for N nodes), and it assumes the iterated data can be stored in a data structure with O(1) lookups. Without O(1) lookups, since you need to query the connections of each cell on your node, you are looking at O(N_syn * (N_cells/nodes)) runtime to look up the connections. Synapses are the most numerous element of a biophysical neural network. On top of that, most classical O(1) lookup data structures, such as hash tables, have large memory requirements, so holding all the data in memory on each node limits the achievable scale. NEURON already hits memory limits on HPC with ring networks of 16k cells on compute nodes with 64GB RAM (see page 6 of https://arxiv.org/pdf/1901.07454.pdf). Imagine having to construct networks with the whole connectome stored in memory, or paying the O(N_syn * (N_cells/nodes)) lookup cost during network construction.
  2. Both NEURON and Arbor (and probably other simulators) need two different aspects of the connectivity information: which connections arrive on this cell, and the inverse, which connections emanate from this cell? How is this addressed, given that optimizing lookup for one often hinders the other? Again, processing all the edges in a network seems suboptimal. Will there be indices or other solutions for both sides of the problem, restrictable to the cells on the node?
  3. Another scaling issue is that tools that parse the format are often single-threaded, and their users often work in desktop environments with strict memory limitations. Is MDF going to provide streaming, partial lookup of smaller chunks, or optimized data structures for such lookups?
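
To make points 1 and 2 concrete, here is a minimal sketch of the kind of two-sided index being asked about. It is not MDF, NeuroML, NEURON, or Arbor functionality; the names `build_index` and `edges_of` and the `(pre, post)` tuple layout are assumptions made purely for illustration. It builds a CSR-style offset table twice over the same edge list, once keyed on the presynaptic cell and once on the postsynaptic cell, so a node can fetch the edges of just its own cells without scanning all N_syn entries per query.

```python
def build_index(edges, key, n_cells):
    """Group edge positions by edges[i][key] using a counting sort.

    Returns (offsets, order): the edges of cell c are
    order[offsets[c]:offsets[c + 1]], as positions into `edges`.
    Build cost is O(N_syn + N_cells); storage is two flat integer lists.
    """
    offsets = [0] * (n_cells + 1)
    for edge in edges:
        offsets[edge[key] + 1] += 1
    # Prefix sum turns per-cell counts into slice boundaries.
    for c in range(n_cells):
        offsets[c + 1] += offsets[c]
    cursor = offsets[:-1]  # slicing copies: next free slot per cell
    order = [0] * len(edges)
    for i, edge in enumerate(edges):
        order[cursor[edge[key]]] = i
        cursor[edge[key]] += 1
    return offsets, order


def edges_of(cell, index, edges):
    """Return the edges of one cell; cost is proportional to its degree."""
    offsets, order = index
    return [edges[i] for i in order[offsets[cell]:offsets[cell + 1]]]


# Hypothetical toy connectome: (pre, post) cell identifiers.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n_cells = 4

by_pre = build_index(edges, key=0, n_cells=n_cells)   # outgoing lookups
by_post = build_index(edges, key=1, n_cells=n_cells)  # incoming lookups

# A node that owns cells 2 and 3 only touches their slices.
local_cells = [2, 3]
incoming = {c: edges_of(c, by_post, edges) for c in local_cells}
outgoing = {c: edges_of(c, by_pre, edges) for c in local_cells}
```

The trade-off in this sketch is that each direction costs one extra integer per synapse plus one per cell. Since every cell's entries are contiguous in `order`, an on-disk version of the same layout could in principle be read in per-cell slices, which bears on the streaming question in point 3.
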

Helveg commented 2 years ago

Additionally: can we expect a publication on MDF that, among other things, properly investigates the scaling and runtime complexity of the code that has to read, write, and otherwise handle it?

pgleeson commented 2 years ago

@Helveg As mentioned in #191, the types of network you are referring to here are more in the domain of NeuroML. Eventually there will be full compatibility "under the hood" between MDF and NML, but for now the issue of standardising formats for large-scale spiking models is more relevant for NML, and getting NeuroMLlite working well with Arbor, Neuron, Nest, etc. is a nearer-term goal. Hope that helps.