AllenInstitute / sonata

Collaboration between BBP and AIBS
BSD 3-Clause "New" or "Revised" License
53 stars 33 forks source link

Three levels of cell grouping are confusing #3

Open arsenius7 opened 6 years ago

arsenius7 commented 6 years ago

It seems we have three(!) ways to define a circuit with heterogeneous nodes: 1) multiple "nodes" entries in circuit config 2) multiple populations within one H5 file 3) multiple node groups within one population

For a new user like me the benefit of that is non-obvious. At first glance, having homogeneous single-population per H5 file would make things much simpler while still allowing to combine different populations within one circuit / simulation. Same story for the edges. Is there anything I'm missing? Are these extra hierarchy levels unavoidable?

hernando commented 6 years ago

This has been discussed in length and populations and node groups are unavoidable. Points 1 and 2 are implementation details of the agreed data model.

  1. Multiple node entries in the circuit config may be used for subcircuits that have been created independently and are later connected. Think of thalamus, cortex, hippocampus here.
  2. Multiple populations inside the same file are allowed to not enforce a particular file layout. It also allows merging multiple population files in a single one. I cannot give more reasons to justify this other than this flexibility. In any case, we need to have either 1 or 2, and we opted for both.
  3. Node groups are a completely different thing than populations. Inside a brain region, modelers may consider to use qualitatively different cell models which still comprise a single functional block logically, which justifies having a single population for these cells. While the population allows a single index for the cell block, necessary for connectivity, cell groups allow having different properties in each subset, which are necessary for configuring the simulation models.
arsenius7 commented 6 years ago

I have no doubt that all the three appeared there for a reason. What I suggest is to take a look back and to minimize the result. For instance, as you have pointed out

In any case, we need to have either 1 or 2, and we opted for both.

Can we leave only one?


Multiple populations inside the same file are allowed to not enforce a particular file layout.

Not sure I got this part.

arsenius7 commented 6 years ago

I understand (more or less :) ) how node groups are different from populations, but frankly, I don't see how does it justify having extra hierarchy level. Anyway, I see that with that we risk getting back to square 1. The message I'd like to communicate here is that what might be obvious for developers of the format, is not that obvious for the first time users, which undermines its acceptance. So we should either simplify it or provide stronger argument for seemingly excessive flexibility.

hernando commented 6 years ago

Think of the h5 groups as the data space and files as the archiving. In the data space we have populations and node groups, this can't get simpler given the requirements. The spec doesn't enforce how the data space is split in files for archiving (that's what I meant). It's not up to me to decide whether the specification should be more strict to simplify parsing, but I don't think it adds a lot of extra complexity in this case. You may think that a 1:1 population file connection makes it easier look up for populations, but as long as the circuit config is not dictionary and the file names have no relationship to the names of the populations they contain, the difference between having one population per file or multiple is minor in my opinion. The implementation needs to create the dictionary itself and the difference is just a single for loop if code is well structured.

hernando commented 6 years ago

In the data model, the hierarchy has only two levels: populations and node groups, Which is the extra level you are referring two? Remember that file layout is a different problem than data modeling.

Why populations and node groups are needed can be justified in more detail if that's your concern, but that should go in a section that is clearly devoted to just that purpose and not mix it with the description of the data format, otherwise the text becomes cluttered and more complex to read for implementers. If you agree on this, we can rename the issue to the concrete action to be taken and we can discuss who takes care of it in our next meeting with ABI.

arsenius7 commented 6 years ago

In the data model, the hierarchy has only two levels: populations and node groups, Which is the extra level you are referring two? Remember that file layout is a different problem than data modeling.

As I've said, personally I find two levels out of three excessive (multiple node groups per population and multiple populations per HDF5), given that we can use multiple "nodes" in circuit config, but I don't want to reincarnate the previous discussion.

I would find some comfort if we can get rid at least of multiple population per HDF5 in favor of "nodes" in circuit config being a dict. :)

hernando commented 6 years ago

I'm not sure the changes you are requesting are going to be accepted. Adding the population -> (node.h5, node_types.csv) table to the .json is not a minor change and has it's own problems. For example, this way the population name appears twice and needs to be kept consistent. The solution would be to flatten the nodes file and remove the population h5 group, the population name would come from the circuit config then. And this is going back to square one as you said.

get rid at least of multiple population per HDF5 in favor of "nodes" in circuit config being a dict

You seem to imply a dichotomy which to me is false. If a circuit config is changed to be a dictionary, there wouldn't be any issue with having multiple populations per file.

arsenius7 commented 6 years ago

node.h5, node_types.csv

This pair also looks... unconventional for those who are not aware of the history of the discussion, btw. While I can see argument for both HDF5 and CSV representation, the necessity to define node collection by combining the two is beyond my reach (i.e. if the additional flexibility obtained by splitting single entity into two files is really worth it). But that is beyond this issue in any case.

hernando commented 5 years ago

After some discussion we have agreed that for v2 we will consider using implicit indexing for node and edge ids because there is no obvious sensible use case that would require gaps in the population id datasets. Implicit Ids would imply removing the node_id, node_group_index, node_group_id and moving node_type_id to groups (and the same for edges).