BlueBrain / libsonata

A python and C++ interface to the SONATA format
https://libsonata.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Return a structured array for the spikes in python #285

Closed: jorblancoa closed this 1 year ago

jorblancoa commented 1 year ago
ferdonline commented 1 year ago

Thanks @jorblancoa. I think the interface is really nice. I'm not very happy, however, with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library". To recap: we are converting from two columns in HDF5 to a vector of structs, and then here back to columns again. Besides, in get() there is an unconditional copy of all spikes for filtering that is rarely requested (report_reader.cpp#L136). Maybe we should go back to requirements: do we need the get() interface as-is? Is it really worth having columnar data for data that is invariably two columns?
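
To make the two layouts under discussion concrete, here is a minimal Python sketch (not the library's actual code; the values are made up). The file stores two flat columns, while get() hands back row-oriented pairs:

    import numpy as np

    # Columnar, as the data sits in the HDF5 file: two parallel datasets.
    node_ids = np.array([10, 25, 10], dtype=np.uint64)
    timestamps = np.array([0.1, 0.1, 0.2])

    # Row-oriented, as the current Python get() returns it: (node_id, timestamp) pairs.
    spikes_as_pairs = list(zip(node_ids.tolist(), timestamps.tolist()))

    # Going back to columns again means yet another pass over the data.
    ids_again, times_again = map(np.array, zip(*spikes_as_pairs))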

mgeplf commented 1 year ago

I'm not very happy, however, with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library".

I expect there needs to be a new C++ method that builds these separated vectors, which is what I was getting at here: https://github.com/BlueBrain/neurodamus/pull/8/files#discussion_r1273085740

Besides, in get() there is an unconditional copy of all spikes for filtering that is rarely requested (report_reader.cpp#L136).

That copy should be fixed. The filtering, however, is important.

Maybe we should go back to requirements: do we need the get() interface as-is? Is it really worth having columnar data for data that is invariably two columns?

On the C++ side, I think it's fine; keeping that API stable is important. On the Python side, a list of tuples is not great, but API stability is more important.

Once we have the new Python methods in place, we can start issuing warnings that people may get better performance by switching to them. Then, perhaps, we can deprecate the old interface in the future.
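
A minimal sketch of what such a soft deprecation could look like on the Python side (the wrapper function and warning text are illustrative, not part of the library):

    import warnings

    def get_spikes(population, selection):
        """Call the old get(), nudging users towards the columnar accessor."""
        warnings.warn(
            "get() returns a list of (node_id, timestamp) tuples; the new "
            "columnar accessor is usually faster for large spike reports",
            PendingDeprecationWarning,
            stacklevel=2,
        )
        return population.get(selection)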

jorblancoa commented 1 year ago

Shall we merge this in order to create a tag together with the LFP changes? @mgeplf

mgeplf commented 1 year ago

I'm not very happy, however, with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library".

I think @ferdonline's point above still stands. I think we should do this with a less convoluted data path before it gets committed.

jorblancoa commented 1 year ago

Does that belong in this PR?

mgeplf commented 1 year ago

I would say so; the extra array created by https://github.com/BlueBrain/libsonata/pull/285/files#diff-cc07100b7c7235ddf263fda0d515a7b40ed1fe5b871714e0ab5c1ec5298ddb0dR1193 can be avoided if some internal methods were added to get the ids and timestamps separately.

jorblancoa commented 1 year ago

I would say so; the extra array created by https://github.com/BlueBrain/libsonata/pull/285/files#diff-cc07100b7c7235ddf263fda0d515a7b40ed1fe5b871714e0ab5c1ec5298ddb0dR1193 can be avoided if some internal methods were added to get the ids and timestamps separately.

Makes sense, but then the best option would be to have them as members in report_reader.h, right? Otherwise the spikes need to be used anyway for filtering and then converted to the two arrays.

mgeplf commented 1 year ago

Makes sense, but then the best option would be to have them as members in report_reader.h, right?

Yeah, I think I understand what you mean and that would make sense.

jorblancoa commented 1 year ago

Since we can't reuse the existing code for filtering, and Fernando's use case only needs the two raw arrays, would you be OK with that? @ferdonline @mgeplf

mgeplf commented 1 year ago

Don't we still have the convoluted path of the data, which was one of the main things that needed to be improved? It looks like getArrays calls get, which makes an unconditional copy of all spikes via createSpikes creating pairs from the underlying node_ids and timestamps. Back in getArrays, these pairs are destructured into two new arrays. Can't we create arrays of only the data that is required (i.e. based on the gids/times), so they aren't large, and create them in the format that is required? Having an iterator would probably make that easier.
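
What that could look like, sketched in numpy terms (the real code would live on the C++ side; the function and argument names here are hypothetical):

    import numpy as np

    def filter_spikes(node_ids, timestamps, wanted_ids, tstart, tstop):
        """Filter two parallel columns directly, never materializing pairs."""
        mask = np.isin(node_ids, wanted_ids)
        mask &= (timestamps >= tstart) & (timestamps <= tstop)
        return node_ids[mask], timestamps[mask]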

ferdonline commented 1 year ago

@mgeplf IIUC, at line 162 there's the fast path. If the client wants to filter the data, then the copy happens into a vector<pair>, which IMO is fair since that layout is ideal for sorting the two columns together. We could potentially further optimize the case of filtering without ordering and avoid the copy, but I don't think we have that use case now, so I'd agree to leave it as is and revisit it at a later stage.
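
For reference, the sort-together case can also be expressed on the two columns directly, without pairing them up first; a hypothetical numpy sketch of that alternative:

    import numpy as np

    def sort_spikes_by_time(node_ids, timestamps):
        """Sort two parallel columns together via one index permutation."""
        order = np.argsort(timestamps, kind="stable")
        return node_ids[order], timestamps[order]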

mgeplf commented 1 year ago

But we want to be able to filter things; that's an important use case. It seems weird to me that users of a library should have to know there is a fast path, and that filtering or sorting knocks them off it, rather than the library doing the right thing on its side.

On the neurodamus side, isn't it useful to only load spikes that are going to be used in the simulation rather than loading all of them? For instance, if the simulation is going to run for 1 s, but the spikes file contains 10 s worth of spikes? What about the case where the simulation only uses a subset of node_ids? Wouldn't filtering the data before it all gets loaded be valuable? These are the sorts of optimizations that can be done here, and they would benefit everyone using the library.

They're also useful right now: when people do analysis, they usually look at a subset of the data.
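
Concretely, the kind of pre-filtered read being argued for might look like this in Python (a sketch: the file name is made up, and the tstart/tstop keyword arguments are assumed to exist alongside the node_ids filter used later in this thread):

    import libsonata

    spikes = libsonata.SpikeReader("spikes.h5")
    pop = spikes["All"]

    # Only load the node subset and the time window the simulation will
    # actually use (spike times assumed to be in milliseconds).
    subset = pop.get(libsonata.Selection([1, 2, 3]), tstart=0.0, tstop=1000.0)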

jorblancoa commented 1 year ago

Don't we still have the convoluted path of the data, which was one of the main things that needed to be improved? It looks like getArrays calls get, which makes an unconditional copy of all spikes via createSpikes creating pairs from the underlying node_ids and timestamps. Back in getArrays, these pairs are destructured into two new arrays. Can't we create arrays of only the data that is required (i.e. based on the gids/times), so they aren't large, and create them in the format that is required? Having an iterator would probably make that easier.

That makes sense. I wanted to reuse code, but it's true that when calling getArrays it doesn't make sense to call get() to build the pairs only for the filtering. I pushed some changes to filter directly in the getArrays() method.

ferdonline commented 1 year ago

On the neurodamus side, isn't it useful to only load spikes that are going to be used in the simulation rather than loading all of them?

In neurodamus we can't always filter ahead of time because of setups like coreneuron + save-restore, so it was OK. But I get that maybe for other uses we want filters more often.

mgeplf commented 1 year ago

Comparing get_dict() to the get() method, it seems like there is a performance regression when getting a subset of ids:

    import libsonata

    # Assumed context, not shown in the original snippet: `sr` is a
    # libsonata.SpikeReader opened on the file under test, `ids` is the
    # list of node ids to select, and `timeit` is a local benchmark helper.

    def read_all_subset_nodes():
        sp = sr['All']
        node_ids = libsonata.Selection(ids)
        spikes = sp.get(node_ids)

    def read_all_subset_nodes_structure():
        sp = sr['All']
        node_ids = libsonata.Selection(ids)
        spikes = sp.get_dict(node_ids)

    print("read_all_subset_nodes")
    timeit(read_all_subset_nodes)

    print("read_all_subset_nodes_structure")
    timeit(read_all_subset_nodes_structure)

Gives:

************* spikes-decimated-unsorted.h5
read_all_subset_nodes
1 loop, best of 5: 130.22718537412584 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.7277841931209 per loop

************* spikes-decimated-sorted-by-time.h5
read_all_subset_nodes
1 loop, best of 5: 130.0559630645439 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.6187489759177 per loop

************* spikes-decimated-sorted-by-ids.h5
read_all_subset_nodes
1 loop, best of 5: 96.04782155901194 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.36246633064 per loop

I'm not a big fan of different access mechanisms having very different execution profiles, because that means library consumers have to benchmark everything to know what to use, which isn't very ergonomic.

In this case, we'll have to fix it later.
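
(For anyone reproducing these numbers: the timeit helper in the snippet above is not a standard-library function as called; a minimal stand-in that prints similar "best of 5" output could be:)

    import timeit as _timeit

    def timeit(func, number=1, repeat=5):
        """Print the best of `repeat` runs, similar to `python -m timeit`."""
        best = min(_timeit.repeat(func, number=number, repeat=repeat))
        print(f"{number} loop, best of {repeat}: {best / number} per loop")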

mgeplf commented 1 year ago

Comparing get_dict() to the get() method, it seems like there is a performance regression when getting a subset of ids:

I should also add that the fast path (i.e. get_dict()) is much faster (new vs old): 0.36195594631135464 vs 9.994007629342377, so it's very worthwhile.

mgeplf commented 1 year ago

Nice, fixing the regression makes a big difference:

************* spikes-decimated-unsorted.h5
read_all_subset_nodes
1 loop, best of 5: 131.17122020898387 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.58019144897116 per loop

************* spikes-decimated-sorted-by-time.h5
read_all_subset_nodes
1 loop, best of 5: 131.06627149000997 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.91921105398796 per loop

************* spikes-decimated-sorted-by-ids.h5
read_all_subset_nodes
1 loop, best of 5: 96.16886089701438 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.54882664396428 per loop

mgeplf commented 1 year ago

Thanks @jorblancoa!