flatironinstitute / CaImAn

Computational toolbox for large scale Calcium Imaging Analysis, including movie handling, motion correction, source extraction, spike deconvolution and result visualization.
https://caiman.readthedocs.io
GNU General Public License v2.0

Lazy arrays for CNMF #1167

Closed kushalkolar closed 9 months ago

kushalkolar commented 1 year ago

Implements #1155

Adds:

1. LazyArray base class, which has the relevant parts of the numpy ndarray structure so it behaves like a numpy array. Its subclasses:

  1. LazyArrayRCM, for the reconstructed movie A * C.
  2. LazyArrayRCB, for the reconstructed background b * f. This is actually just a subclass of LazyArrayRCM to keep the two separate. I am not sure it's really necessary; we could have one class, LazyDenseMovie or something, which works for both.
  3. LazyArrayResiduals

These are very useful because they also compute projection images without computing the full dense array.

These should work for visualization in most tools that take numpy-like arrays. It should work in Johannes' ipycanvas based player, and it is already known to work in fastplotlib with random access slicing.
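As a rough illustration of the idea (not the actual LazyArrayRCM implementation in this PR; the class and method names below are made up), frames are reconstructed on demand in `__getitem__` and projections come straight from the factors, so the dense movie never has to exist in RAM:

```python
import numpy as np

class LazyReconstructedMovie:
    """Minimal sketch: lazily evaluates A @ C one slice at a time."""

    def __init__(self, spatial, temporal, dims):
        self.spatial = spatial      # A, shape (n_pixels, n_components)
        self.temporal = temporal    # C, shape (n_components, n_frames)
        self.dims = dims            # (rows, cols) of a single frame

    @property
    def shape(self):
        return (self.temporal.shape[1], *self.dims)

    def __getitem__(self, index):
        # compute only the requested frames: A @ C[:, index]
        frames = self.spatial @ self.temporal[:, index]
        return frames.T.reshape(-1, *self.dims).squeeze()

    def mean_image(self):
        # mean over time of A @ C equals A @ mean(C, axis=1), so no dense array is needed
        return (self.spatial @ self.temporal.mean(axis=1)).reshape(self.dims)
```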

Thoughts:

  1. Do we want to support saving the dense array to disk, such as to hdf5, which can be lazily accessed? I'm not sure how useful this would be because the arrays can be huge, in the hundreds-of-gigabytes range. (A rough sketch of what chunked writing could look like is below, after this list.)

  2. I have never worked with 3D data, but this shouldn't be difficult to expand to 3D. Johannes might be the right person to ask?
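If we do want that, a chunk-wise export would keep memory bounded. A hedged sketch with h5py (the dataset name, chunk size, and the `rcm` variable standing in for one of the lazy arrays above are all assumptions for illustration):

```python
import h5py

def export_to_hdf5(rcm, path, chunk=500):
    # write a lazy array to an HDF5 dataset chunk-by-chunk along time,
    # so only `chunk` dense frames are ever in RAM at once
    n_frames = rcm.shape[0]
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "reconstructed",
            shape=rcm.shape,
            dtype="float32",
            chunks=(min(chunk, n_frames), *rcm.shape[1:]),  # chunked for lazy reads later
        )
        for start in range(0, n_frames, chunk):
            dset[start:start + chunk] = rcm[start:start + chunk]
```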

main todo:

Thoughts and a big-picture review would be useful before a more detailed review.

EricThomson commented 1 year ago

It will probably take me a while to really think about this properly in the weeds, but big picture thoughts for now (also forgive me if this is somewhat terse; I'm jet lagged and have to rush to something soon :smile: ):

1. Does this belong in cnmf, utils, or visualization? Is this mainly an aid for visualizing? If so, should it go with utils or vis? Speaking generally, isn't this more generally applicable than cnmf, which is a folder fairly explicitly reserved for the cnmf algorithm machinery, while other stuff goes elsewhere? So (again, thinking big-picture modular decomposition here), where should this really go? I'm thinking not the cnmf dir, as this can also ultimately be really useful for motion correction visualizations and things like that (and future decompositions if we provide them).
2. I don't love the naming: RCM/RCB are names people won't understand by looking at them. We should think of more transparent names. I know 'reconstructed movie' and 'reconstructed background' are probably too long to write out fully, but rcb and rcm are abbreviations, and I prefer not to have full-blown abbreviations that people will have to learn.
3. Background will ultimately be general, not just CNMF, right? Currently bg is just CNMF, but we'd need CNMFE added (i.e., need b0 and W for CNMFE), right? This is probably important to get updated on the first pass.
4. I would prefer to use the more numpythonic @ for matrix multiplication. Last time I checked there was no difference in terms of profiling, but semantically I think @ is better/more transparent than dot, which will tend to confuse people who will think "Why are they doing a dot product rather than matrix multiplication?" As time goes on this will get more confusing to modern readers. I know caiman has dot all over the place, but that's because it is basically legacy linear algebra code from before numpy changed the notation, and we should just update it everywhere at some point. [Hackathon! :smile: ]

These are just my big picture thoughts for now; I will have time to go into the weeds next week.

kushalkolar commented 1 year ago

> It will probably take me a while to really think about this properly in the weeds, but big picture thoughts for now (also forgive me if this is somewhat terse; I'm jet lagged and have to rush to something soon 😄 ):

> 1. Does this belong in cnmf, utils, or visualization? Is this mainly an aid for visualizing? If so, should it go with utils or vis? Speaking generally, isn't this more generally applicable than cnmf, which is a folder fairly explicitly reserved for the cnmf algorithm machinery, while other stuff goes elsewhere? So (again, thinking big-picture modular decomposition here), where should this really go? I'm thinking not the cnmf dir, as this can also ultimately be really useful for motion correction visualizations and things like that (and future decompositions if we provide them).

This is not a visualization aid, it's a back-end data array structure that makes larger-than-RAM array visualization possible. You can use it for calculations beyond just visualization. This is very CNMF specific; it's not useful for, and completely unrelated to, motion correction visualization.

> 2. I don't love the naming: RCM/RCB are names people won't understand by looking at them. We should think of more transparent names. I know 'reconstructed movie' and 'reconstructed background' are probably too long to write out fully, but rcb and rcm are abbreviations, and I prefer not to have full-blown abbreviations that people will have to learn.

Any ideas?

> 3. Background will ultimately be general, not just CNMF, right? Currently bg is just CNMF, but we'd need CNMFE added (i.e., need b0 and W for CNMFE), right? This is probably important to get updated on the first pass.

Right, I will figure this out.

> 4. I would prefer to use the more numpythonic @ for matrix multiplication. Last time I checked there was no difference in terms of profiling, but semantically I think @ is better/more transparent than dot, which will tend to confuse people who will think "Why are they doing a dot product rather than matrix multiplication?" As time goes on this will get more confusing to modern readers. I know caiman has dot all over the place, but that's because it is basically legacy linear algebra code from before numpy changed the notation, and we should just update it everywhere at some point. [Hackathon! 😄 ]

I tested this and yes, it's the same between dot and @. BUT np.outer() behaves oddly for some reason.

[screenshot: timing comparison of .dot(), @, and np.outer()]
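For reference, a rough way to reproduce the dot vs @ part of the comparison (not the exact benchmark in the screenshot; array sizes are arbitrary):

```python
import timeit
import numpy as np

A = np.random.rand(128 * 128, 50)   # spatial components, (n_pixels, K)
C = np.random.rand(50, 1000)        # temporal components, (K, n_frames)

# .dot() and @ dispatch to the same BLAS matmul for 2-D arrays,
# so their timings should be essentially identical
print("dot:", timeit.timeit(lambda: A.dot(C), number=5))
print("@  :", timeit.timeit(lambda: A @ C, number=5))
```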

kushalkolar commented 1 year ago

So for LazyArrayResiduals, all image projections except std are implemented. Computing std(Y - (A * C) - (b * f)) is non-trivial; this is probably best handled with dask, so we can hopefully have it in the future!
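For context, a hedged sketch of how the std projection could eventually be computed with dask, chunking over time so the dense residual never lives in RAM (the names Y, A, C, b, f, dims and the chunk size are assumptions for illustration, not this PR's API):

```python
import dask.array as da

chunk = 1000  # frames per chunk, arbitrary

Yd = da.from_array(Y, chunks=(chunk, -1))                                # raw movie, (n_frames, n_pixels)
AC = da.from_array(A, chunks=-1) @ da.from_array(C, chunks=(-1, chunk))  # (n_pixels, n_frames)
BF = da.from_array(b, chunks=-1) @ da.from_array(f, chunks=(-1, chunk))  # (n_pixels, n_frames)

residuals = Yd - AC.T - BF.T                                # still lazy, nothing computed yet
std_image = residuals.std(axis=0).compute().reshape(dims)   # chunk-wise reduction over time
```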

@EricThomson we will need to chat sometime about b0 and W, I'm not familiar with it :D

kushalkolar commented 1 year ago

This shows how useful the residuals projection can be: left is K = 1, right is K = 4. You can see neurons left over in the residuals projection.

[screenshot: residuals projections, left K = 1, right K = 4]

kushalkolar commented 1 year ago

I should make the min and max properties into methods, and it actually should be possible to get min and max across an axis. Implement nanmin() and nanmax() as just aliases to min() and max().
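A small sketch of what min over the time axis could look like for the reconstructed movie, streaming over frame chunks so the dense array is never materialized (the function name and chunk size are made up for illustration; this is also presumably why nanmin()/nanmax() can simply alias min()/max(), since the factors shouldn't contain NaNs):

```python
import numpy as np

def lazy_min_over_time(spatial, temporal, dims, chunk=500):
    # spatial: A, (n_pixels, K); temporal: C, (K, n_frames); returns the min projection image
    running = np.full(spatial.shape[0], np.inf)
    for start in range(0, temporal.shape[1], chunk):
        block = spatial @ temporal[:, start:start + chunk]   # dense (n_pixels, <=chunk) block
        running = np.minimum(running, block.min(axis=1))     # fold into the running minimum
    return running.reshape(dims)
```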

EricThomson commented 1 year ago

> So for LazyArrayResiduals, all image projections except std are implemented. Computing std(Y - (A * C) - (b * f)) is non-trivial; this is probably best handled with dask, so we can hopefully have it in the future!

* Basically, standard deviation is not distributive: `std(Y) - std(A * C) - std(b * f)` != `std(Y - (A * C) - (b * f))`, because `sqrt(x - y) != sqrt(x) - sqrt(y)` (and the variance of a difference also picks up covariance terms). A quick numerical check is below, after this list.

* The rest of the operations are distributive, which is why we can exploit these tricks :D
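A quick numerical check of that point with arbitrary random arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
y, ac, bf = rng.random(1000), rng.random(1000), rng.random(1000)

# mean is linear, so it distributes over the subtraction...
print(np.isclose((y - ac - bf).mean(), y.mean() - ac.mean() - bf.mean()))  # True
# ...but std does not
print(np.isclose((y - ac - bf).std(), y.std() - ac.std() - bf.std()))      # False
```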

> @EricThomson we will need to chat sometime about b0 and W, I'm not familiar with it :D

I just found two methods in cnm.estimates I hadn't seen before: compute_background() and compute_residuals(), and both handle the ring model and all its funky tributaries nicely.

pgunn commented 1 year ago

> 4. I would prefer to use the more numpythonic @ for matrix multiplication. Last time I checked there was no difference in terms of profiling, but semantically I think @ is better/more transparent than dot, which will tend to confuse people who will think "Why are they doing a dot product rather than matrix multiplication?" As time goes on this will get more confusing to modern readers. I know caiman has dot all over the place, but that's because it is basically legacy linear algebra code from before numpy changed the notation, and we should just update it everywhere at some point. [Hackathon! 😄 ]

If we think it should be done, let's just do it. Converting all instances of .dot() to @ is drudgework; not really suitable for a hackathon. I can probably do it all with regexes without taking much time. Maybe even yapf can do it for us, or some other tool.

kushalkolar commented 1 year ago

> 4. I would prefer to use the more numpythonic @ for matrix multiplication. Last time I checked there was no difference in terms of profiling, but semantically I think @ is better/more transparent than dot, which will tend to confuse people who will think "Why are they doing a dot product rather than matrix multiplication?" As time goes on this will get more confusing to modern readers. I know caiman has dot all over the place, but that's because it is basically legacy linear algebra code from before numpy changed the notation, and we should just update it everywhere at some point. [Hackathon! 😄 ]

> If we think it should be done, let's just do it. Converting all instances of .dot() to @ is drudgework; not really suitable for a hackathon. I can probably do it all with regexes without taking much time. Maybe even yapf can do it for us, or some other tool.

Not every place where .dot() is used is meant to be a matrix multiplication, so it would be dangerous to do it like that, and without tests for all the linalg in the codebase I wouldn't do it. There is nothing we gain from changing every .dot() to @ where it's possible; I think most people familiar with numpy will know that .dot() and @ behave the same when it comes to matrix multiplication.

kushalkolar commented 1 year ago

So I guess we've arrived at ModelMovie, ModelBackground, ModelResiduals?

@EricThomson

EricThomson commented 1 year ago

> So I guess we've arrived at ModelMovie, ModelBackground, ModelResiduals?
>
> @EricThomson

I'm ok with that or Reconstructed or Estimated. In my opinion -- you pick, ultimately any of them work :smile:

pgunn commented 1 year ago

> So I guess we've arrived at ModelMovie, ModelBackground, ModelResiduals? @EricThomson

> I'm ok with that or Reconstructed or Estimated. In my opinion -- you pick, ultimately any of them work 😄

Model is shorter; that's the right choice unless we can find something similarly short that we like better.

EricThomson commented 1 year ago

> So I guess we've arrived at ModelMovie, ModelBackground, ModelResiduals?
>
> @EricThomson

Thinking about this more: one option for the three parts is NeuralModel (this is A*C, right?), BackgroundModel (this is going to be b*f or the ring model of the background), and Residual (this is basically Y - NeuralModel - BackgroundModel, so it's the error term, all the stuff the model doesn't capture, and isn't really technically part of the model at all).

Or even "movie". So something like NeuralMovie, BackgroundMovie, and ResidualMovie (maybe bad, just spitballing again).

pgunn commented 1 year ago

> So I guess we've arrived at ModelMovie, ModelBackground, ModelResiduals? @EricThomson
>
> Thinking about this more: one option for the three parts is NeuralModel (this is A*C, right?), BackgroundModel (this is going to be b*f or the ring model of the background), and Residual (this is basically Y - NeuralModel - BackgroundModel, so it's the error term, all the stuff the model doesn't capture, and isn't really technically part of the model at all).
>
> Or even "movie". So something like NeuralMovie, BackgroundMovie, and ResidualMovie (maybe bad, just spitballing again).

I'm not a huge fan of Movie, but Caiman already uses the term so maybe we're committed to it. I wonder if an interesting first term might be Focus, e.g. MovieFocus, MovieBackg, MovieResids

EricThomson commented 1 year ago

> So I guess we've arrived at ModelMovie, ModelBackground, ModelResiduals? @EricThomson
>
> Thinking about this more: one option for the three parts is NeuralModel (this is A*C, right?), BackgroundModel (this is going to be b*f or the ring model of the background), and Residual (this is basically Y - NeuralModel - BackgroundModel, so it's the error term, all the stuff the model doesn't capture, and isn't really technically part of the model at all). Or even "movie". So something like NeuralMovie, BackgroundMovie, and ResidualMovie (maybe bad, just spitballing again).
>
> I'm not a huge fan of Movie, but Caiman already uses the term so maybe we're committed to it. I wonder if an interesting first term might be Focus, e.g. MovieFocus, MovieBackg, MovieResids

Actually, good point that these could get confused with the caiman movie object, and one reason I didn't like it is that it slips back to naming by metadata (what it is) rather than by what kind of information it contains. That is, the info the three terms contain is the model's predictions of neural activity, the model's predictions of background activity, and the residual (i.e., the stuff the model leaves out, the error term).

pgunn commented 1 year ago

> Actually, good point that these could get confused with the caiman movie object, and one reason I didn't like it is that it slips back to naming by metadata (what it is) rather than by what kind of information it contains. That is, the info the three terms contain is the model's predictions of neural activity, the model's predictions of background activity, and the residual (i.e., the stuff the model leaves out, the error term).

On that narrow point, I don't think it's bad (at all) to name classes partly by implementation details (provided the name doesn't get too long). It's super common in programming to do so; the distinction between an int and a long is entirely an implementation detail, even though both are meant to hold numbers. There are commonly a few ways to represent something anyhow, and they often should have distinct names based on the how.

EricThomson commented 1 year ago

> Actually, good point that these could get confused with the caiman movie object, and one reason I didn't like it is that it slips back to naming by metadata (what it is) rather than by what kind of information it contains. That is, the info the three terms contain is the model's predictions of neural activity, the model's predictions of background activity, and the residual (i.e., the stuff the model leaves out, the error term).
>
> On that narrow point, I don't think it's bad (at all) to name classes partly by implementation details (provided the name doesn't get too long). It's super common in programming to do so; the distinction between an int and a long is entirely an implementation detail, even though both are meant to hold numbers. There are commonly a few ways to represent something anyhow, and they often should have distinct names based on the how.

Yes, but here we are arguing about space, so we want to eliminate words, and in Python it is considered a code smell to put type flags in variable names, e.g. data_float. But a foolish consistency and all that -- it can sometimes be useful.