brainglobe / brainglobe-utils

Shared general purpose tools for the BrainGlobe project
MIT License

[Feature] compare/replace cell matching code with `scipy.optimize` implementation #77

Closed by alessandrofelder 3 months ago

alessandrofelder commented 3 months ago

**Is your feature request related to a problem? Please describe.** The core developer team is not confident we'll be able to maintain the cell matching code long-term (mainly because I don't feel like I entirely understand each line :sweat_smile: !)

**Describe the solution you'd like** Unless it's a lot less performant, we should consider replacing the core logic of our cell matching functionality with `scipy.optimize.linear_sum_assignment` to reduce the maintenance burden. This is not exactly the same algorithm, but it is related and useful for the same class of problems, AFAICT.

A solution like this would also fit well with the idea of BrainGlobe bringing the neuroanatomical context to the Python ecosystem, rather than re-implementing parts of it.


matham commented 3 months ago

I do agree that we shouldn't re-implement things and should reuse as much as possible.

The main reason I re-implemented it is that every existing implementation that solves this problem expects a cost matrix input of size NxM. For our example with 250k cells, this would have required a 200+ GB matrix for float32.

There may be a way to hack a fake numpy array that calls a function to compute the distance for each access to `cost[i, j]`, but I can't imagine that would be very performant. Perhaps we can try to get scipy to accept a ufunc instead of requiring a cost matrix.

I'll be honest: while I spent some time reading about the algorithm and the ideas behind it, and I think I understand it at a cursory level, I don't understand it deeply enough to properly debug it. However, given the kind of algorithm this is and where it comes from, I also expect it should be set-and-forget as long as no one changes any of the logic!?

matham commented 3 months ago

I opened an issue with scipy: https://github.com/scipy/scipy/issues/20729.

alessandrofelder commented 3 months ago

> However, given the kind of algorithm this is and where it comes from, I also expect it should be set-and-forget as long as no one changes any of the logic!?

This is a fair point, as long as no one changes any of the logic and numba doesn't evolve too weirdly. But even that should be fine. The hypothetical scenario that worries me is:

  1. we make some changes to the cell detection code that we are sure (!) will not change the cell matching
  2. the cell matching reports changes
  3. we start suspecting a bug in the cell matching

But we should question point 1. in that case. :laughing:

> For our example with 250k cells, this would have required a 200+ GB matrix for float32.

Writing down some related thoughts that might help down the line (this is probably not a priority for the core team right now).

matham commented 3 months ago

It's a fair worry that issues with the algorithm could lead to misjudging changes in e.g. cellfinder. Perhaps we should think of the algorithm's output as another input to consider when looking at changes in cell counting, rather than insisting the histogram must look a certain way!?

> dask array

The issue there is that you'd still need a large hard drive just for this computation, and it would probably take a while to compute and write the cost data to disk. But even then, given that the algorithm needs to constantly query the cost at different indices, having the CPU look each value up on a hard drive (even a fast SSD) would significantly reduce performance compared to in-memory access. I don't think it'd be fast enough!?

> Optimization

I added the zero-distance optimization you mentioned to the original PR!

alessandrofelder commented 3 months ago

I think with the improvements made in #74 after this was opened, we can close this as not planned (for now). Our cell matching is now tailored to BrainGlobe's needs (it is fast, it allows "pre-matching", and it is tested). We can revisit this if the scipy issue is addressed, if we like.