brainglobe / brainglobe-utils

Shared general purpose tools for the BrainGlobe project
MIT License

[Feature] compare/replace cell matching code with `scipy.optimize` implementation #77

Closed by alessandrofelder 3 months ago

alessandrofelder commented 3 months ago

**Is your feature request related to a problem? Please describe.** The core developer team is not confident we'll be able to maintain the cell matching code long-term (mainly because I don't feel like I entirely understand each line :sweat_smile: !)

**Describe the solution you'd like** Unless it's a lot less performant, we should consider replacing the core logic of our cell matching functionality with `scipy.optimize.linear_sum_assignment` to reduce the maintenance burden. This is not exactly the same algorithm, but it is related and useful for the same class of problems, AFAICT.

A solution like this would also fit well with the idea of BrainGlobe bringing the neuroanatomical context to the Python ecosystem, rather than re-implementing parts of it.


matham commented 3 months ago

I do agree that we shouldn't re-implement things and should reuse as much as possible.

The main reason I re-implemented it is that every existing implementation that solves this problem expects a cost matrix input of size NxM. For our example with 250k cells, this would have required a 200+ GB matrix for float32.

There may be a way to hack a fake numpy array that calls a function to compute the distance for each access to `cost[i, j]`, but I can't imagine that would be very performant. Perhaps we can try to get scipy to accept a ufunc instead of requiring a cost matrix.

I'll be honest: while I spent some time reading about the algorithm and the ideas behind it, and I think I understand it at a cursory level, I don't understand it deeply enough to properly debug it. However, given the kind of algorithm this is and where it comes from, I also expect it should be set-and-forget as long as no one changes any of the logic!?

matham commented 3 months ago

I opened an issue with scipy: https://github.com/scipy/scipy/issues/20729.

alessandrofelder commented 3 months ago

> However, given the kind of algorithm this is and where it comes from, I also expect it should be set-and-forget as long as no one changes any of the logic!?

This is a fair point, as long as no one changes any of the logic and numba doesn't evolve too weirdly. But even that should be fine. The hypothetical scenario that worries me is:

  1. we make some changes to the cell detection code that we are sure (!) will not change the cell matching
  2. the cell matching reports changes
  3. we start suspecting a bug in the cell matching

But we should question point 1. in that case. :laughing:

> For our example with 250k cells, this would have required a 200+ GB matrix for float32.

Writing down some related thoughts that might help down the line (this is probably not a priority for the core team right now).

matham commented 3 months ago

It's a fair worry that issues with the algorithm could lead to misjudging changes in e.g. cellfinder. Perhaps we should think of the algorithm's output as another input to consider when looking at changes in cell counting, rather than insisting the histogram must look a certain way!?

> dask array

The issue there is that you'd still need a large hard drive just for this computation, and it would probably take a while to compute and write the cost data to disk. But even then, given that the algorithm needs to constantly query the cost at different indices, having the CPU look each value up on a hard drive (even a fast SSD) would significantly reduce performance compared to in-memory access. I don't think it'd be fast enough!?

> Optimization

I added the zero-distance optimization you mentioned to the original PR!

alessandrofelder commented 3 months ago

I think with the improvements made in #74 after this was opened, we can close this as not planned (for now). Our cell matching is now tailored to BrainGlobe's needs (it is fast, it allows "pre-matching", and it is tested). We can revisit this if the scipy issue is addressed, if we like.