brainglobe / brainglobe-utils

Shared general purpose tools for the BrainGlobe project
MIT License

Add cell matching support #74

Closed · matham closed this 4 months ago

matham commented 5 months ago

Description

What is this PR

While working on the PyTorch changes, I wanted a way to compare how my changes affect cell detection, because the filtering in PyTorch is potentially slightly different. So I needed a way to compare two sets of cells by distance.

This PR adds support for this. It uses the Hungarian method to compute the best pairing between two groups of cells such that the total Euclidean distance across all pairs is minimized. I really hope someone hasn't already implemented this in one of the repos; I searched but didn't find anything 😅

The algorithm has time complexity O(n³), which is seemingly the fastest known for this problem. I adapted the code directly from Wikipedia.
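
As a toy illustration (this sketch is mine, not the PR's code): the assignment problem the Hungarian method solves in O(n³) can be brute-forced in O(n!) on a tiny example, which shows exactly what is being minimized.

from itertools import permutations

import numpy as np

# Two sets of cell positions; find the 1-1 pairing minimizing total distance.
set1 = np.array([[0, 0, 0], [12, 0, 0]], dtype=float)
set2 = np.array([[10, 0, 0], [22, 0, 0]], dtype=float)

best_cost, best_perm = min(
    (sum(np.linalg.norm(set1[i] - set2[j]) for i, j in enumerate(perm)), perm)
    for perm in permutations(range(len(set2)))
)
print(best_perm, best_cost)  # (0, 1) 20.0, i.e. pairs (0, 10) and (12, 22)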

References

None

How has this PR been tested?

I tested it by matching two sets of ~250k cells, one from the main branch and one from my working changes; it took about 1.5 hours.

I also added tests.

Is this a breaking change?

No.

Does this PR require an update to the documentation?

No.


As an actual example of how it works, I compared my changes to main. The matching paired 245,952 cells, with 8,426 left unpaired due to the unequal number of cells found between the two sets.

Looking at the histogram of distances between all the pairs, we get this image, which is pretty good IMO:

[image: dist]

I also plotted histograms in x, y, and z of the 8,426 unmatched cells to see if anything stood out, but nothing did:

[images: zdist, ydist, xdist]

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 91.68%. Comparing base (66e88a4) to head (acf0d63). Report is 9 commits behind head on main.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main      #74      +/-   ##
==========================================
+ Coverage   90.38%   91.68%   +1.30%
==========================================
  Files          35       35
  Lines        1393     1623     +230
==========================================
+ Hits         1259     1488     +229
- Misses        134      135       +1
```

| Flag | Coverage Δ |
|---|---|
| numba | 91.62% <100.00%> (?) |

Flags with carried forward coverage won't be shown. See the [carryforward flags documentation](https://docs.codecov.io/docs/carryforward-flags) for details.


adamltyson commented 5 months ago

Thanks for this @matham, much appreciated!

I will let @alessandrofelder review this, and these questions are for him really:

alessandrofelder commented 5 months ago

Thanks a lot @matham - I had a more simplistic version of something similar in a local script (no global optimisation or guaranteed 1-1 pairing, just brute-force finding the minimum distance from each point to any point in the other set) that I could/should have shared - apologies! This is definitely very useful functionality!!!

I am just wondering whether it's worth, at some point, replacing the core logic of match_cells with scipy.optimize.linear_sum_assignment, to reduce our long-term maintenance burden (I can't confidently say I understand each step well enough). I've opened an issue for this, and would appreciate your thoughts if you are willing to spend the time.

I am satisfied the implementation matches the Wikipedia one AFAICT, and testing it on local data seems to give reasonable outputs (see below), so I'll approve once I've done some of the comparisons Adam suggested.

I think this would benefit from some negative tests (checking e.g. that expected errors are thrown for invalid input) as sanity checks, and also to keep the coverage 90%+ (something @K-Meech has worked really hard to achieve) - I can work on this (https://github.com/brainglobe/brainglobe-utils/issues/76).

  • Should we document this in a separate part of the website (somewhere under development)? Otherwise I fear it will be forgotten by the next time it's needed.

  • Is this the best repo for this? This won't be used by any tools directly, and it adds a numba dependency. Ideally we'd have a dev tools repo, but considering how much time we've spent reducing the number of repos we have, on balance I think this is the correct place.

At today's developer meeting we decided that:

I'll track this in https://github.com/brainglobe/brainglobe.github.io/issues/187

We should use this to compare:

* Old cellfinder (pre numba)

* Previous release (pre @matham's optimisations)

* Main (with @matham's optimisations)

* Future (@matham's torch implementation)

I've started doing some comparisons today, but will finish them tomorrow with a clearer mind :)

alessandrofelder commented 5 months ago

I've compared cell candidate detection results before (old_cells) and after (new_cells) https://github.com/brainglobe/cellfinder/pull/398. I ran match_cells(old_cells, new_cells, threshold=5), which throws out matches that are more than 5 pixels apart.

I ran this on cells with z coordinates in slices 1000-1050 (somewhere in the middle) of a cFOS-stained brain. The "border" consists of slices 1000-1005 and 1045-1050 (I differentiate here because it's possible these cells have a close match e.g. in slice 999 or 1051).

The result:

There are 96835 old cells
There are 97033 new cells
Matching cells: 100%|██████████| 96835/96835 [08:41<00:00, 185.66cells/s]
There are 6042 old cells without a good match in new cells
There are 6240 new cells without a good match in old cells
88954/90793 of good (<5) matches are exact (distance==0).
There are 6042 old cells with no good match in new cells, 4646 of which are not on the border
There are 6240 new cells with no good match in old cells, 4814 of which are not on the border

The distribution of good matches with distance > 0 is:

[image: distances]

So far, so good.

But then I ran the script below:

from pathlib import Path
from brainglobe_utils.cells.cells import match_cells
from brainglobe_utils.IO.cells import get_cells, save_cells

if __name__ == "__main__":
    pre_optimisation_cells = get_cells(str(Path("/media/ceph-neuroinformatics/.../cells.xml")))
    post_optimisation_cells = get_cells(str(Path("/media/ceph-neuroinformatics/.../other_cells.xml")))

    # only consider cells in slices 1000 - 1050
    z_min, z_max = 1000, 1050
    old_cells = [cell for cell in pre_optimisation_cells if z_min < cell.z < z_max]
    new_cells = [cell for cell in post_optimisation_cells if z_min < cell.z < z_max]
    print(f"There are {len(old_cells)} old cells")
    print(f"There are {len(new_cells)} new cells")

    # minimize total euclidean distance for a 1-1 pairing between old and new.
    # pairings with distance > 5 are considered "bad"
    good_threshold = 5
    no_match_in_new, matching, no_match_in_old = match_cells(old_cells, new_cells, threshold=good_threshold)

    print(f"There are {len(no_match_in_new)} old cells without a good match in new cells")
    print(f"There are {len(no_match_in_old)} new cells without a good match in old cells")

    old_cells_with_no_match_in_new = [old_cells[i] for i in no_match_in_new]
    new_cells_with_no_match_in_old = [new_cells[i] for i in no_match_in_old]

    save_cells(new_cells_with_no_match_in_old, str(Path.home()/"no_match_in_old_2.xml"))
    save_cells(old_cells_with_no_match_in_new, str(Path.home()/"no_match_in_new_2.xml"))

to visualise "bad" matches. Visualising the bad matches in napari shows that most of the yellow points (no_match_in_old) overlap perfectly (appearing orange) with an equivalent red point (no_match_in_new):

[image: napari view of the unmatched points]

I've probably done something extremely silly in the script, or misunderstood how this should work, but I'm not comfortable merging this until I know what's going on :thinking: - pointers/thoughts about how to interpret this appreciated!

matham commented 5 months ago

Your code seems correct. Can you share the cell XMLs so I can look into it? Perhaps there's something wrong with the threshold computation. But I also wonder if the axes were somehow flipped during visualization.

matham commented 5 months ago

Also, here's the scipy code for this - it's quite similar.

alessandrofelder commented 5 months ago

Can you share the cells xmls so I can look at it?

Slight adaptation of the script above (avoiding the cell filtering by z coordinate):

from pathlib import Path
import napari
from brainglobe_utils.cells.cells import match_cells, to_numpy_pos
from brainglobe_utils.IO.cells import get_cells, save_cells

if __name__ == "__main__":
    old_cells = get_cells(str(Path.home()/"cells-z-1000-1050.xml"))
    new_cells = get_cells(str(Path.home()/"other-cells-z-1000-1050.xml"))

    good_threshold = 5
    no_match_in_new, matching, no_match_in_old = match_cells(old_cells, new_cells, threshold=good_threshold)

    print(f"There are {len(no_match_in_new)} old cells without a good match in new cells")
    print(f"There are {len(no_match_in_old)} new cells without a good match in old cells")

    old_cells_with_no_match_in_new = [old_cells[i] for i in no_match_in_new]
    new_cells_with_no_match_in_old = [new_cells[i] for i in no_match_in_old]

    save_cells(new_cells_with_no_match_in_old, str(Path.home()/"no_match_in_old-z-1000-1050.xml"))
    save_cells(old_cells_with_no_match_in_new, str(Path.home()/"no_match_in_new-z-1000-1050.xml"))

    viewer = napari.Viewer()

    viewer.add_points(to_numpy_pos(new_cells_with_no_match_in_old), edge_color="yellow", face_color="yellow", symbol="ring", opacity=0.5, name="no match in old")
    viewer.add_points(to_numpy_pos(old_cells_with_no_match_in_new), edge_color="red", face_color="red", symbol="ring", opacity=0.5, name="no match in new")
    napari.run()

Should be reproducible with the attached files saved to Path.home(), with the extension changed to .xml:

cells-z-1000-1050.txt other-cells-z-1000-1050.txt

matham commented 5 months ago

I think I figured out the issue: it has to do with us finding the global minimum and then removing pairs above the threshold. This results in the following problem:

Say you have 4 points on a line. Set 1 is {0, 12}, set 2 is {10, 22}. Without a threshold, the matching is (0, 10), (12, 22), because that results in the minimum global cost of 20. If we then apply a threshold of 5 after this matching, we have no matches at all, as both pairs are above 5. Yet, looking at the unmatched cells, there's a potential pair (10, 12), at distance 2, that never made it into the match.

This is exactly the issue you noticed: the pairs of cells that are right on top of each other but not in the match are each actually matched with other cells to minimize the global cost. Removing pairs above the threshold then leaves unmatched sets containing cells that sit right on top of each other.

This can be tested with the following code:

import numpy as np

from brainglobe_utils.cells.cells import Cell, match_cells

cells1 = [
    Cell((0, 0, 0), Cell.UNKNOWN),
    Cell((12, 0, 0), Cell.UNKNOWN),
]
cells2 = [
    Cell((10, 0, 0), Cell.UNKNOWN),
    Cell((22, 0, 0), Cell.UNKNOWN),
]

missing_c1, good_matches, missing_c2 = match_cells(cells1, cells2, threshold=np.inf)
assert not missing_c1
assert not missing_c2
assert good_matches == [[0, 0], [1, 1]]

missing_c1, good_matches, missing_c2 = match_cells(cells1, cells2, threshold=5)
# before the fix, the following asserts held:
# assert missing_c1 == [0, 1]
# assert missing_c2 == [0, 1]
# assert not good_matches
# with the final fixes, the following is correct:
assert missing_c1 == [0, ]
assert missing_c2 == [1, ]
assert good_matches == [[1, 0]]

I changed it to account for the threshold during the matching, so this issue doesn't occur. I haven't yet tested it on my whole brain to be sure, but will do soon.
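
A minimal sketch of the idea (my reading of the description, not the PR's exact code): distances beyond the threshold are clamped to a single "too far" cost, so during the optimization a far pairing can never beat keeping a close pair intact.

import numpy as np

def clamped_cost(p1, p2, threshold):
    # All pairings beyond the threshold cost the same, so the optimizer
    # gains nothing by trading a close pair for a far one.
    d = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    return min(d, threshold)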

matham commented 5 months ago

Also, I added tests so there's full coverage, but they need to be run with NUMBA_DISABLE_JIT=1. I'm not quite sure how to do that in the CI, because you definitely want to run the tests both with and without JIT disabled to catch any issues. So for now I haven't changed the runner to test with it disabled.

alessandrofelder commented 5 months ago

But they need to be run with NUMBA_DISABLE_JIT=1. I'm not quite sure how to do that in the CI, because you definitely want to run the tests both with and without JIT disabled to catch any issues.

I'd naively suggest implementing the same solution we have in cellfinder here, i.e. running both.

This is exactly the issue you noticed: the pairs of cells that are right on top of each other but not in the match are each actually matched with other cells to minimize the global cost. Removing pairs above the threshold then leaves unmatched sets containing cells that sit right on top of each other.

I think I follow your reasoning and agree with it, but I can still reproduce the above problem (lots of orange points) locally :thinking_face: so I am confused. I get a much higher number of "bad" matches this time though (~40k instead of ~6k), FWIW.

Let me know how your "real-life" testing goes :)

matham commented 5 months ago

So after more looking, there are multiple things going on - some of which are not bugs, but one of which was...

There was a bug where a variable was not properly reset. This was fixed in the first commit after your last comment. When translating vector<bool> in_Z(W + 1); from the original algorithm, I didn't realize that C++ initializes the vector to zero (I assume it does) only at construction. But in the source from which the wiki gets the algorithm, it clearly does this reset to false at every iteration (vector<char> used(m+1, false);).


Sadly, with this fix the algorithm became a lot slower, in particular when using a small threshold (see below for some reasons behind this), which is quite in line with it being O(N³). So I also added an optional initial step (on by default) that extracts all zero-distance pairs before running the optimization step.
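
A sketch of what such a pre-step could look like (assumed logic; extract_exact_matches is a hypothetical name, not the PR's API): exact-position pairs are pulled out with a hash map in roughly O(N), leaving a much smaller problem for the O(N³) optimizer.

def extract_exact_matches(cells1, cells2):
    # Map each position in cells2 to its index (duplicate positions keep
    # the last index; good enough for a sketch).
    pos_to_idx = {(c.x, c.y, c.z): j for j, c in enumerate(cells2)}
    exact, rest1, used = [], [], set()
    for i, c in enumerate(cells1):
        j = pos_to_idx.get((c.x, c.y, c.z))
        if j is not None and j not in used:
            exact.append((i, j))  # zero-distance pair, matched for free
            used.add(j)
        else:
            rest1.append(i)  # left for the optimization step
    rest2 = [j for j in range(len(cells2)) if j not in used]
    return exact, rest1, rest2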

After that, I also added the option to use scipy as the optimization backend. This is only practical when we do the zero-distance extraction, because scipy cannot handle the full dataset due to memory requirements. However, with scipy we cannot use the threshold during the matching, because scipy doesn't take it as an option, so we can only apply it afterwards.
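
For reference, a sketch of the scipy route (how I understand it, not the PR's code): cdist builds the full N x M cost matrix - which is why it is memory-hungry on large cell sets - and the threshold can only be applied after the assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_with_scipy(pos1, pos2, threshold):
    # pos1: (N, 3) array, pos2: (M, 3) array of cell positions.
    cost = cdist(pos1, pos2)  # full N x M Euclidean distance matrix
    rows, cols = linear_sum_assignment(cost)
    keep = cost[rows, cols] <= threshold  # threshold applied post-hoc only
    return list(zip(rows[keep], cols[keep]))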

With that, I compared the matching with and without the threshold, and with and without scipy. I found that without the threshold, the total cost of the matched pairs was the same - which is good, as that is what we are minimizing - although the matched pairs were not identical, which makes sense.

These images, on your two cell sets, show the result:

[images: cells_vs_others_scipy-True_threshold-5, cells_vs_others_scipy-False_threshold-5, cells_vs_others_scipy-True_threshold-inf, cells_vs_others_scipy-False_threshold-inf]


Besides all this, there are complexities with this algorithm that may explain some of what you saw. This algorithm is perhaps not precisely what we want: it minimizes the total global cost, but we don't really care about the global cost - we want to minimize all the local costs, even if that results in a larger global cost.

Some examples of how this may show up in the algorithm:

  1. Consider 2 sets of points in 1D: (0, 10) and (10, 20). The match pairs (0, 10), (10, 20) vs. (0, 20), (10, 10) both have a global cost of 20, so a global matching may select either. But for us, with a local matching, we'd prefer the latter, which contains the exact (10, 10) pair.

    Using a threshold somewhat accomplishes a more local minimization, as points further away are all considered equally far.

    But scipy doesn't support a threshold, and in my implementation using a threshold, especially a smallish one, makes it much, much slower. I speculate this is because far-away points are not as easily eliminated when they are all equally costly, so the algorithm has to iterate over the full set more often (I verified that the number of iterations increases).

  2. You observed that the number of bad matches (i.e. outside the threshold) increased with the previous changes. I also observed this in my dataset. This is because (besides the bug, potentially) I changed the cost function from sum(square(distance[i])) to sqrt(sum(square(distance[i]))) - i.e. from squared Euclidean distance to true Euclidean distance per pair - so I could use the threshold during the matching. I previously skipped the sqrt as an optimization.

    But why would that change the matching? Shouldn't the matching stay the same? The reason (I think) is that we're doing a global optimization, not a local one, and the sqrt scaling is non-linear: without the sqrt, everything gets pushed further apart by a power of 2.

    So, assuming most points in the data set are close to each other, the algorithm, to minimize the global distance, has to work really hard to keep as many of them paired as closely as possible (i.e. doing a better local matching). It can afford, e.g., to leave a pair 1 voxel apart unmatched if that helps a pair that would otherwise be 2 apart - because the latter costs 4 vs. just 1!

    On the other hand, now that I am taking the sqrt, the algorithm has less incentive to focus as much on the local minimization, because distance 1 vs. 2 is less significant and can be made up for by better matches of points further apart. So maybe there are more pairs slightly further apart, and when applying the threshold we suddenly have more points outside it.

    This is handwavy, but if it weren't the case (that the globalness of the algorithm causes this), changing the cost to a sqrt shouldn't change the matching. A quick check follows this list.
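
To make this concrete with the 1D example from point 1 (a quick check of mine, not code from the PR): both candidate pairings tie on summed plain distance but differ on summed squared distance, so the choice of cost function alone can change which matching is globally optimal.

# The two candidate pairings from point 1: set 1 = (0, 10), set 2 = (10, 20).
pairings = {"A": [(0, 10), (10, 20)], "B": [(0, 20), (10, 10)]}

for name, pairs in pairings.items():
    dists = [abs(p - q) for p, q in pairs]
    print(name, sum(dists), sum(d * d for d in dists))
# prints: A 20 200
#         B 20 400

With squared costs the optimizer strictly prefers A, whose worst pair is only 10 apart (vs. 20 in B) - consistent with fewer pairs landing outside the threshold before the sqrt change.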


For comparison, here's how the previous histogram looks if we don't take the sqrt, using our implementation. You can see that with the threshold it is the same. But without the threshold, if you look at e.g. the second bar from the left, not taking the sqrt results in a lot more matches that are close to each other - which seems to confirm the above analysis.

[images: cells_vs_others_scipy-False_threshold-inf_no-sqrt, cells_vs_others_scipy-False_threshold-25_no-sqrt]

So, I'm not sure what you think about the code now. But my sense is that after first extracting all the zero-distance pairs, for whatever remains we care most about the overall histogram, because that gives us a sense of how bad the changes in e.g. cellfinder were. And while we would prefer a local optimization algorithm (one that e.g. always prioritizes zero-distance pairs) over a global one, I don't know of such an algorithm. So perhaps the global optimization algorithm is good enough?


The following are points to consider about the API:

  1. The threshold, as part of the optimization stage, only works for our implementation, not scipy - and perhaps scipy will never support it. Should I remove our support for it? Should I make it optional, so the threshold can be applied only after the matching, as with scipy? Does that make the API too complicated?
  2. Should whether to take the sqrt be left as a parameter, so we can tune the matching tightness?
  3. I warned in the docs that using a small threshold makes my implementation a lot slower, and that the scipy backend requires a lot of memory and so should be used with zero-distance pair extraction. But does all this make the API too complicated? Is this getting out of hand? 😅
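
For concreteness, a hypothetical shape the API could take if all of the above knobs were kept (parameter names are illustrative only, not the actual signature):

import numpy as np

def match_cells(
    cells,
    other,
    threshold=np.inf,  # point 1: applied during matching (our backend only)
    pre_match=True,    # extract zero-distance pairs before optimizing
    use_scipy=False,   # scipy backend; threshold then applied only afterwards
    use_sqrt=True,     # point 2: true Euclidean vs. squared-distance cost
):
    """Return (unmatched indices in cells, matches, unmatched indices in other)."""
    ...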
matham commented 4 months ago

Thanks for the review!

I'm with you on the API and I can make the changes (remove scipy, keep sqrt).

About adding the cells you shared as a test to compare scipy to ours: each XML file is 12MB. Is that too big to add to the repo? Everyone cloning it will have to download them.

I was kind of running into this issue on the PyTorch branch, because I added images to make sure the 2D/3D filtering is similar, which easily added a few MBs. I also have 3 images of ~25MB each from a very bright brain that are super helpful for regression testing the full process, and helped me figure out where there wasn't parity. But that definitely seemed too big to include in the repo. I'm not sure if you have encountered/considered this before!?

alessandrofelder commented 4 months ago

too big to include in the repo. I'm not sure if you have encountered/considered this before!?

Good point. I will update https://github.com/brainglobe/cellfinder/issues/282 with our long-term plan for this (and other BrainGlobe repositories). Short version: we already have too much data in this repo that should get moved to the GIN platform, downloaded with pooch and cached on CI.

I'll put the XML files on GIN now, and I'm happy to do the same with your images if you could temporarily share them via Dropbox/OneDrive/Google Drive/...?

alessandrofelder commented 4 months ago

should get moved to the GIN platform, downloaded with pooch

I've made a sketch PR to your fork, @matham, for the regression test on the XML files (which are now on GIN), showing how I imagine this working - hope it helps!

alessandrofelder commented 4 months ago

I can make the changes (remove scipy, keep sqrt).

Thank you!! :tada:

About adding the cells you shared as a test to compare scipy to ours.

I can add the test itself too, if you like?

matham commented 4 months ago

Ok, removed scipy as an option and filled in the scipy test. Thanks for the pooch PR, I've never used it so that was helpful!

I can add the test itself too, if you like?

Adding tests that pass is the most fun part 😁

alessandrofelder commented 4 months ago

https://github.com/matham/brainglobe-utils/pull/2 may help add more of the tests discussed above. The data you shared with us is on GIN now :)

alessandrofelder commented 4 months ago


Given that we now have the discussed regression test, and this data is meant for wider use than just cell matching, I think I can merge this now. Thanks very much, @matham ! :tada: