brainglobe / cellfinder-core

Standalone cellfinder cell detection algorithm
https://brainglobe.info/documentation/cellfinder/index.html
BSD 3-Clause "New" or "Revised" License

cell candidate speed issues #170

Closed adamltyson closed 1 year ago

adamltyson commented 1 year ago

Although the memory issues have been fixed, there have been reports (e.g. on image.sc) that cellfinder-core is now much slower than it used to be.

codezbq commented 1 year ago

Issue detail: in the cell detection stage, the time per iteration keeps increasing. At this rate, it will never finish:

Processing planes: 3%|▎ | 177/5749 [23:32<102:34:01, 66.27s/it]
Processing planes: 3%|▎ | 178/5749 [26:08<144:01:18, 93.07s/it]
Processing planes: 3%|▎ | 179/5749 [29:22<190:47:38, 123.31s/it]
Processing planes: 3%|▎ | 180/5749 [33:06<237:28:34, 153.51s/it]
Processing planes: 3%|▎ | 181/5749 [36:33<262:20:48, 169.62s/it]
Processing planes: 3%|▎ | 182/5749 [40:42<299:24:34, 193.62s/it]
Processing planes: 3%|▎ | 183/5749 [45:55<354:39:27, 229.39s/it]
Processing planes: 3%|▎ | 184/5749 [51:31<403:44:12, 261.18s/it]
Processing planes: 3%|▎ | 185/5749 [57:39<453:13:32, 293.24s/it]
Processing planes: 3%|▎ | 186/5749 [1:04:05<496:27:10, 321.27s/it]
Processing planes: 3%|▎ | 187/5749 [1:11:18<547:57:29, 354.67s/it]
Processing planes: 3%|▎ | 188/5749 [1:19:00<597:38:12, 386.89s/it]
Processing planes: 3%|▎ | 189/5749 [1:26:59<640:20:02, 414.60s/it]
Processing planes: 3%|▎ | 190/5749 [1:35:26<683:01:44, 442.33s/it]

adamltyson commented 1 year ago

Hi @codezbq which version of cellfinder-core are you using?

@alessandrofelder we benchmarked this new version for speed, right? I was using it yesterday and it does seem quite a bit slower.

codezbq commented 1 year ago

Hi @codezbq which version of cellfinder-core are you using?

@alessandrofelder we benchmarked this new version for speed, right? I was using it yesterday and it does seem quite a bit slower.

The newest version, I suppose: [screenshot of installed version]

codezbq commented 1 year ago

It's been stuck at plane 314 for a whole day. Maybe it's a problem with the numba JIT? It turns out it's not caused by anything accumulating: the time cost of each iteration depends on the current plane's content. When signal appears, the time cost gets really crazy. However, the Cython version (e.g. 0.2.8) works fine. I wonder whether the underlying design of this stage differs between the Cython version and the current numba version, and whether that difference causes a slowdown the numba JIT can't make up for. Maybe we should try refactoring it with the CUDA JIT?

adamltyson commented 1 year ago

Maybe we should try refactoring it with the CUDA JIT?

Good idea; we already rely on CUDA for the classification, so it shouldn't make installation any more difficult. I assume numba could be configured to fall back to the CPU if it can't detect a GPU. @alessandrofelder?
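
A minimal sketch of the kind of dispatch I mean (not cellfinder-core code; all function names here are made up), using numba's cuda.is_available() to pick between a CUDA kernel and the existing CPU path:

from numba import cuda, njit


@njit
def _mark_bright_cpu(plane, threshold, mark_value):
    # CPU fallback: mark pixels brighter than the threshold, in place.
    for i in range(plane.shape[0]):
        for j in range(plane.shape[1]):
            if plane[i, j] > threshold:
                plane[i, j] = mark_value


@cuda.jit
def _mark_bright_gpu(plane, threshold, mark_value):
    # CUDA kernel: one thread per pixel.
    i, j = cuda.grid(2)
    if i < plane.shape[0] and j < plane.shape[1] and plane[i, j] > threshold:
        plane[i, j] = mark_value


def mark_bright(plane, threshold, mark_value):
    # Use the GPU when numba can see one, otherwise fall back to the CPU.
    if cuda.is_available():
        d_plane = cuda.to_device(plane)
        threads = (16, 16)
        blocks = (
            (plane.shape[0] + threads[0] - 1) // threads[0],
            (plane.shape[1] + threads[1] - 1) // threads[1],
        )
        _mark_bright_gpu[blocks, threads](d_plane, threshold, mark_value)
        d_plane.copy_to_host(plane)
    else:
        _mark_bright_cpu(plane, threshold, mark_value)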

alessandrofelder commented 1 year ago

Thanks @codezbq and @adamltyson for flagging this and for suggesting possible ways forward. I don't know off the top of my head how easy it would be to flexibly switch between GPU and CPU depending on the hardware it's running on.

Unfortunately we only benchmarked individual (key) functions, to decide whether numba was performant enough compared to Cython and therefore worth pursuing, not a real-life example, which in retrospect seems naive.

I will see whether I can reproduce, investigate and report back here.
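
For anyone who wants to try, here's a rough reproduction sketch (the shapes, blob values and voxel sizes are made up) using the documented cellfinder_core.main.main entry point; note it runs the full pipeline, so it will download a classification model on first use:

import time

import numpy as np
from cellfinder_core.main import main as cellfinder_run

shape = (30, 512, 512)  # small synthetic stack (z, y, x)
rng = np.random.default_rng(0)
background = rng.integers(0, 200, size=shape, dtype=np.uint16)

# One stack with noise only, one with bright blobs standing in for somata.
empty = background.copy()
with_signal = background.copy()
for _ in range(200):
    z, y, x = (int(rng.integers(5, s - 5)) for s in shape)
    with_signal[z - 2 : z + 2, y - 3 : y + 3, x - 3 : x + 3] = 60000

voxel_sizes = (5, 2, 2)  # (z, y, x), microns

for name, stack in [("no signal", empty), ("with signal", with_signal)]:
    start = time.time()
    cellfinder_run(stack, background, voxel_sizes)
    print(f"{name}: {time.time() - start:.1f} s")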

codezbq commented 1 year ago

Great, I'd like to look into the code and try to help out with the project.

And here's something that seems peculiar to me:

In the ball filter, the filtering step looks like this:


@njit()
def _walk(
    max_height: int,
    max_width: int,
    tile_step_width: int,
    tile_step_height: int,
    inside_brain_tiles: np.ndarray,
    volume: np.ndarray,
    kernel: np.ndarray,
    ball_radius: int,
    middle_z: int,
    overlap_threshold: float,
    THRESHOLD_VALUE: int,
    SOMA_CENTRE_VALUE: int,
) -> None:
    """
    Scan through *volume*, and mark pixels where there are enough surrounding
    pixels with high enough intensity.

    The surrounding area is defined by the *kernel*.

    Parameters
    ----------
    max_height, max_width :
        Maximum offsets for the ball filter.
    tile_step_width, tile_step_height :
        Width and height of the tiles used to check whether a region is
        inside the brain.
    inside_brain_tiles :
        Array containing information on whether a tile is inside the brain
        or not. Tiles outside the brain are skipped.
    volume :
        3D array containing the plane-filtered data.
    kernel :
        3D array defining the ball-shaped neighbourhood to check.
    ball_radius :
        Radius of the ball in the xy plane.
    middle_z :
        Index of the middle plane of *volume* along z, where soma centres
        are marked.
    overlap_threshold :
        Fraction of the kernel that must overlap bright pixels for a soma
        centre to be marked.
    THRESHOLD_VALUE :
        Value marking bright (above-threshold) pixels in *volume*.
    SOMA_CENTRE_VALUE :
        Value that is used to mark soma centre pixels in *volume*.
    Notes
    -----
    Warning: modifies volume in place!
    """
    for y in range(max_height):
        for x in range(max_width):
            ball_centre_x = x + ball_radius
            ball_centre_y = y + ball_radius
            if _is_tile_to_check(
                ball_centre_x,
                ball_centre_y,
                middle_z,
                tile_step_width,
                tile_step_height,
                inside_brain_tiles,
            ):
                cube = volume[
                    x : x + kernel.shape[0],
                    y : y + kernel.shape[1],
                    :,
                ]
                if _cube_overlaps(
                    cube,
                    overlap_threshold,
                    THRESHOLD_VALUE,
                    kernel,
                ):
                    volume[
                        ball_centre_x, ball_centre_y, middle_z
                    ] = SOMA_CENTRE_VALUE

Maybe there's something I haven't followed, but is it necessary to modify volume in place during this process? I assume this filtering operation relies on the original values of the data, so changing those values during the operation could lead to inaccurate results; we may need to ensure data consistency. Especially if we accelerate it with CUDA or similar, the in-place changes could cause write conflicts between parallel threads. For parallelization and performance, keeping a separate input and output can give a significant speedup in processing time.

For example, I've seen other filtering processes like this:

from cupyx import jit

# V_VOLUME is assumed here; it is not defined in the original snippet.
V_VOLUME = 1  # mask value marking voxels inside the volume


@jit.rawkernel()
def laplace(Uin, Uout, M, nc, mc, pc):
    """CUDA kernel for the Laplacian inside the volume.

    The 3 arrays Uin, Uout, M (mask) have size (nc, mc, pc).
    """

    # Current voxel
    i = 1 + jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    j = 1 + jit.blockIdx.y * jit.blockDim.y + jit.threadIdx.y
    k = 1 + jit.blockIdx.z * jit.blockDim.z + jit.threadIdx.z

    if (1 <= i) and (1 <= j) and (1 <= k) and (i <= nc - 2) and (j <= mc - 2) and (k <= pc - 2):
        m = M[i, j, k]
        if m == V_VOLUME:
            Uout[i, j, k] = 1./6 * (
                Uin[i - 1, j, k] +
                Uin[i + 1, j, k] +
                Uin[i, j - 1, k] +
                Uin[i, j + 1, k] +
                Uin[i, j, k - 1] +
                Uin[i, j, k + 1])

This uses a separate Uin and Uout for the filtering process.

adamltyson commented 1 year ago

@codezbq I would need to check this specific function to be sure. Generally though, if an operation is carried out in place it's to save memory; otherwise we'd need both an input and an output plane in memory at once. That's the reason it's currently this way, though I'm sure it could be improved.
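
For illustration, an out-of-place variant (a hypothetical sketch, not the actual cellfinder-core code) buys consistency at the price of a second volume-sized array:

import numpy as np


def mark_out_of_place(volume: np.ndarray, threshold: int, mark: int) -> np.ndarray:
    # Hypothetical out-of-place variant: reads come from the untouched
    # input and writes go to a separate output, so the filter never
    # observes its own modifications. The cost is a second volume-sized
    # array in memory.
    out = volume.copy()
    out[volume > threshold] = mark  # decisions use original values only
    return out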

dstansby commented 1 year ago

Could you share the size of data you're working with, the data type, and any parameters you're setting to non-default values when running cellfinder-core?

On data that has shape (100, 3992, 5680) and ~~uint8~~ uint16 dtype, I get pretty steady performance of 1 plane processed every 1 - 2 seconds across the entire 100 planes.

adamltyson commented 1 year ago

@dstansby what about 16-bit data (which is more common in microscopy)?

codezbq commented 1 year ago

@dstansby Mine is (12601, 8881) and there are 5000 planes, at 16-bit depth. And yeah, the first 200-300 planes go fine, but then the time per plane rockets up. The only options I set were -v 3 1 1 --orientation asr --atlas allen_mouse_10um. However, I've successfully got results with a much older version of cellfinder. As for 8-bit: I thought we could only use 16-bit, according to the forum and the documentation, right?

dstansby commented 1 year ago

I mistyped; double-checking, I do have uint16 data. Good to know that we need to test at least ~300 planes to reproduce this.

codezbq commented 1 year ago

I mistyped; double-checking, I do have uint16 data. Good to know that we need to test at least ~300 planes to reproduce this.

@dstansby Sorry, I may not have made it clear. Because the data covers the whole brain, the first ~300 planes don't contain much signal. After that, the planes with signal come, and the time per plane rockets up. When I tested by starting from my 300th plane, every plane took a huge amount of time right from the beginning, so there's probably nothing accumulating that drags down the speed.

adamltyson commented 1 year ago

@codezbq that's great to hear (for us, not so great for you!). I wonder if we can detect this somehow and provide a warning? I also wonder if this is something that can be improved by changing the cell detection parameters?
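
As a hypothetical sketch of such a warning (nothing like this exists in cellfinder-core today), one could track per-plane times and warn when they keep climbing:

import time
import warnings


class PlaneTimer:
    """Hypothetical helper: warn if per-plane time keeps climbing."""

    def __init__(self, window: int = 10, factor: float = 5.0):
        self.times = []       # seconds taken by each plane so far
        self.window = window  # how many recent planes to look at
        self.factor = factor  # how much slower than the first plane is "bad"
        self._start = 0.0

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.times.append(time.perf_counter() - self._start)
        recent = self.times[-self.window:]
        if len(self.times) > self.window and min(recent) > self.factor * self.times[0]:
            warnings.warn(
                "Per-plane processing time is growing rapidly; "
                "consider adjusting the cell detection parameters."
            )


# usage: timer = PlaneTimer(); then, for each plane: `with timer: process(plane)`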

alessandrofelder commented 1 year ago

Thanks for keeping us up-to-date @codezbq. Like Adam, I am wondering whether you can exclude those spots by playing with the cell detection parameters, especially if they are a different size to the neurons you are trying to detect.
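
For example, the size-related knobs in the Python API look like this (parameter names as documented for cellfinder-core; the values shown are the defaults, so adjust to your data):

import numpy as np
from cellfinder_core.main import main as cellfinder_run

# Placeholder stacks; substitute your real signal and background channels.
signal_array = np.zeros((20, 100, 100), dtype=np.uint16)
background_array = np.zeros((20, 100, 100), dtype=np.uint16)

detected_cells = cellfinder_run(
    signal_array,
    background_array,
    voxel_sizes=(5, 2, 2),      # plane spacing and pixel size (z, y, x), microns
    soma_diameter=16,           # expected in-plane soma diameter, microns
    ball_xy_size=6,             # ball filter size in xy, microns
    ball_z_size=15,             # ball filter size in z, microns
    ball_overlap_fraction=0.6,  # fraction of the ball that must be filled
)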

codezbq commented 1 year ago

Oh hey. A month ago I followed the reply in the Image.sc topic [RAM overload crashing with "BrokenPipeError: [Errno 32] Broken pipe"] and installed version 0.4.20 to test it; I didn't realize it wasn't the newest version. Now I've updated to the newest version, 0.4.21, and the problem is solved. So the tiny spots don't matter: the timing and everything else go well, even though there are so many tiny spot artifacts in our data. Anyway, no problem running now. The remaining task is to adjust the parameters to get the best result.

adamltyson commented 1 year ago

That's weird; 0.4.21 shouldn't be any different in terms of performance, it was just released to fix an unrelated bug. Going to close this issue, but please let us know if you have any further problems @codezbq.