MouseLand / cellpose

a generalist algorithm for cellular segmentation with human-in-the-loop capabilities
https://www.cellpose.org/
BSD 3-Clause "New" or "Revised" License

Increasing Cellpose throughput #447

Open VarIr opened 2 years ago

VarIr commented 2 years ago

Dear Cellpose team, thanks for your great tool.

We love the quality of segmentation we obtain by using Cellpose. Currently, we are trying to figure out whether we can increase throughput to saturate our GPUs, without having to run multiple instances of Cellpose.

We use roughly the following code:

from cellpose import io, models
model = models.Cellpose(gpu=True, model_type='nuclei')
images = [io.imread(x) for x in paths_of_2D_images]
_ = model.eval(images, ..., batch_size=10000)

Apparently, the batch_size parameter does not change throughput (and for 10000 I'd actually expect OOM errors). If I understand Cellpose's execution flow correctly (e.g., here), when a list of images is passed, model.eval() is called recursively and each image in the list is processed individually (the batch_size variable is not used in the recursion anchor).
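
For illustration, the pattern looks roughly like this (a simplified sketch, not cellpose's actual code; _segment_single is a made-up placeholder):

class Model:
    def eval(self, x, batch_size=8, **kwargs):
        if isinstance(x, list):
            # recursion anchor: each image is segmented on its own,
            # so batch_size never groups images across the list
            return [self.eval(xi, batch_size=batch_size, **kwargs) for xi in x]
        return self._segment_single(x, batch_size=batch_size, **kwargs)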

  1. Is it possible to perform segmentation on whole minibatches simultaneously, instead of iterating through individual images? If so, this should result in significantly better performance.

  2. Orthogonally, in other parts of the code, I believe performance can be improved by simpler means. For example, UnetModel._run_tiled() iterates over image tiles extracted by fancy indexing, which creates copies of the data instead of views. The following should give slightly faster execution and a lower memory footprint (a slice instead of np.arange):

    for k in range(niter):
        irange = slice(batch_size*k, batch_size*k + batch_size)
        y0, style = self.network(IMG[irange], return_conv=return_conv)
        y[irange] = y0

    (I can create a PR for this, if there is interest).
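
A quick standalone check of the copy-vs-view behavior behind (2) (the array shape here is made up for illustration):

import numpy as np

IMG = np.zeros((64, 224, 224, 2), dtype=np.float32)  # hypothetical tile stack

batch = IMG[np.arange(0, 8)]   # fancy indexing: copies the selected tiles
assert not np.shares_memory(batch, IMG)

batch = IMG[slice(0, 8)]       # basic slicing: a view into the same buffer
assert np.shares_memory(batch, IMG)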

kevinjohncutler commented 2 years ago

@VarIr that's great! Have you quantified the performance difference with the simple change (2) that you proposed?

VarIr commented 2 years ago

Not yet, this just caught my eye while reading the code. I'll have a closer look and actually implement & benchmark this.

VarIr commented 2 years ago

On closer look, UnetModel.eval() seems broken. For example, there are undefined variables used here (channel_axis, z_axis) and here (nolist), so the function raises a NameError. I suppose vanilla UnetModel is currently not used anywhere in cellpose; only CellposeModel, which inherits from UnetModel and overrides eval().

I'm wondering whether UnetModel.eval() could be made an abstract method, since it's apparently unused?
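
A minimal sketch of what that could look like (illustrative only, not cellpose's actual class layout):

import abc

class UnetModel(abc.ABC):
    @abc.abstractmethod
    def eval(self, x, **kwargs):
        """Subclasses must provide the actual segmentation logic."""

class CellposeModel(UnetModel):
    def eval(self, x, **kwargs):
        ...  # actual segmentation logic lives here

Instantiating UnetModel directly would then raise a TypeError, which makes the dead code path explicit instead of failing later with a NameError.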

kevinjohncutler commented 2 years ago

@VarIr You are correct, I hadn't noticed that before. Not sure if there would be any benefit to it, but perhaps. Does this get in the way of the optimizations you described earlier?

VarIr commented 2 years ago

Imo, the benefit would be reduced maintenance effort in cellpose development. Optimization proposal (2) would then be obsolete, and users wouldn't see any difference anyway. Proposal (1) about segmenting minibatches is unrelated to that; I think it is also more important and could give substantial speed-ups.

carsen-stringer commented 2 years ago

thanks @VarIr, if you see a speed-up please make a PR

greenmossball commented 1 year ago

I'm experiencing the same issue with the batch size. I have a GPU with a lot of memory and would love to load batches rather than going one by one.

carsen-stringer commented 1 year ago

closing due to inactivity

sophiamaedler commented 1 year ago

Are there any plans to reopen work on this issue?

We are also planning on using cellpose for segmenting large time-course datasets (small image tiles, but lots of them) and are seeing very slow performance since we are not saturating our GPU.

sophiamaedler commented 1 year ago

(Quoting @VarIr's original post above in full.)

I quickly tested (2) and it gave me a 5x speed-up on the same dataset. In case you want help implementing it, @VarIr, I'd be happy to help.

carsen-stringer commented 1 year ago

thanks for testing fix (2)! Would be happy to accept a pull request for it. For (1) we'll have to think about how best to do this; I agree it would be faster, but it would require some refactoring.

VarIr commented 1 year ago

Sorry I never followed up on (2), because at the time it seemed the code was unreachable. If you already have an implementation, @sophiamaedler, feel free to open the PR.

Regarding (1), back then I also realized it would require a fair amount of restructuring, introducing Datasets and DataLoaders. Still, I think this would be worthwhile: not only to better saturate single GPUs, but also as a starting point for introducing (Distributed)DataParallel for multi-GPU training/inference as a subsequent step.
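
A rough sketch of the direction I have in mind, assuming pre-tiled images of identical shape (TileDataset and run_batched are made-up names, and the network is assumed to accept (B, C, H, W) input on a CUDA device):

import torch
from torch.utils.data import Dataset, DataLoader

class TileDataset(Dataset):
    """Hypothetical dataset over an (N, C, H, W) stack of equally sized tiles."""
    def __init__(self, tiles):
        self.tiles = torch.as_tensor(tiles)

    def __len__(self):
        return len(self.tiles)

    def __getitem__(self, idx):
        return self.tiles[idx]

def run_batched(network, tiles, batch_size=64, device="cuda"):
    # Feed whole minibatches through the network instead of one tile at a
    # time, keeping the GPU saturated.
    loader = DataLoader(TileDataset(tiles), batch_size=batch_size)
    outputs = []
    with torch.no_grad():
        for batch in loader:
            outputs.append(network(batch.to(device)).cpu())
    return torch.cat(outputs)

From there, wrapping the network in torch.nn.parallel.DistributedDataParallel and swapping in a DistributedSampler would be the natural multi-GPU extension.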

I'll probably not find the time to implement everything myself, but would love to help, if this effort could be shared.

Nespresso2000 commented 2 months ago

Was (1) ever implemented? @carsen-stringer