fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Review/optimize use of `get_overlapping_pairs_3D` in Cellpose task #779

Open tcompa opened 3 days ago

tcompa commented 3 days ago

Branching from #764 (fixed in principle with #778)

After a fix, it'd be useful to have a rough benchmark of get_overlapping_pairs_3D for one of those many-labels cases (say 1000 labels for a given i_ROI). Since this function does nothing other than print a warning, I would be very strict about how long a runtime we can accept.

If it turns out that it is slow, we can easily improve it in a trivial way (simply stop after the first overlap is found) or in more systematic ways (the current approach is close to being the definition of a slow Python function: a quadratically scaling for loop that calls a pure-Python function and even appends to a list).

If the function is not slow, then there's no special need for a refactor.
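The trivial early-exit fix could look like the sketch below. Both `boxes_overlap` and the flat tuple layout are illustrative stand-ins, not the actual internals of `get_overlapping_pairs_3D`:

```python
import itertools

def boxes_overlap(a, b):
    # Axis-aligned 3D boxes overlap iff their intervals intersect on every
    # axis. Each box is (z_min, z_max, y_min, y_max, x_min, x_max) -- an
    # assumed layout, not the library's actual schema.
    return all(
        a[2 * k] < b[2 * k + 1] and b[2 * k] < a[2 * k + 1] for k in range(3)
    )

def has_any_overlap(boxes):
    # Early-exit variant: return as soon as one overlapping pair is found,
    # instead of appending every pair to a list.
    for a, b in itertools.combinations(boxes, 2):
        if boxes_overlap(a, b):
            return True
    return False
```

Since the result only gates a warning, knowing whether any pair overlaps is enough, and the best case drops from quadratic to constant time.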

jluethi commented 3 days ago

Another thing to consider here: this (like the original implementation) will only check for potential bounding-box overlaps within the given ROI that's being processed, right?

A future improvement may be to check for overlaps across ROIs as well, as in theory, they could also overlap. Given that this is just for warnings, not sure how relevant this is though. Our masking should still work for this after all.

tcompa commented 3 days ago

> A future improvement may be to check for overlaps across ROIs as well, as in theory, they could also overlap. Given that this is just for warnings, not sure how relevant this is though. Our masking should still work for this after all.

For the record, making this change is quite straightforward, since we only need to move the check from inside the ROI loop to after it (EDIT: this diff is not working as written, but the idea remains valid):

diff --git a/fractal_tasks_core/tasks/cellpose_segmentation.py b/fractal_tasks_core/tasks/cellpose_segmentation.py
index ea78661a..403630a1 100644
--- a/fractal_tasks_core/tasks/cellpose_segmentation.py
+++ b/fractal_tasks_core/tasks/cellpose_segmentation.py
@@ -541,15 +541,6 @@ def cellpose_segmentation(

             bbox_dataframe_list.append(bbox_df)

-            overlap_list = get_overlapping_pairs_3D(
-                bbox_df, full_res_pxl_sizes_zyx
-            )
-            if len(overlap_list) > 0:
-                logger.warning(
-                    f"ROI {indices} has "
-                    f"{len(overlap_list)} bounding-box pairs overlap"
-                )
-
         # Compute and store 0-th level to disk
         da.array(new_label_img).to_zarr(
             url=mask_zarr,
@@ -582,6 +573,15 @@ def cellpose_segmentation(
             bbox_dataframe_list = [empty_bounding_box_table()]
         # Concatenate all ROI dataframes
         df_well = pd.concat(bbox_dataframe_list, axis=0, ignore_index=True)
+
+        overlap_list = get_overlapping_pairs_3D(
+            bbox_dataframe_list, full_res_pxl_sizes_zyx
+        )
+        if len(overlap_list) > 0:
+            logger.warning(
+                f"{len(overlap_list)} bounding-box pairs overlap"
+            )
+
         df_well.index = df_well.index.astype(str)
         # Extract labels and drop them from df_well
         labels = pd.DataFrame(df_well["label"].astype(str))
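Since the EDIT above flags the diff as broken, one plausible culprit is that `get_overlapping_pairs_3D` is handed the list `bbox_dataframe_list` rather than a single concatenated DataFrame such as `df_well`. Below is a minimal sketch of the intended flow; `count_overlapping_pairs_3d` is a hypothetical stand-in for the library function, and the column names are assumptions:

```python
import pandas as pd

# Assumed column names for this sketch, not the library's actual schema
BBOX_COLS = ["z_min", "z_max", "y_min", "y_max", "x_min", "x_max"]

def count_overlapping_pairs_3d(df: pd.DataFrame) -> int:
    # Hypothetical stand-in for get_overlapping_pairs_3D: count row pairs
    # whose axis-aligned bounding boxes intersect on all three axes.
    boxes = df[BBOX_COLS].to_numpy()
    n_overlaps = 0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            if all(
                a[2 * k] < b[2 * k + 1] and b[2 * k] < a[2 * k + 1]
                for k in range(3)
            ):
                n_overlaps += 1
    return n_overlaps

# One bounding box per ROI; they overlap across ROIs but not within either one
roi_a = pd.DataFrame([[0, 2, 0, 2, 0, 2]], columns=BBOX_COLS)
roi_b = pd.DataFrame([[1, 3, 1, 3, 1, 3]], columns=BBOX_COLS)
bbox_dataframe_list = [roi_a, roi_b]

# Key point of the patch: concatenate first, then run ONE well-level check
df_well = pd.concat(bbox_dataframe_list, axis=0, ignore_index=True)
print(count_overlapping_pairs_3d(df_well))  # -> 1 (cross-ROI overlap found)
```

Each per-ROI table reports zero overlaps on its own, so only the well-level check catches this pair.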

However, we should not do this without an actual benchmark, or we risk introducing very bad scaling. If there are N organoids with M labels each (under the simplifying assumption that each organoid has the same number of internal objects), we have:

  1. Current main: (N(N-1)/2) * (M(M-1)/2) overlap checks
  2. After #778: N * M(M-1)/2 overlap checks
  3. After the patch described in this comment: NM(NM-1)/2 overlap checks

E.g. for N=138 organoids (ref #764) and M=100 labels (a random guess), we'd have:

  1. 46*10^6 checks
  2. 683k checks
  3. 95*10^6 checks
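For reference, the three counts above can be reproduced with a few lines (the function name is just for this sketch):

```python
def overlap_check_counts(n: int, m: int) -> tuple[int, int, int]:
    """Pairwise overlap checks for n ROIs with m labels each."""

    def pairs(k: int) -> int:
        # Number of unordered pairs among k items: k * (k - 1) / 2
        return k * (k - 1) // 2

    current_main = pairs(n) * pairs(m)  # 1. behavior before #778
    per_roi = n * pairs(m)              # 2. after #778: one check per ROI
    well_level = pairs(n * m)           # 3. patch above: one global check
    return current_main, per_roi, well_level

print(overlap_check_counts(138, 100))  # -> (46792350, 683100, 95213100)
```

This confirms the rounded figures in the list: roughly 46*10^6, 683k, and 95*10^6 checks.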

Note that:

> If it turns out that it is slow, we can easily improve it in a trivial way (simply stop after the first overlap is found) or in more systematic ways (the current approach is close to being the definition of a slow Python function: a quadratically scaling for loop that calls a pure-Python function and even appends to a list).

jluethi commented 3 days ago

Agreed on not doing this now. More something to test if this gets benchmarked further :)

I'd say we go with the fix as you implemented it here, and consider most of this benchmarking in the context of a future Cellpose task refactor to use the new OME-Zarr reader/writer strategy.