jluethi opened this issue 2 weeks ago
I don't think this has to do with table creation, but with overlap checks.

The `get_overlapping_pairs_3D` function is potentially costly, due to its quadratic scaling with the number of elements, and I think we are calling it in the wrong way:
```python
for i_ROI, indices in enumerate(list_indices):
    # ...
    if output_ROI_table:
        bbox_df = array_to_bounding_box_table(
            new_label_img,
            actual_res_pxl_sizes_zyx,
            origin_zyx=(s_z, s_y, s_x),
        )
        bbox_dataframe_list.append(bbox_df)
        overlap_list = []
        for df in bbox_dataframe_list:  # <--------- see here
            overlap_list.extend(
                get_overlapping_pairs_3D(df, full_res_pxl_sizes_zyx)
            )
        if len(overlap_list) > 0:
            logger.warning(
                f"{len(overlap_list)} bounding-box pairs overlap"
            )
```
I think we have two issues in the code above:

1. Each call to `get_overlapping_pairs_3D` builds the full list of overlapping pairs (with quadratic scaling in the number of labels), and then we only sum all lengths to print a warning.
2. For each `i_ROI`, we reconstruct the whole list of overlaps - including the ones corresponding to other values of `i_ROI` (see the "see here" comment in the code). At a first look, this is just wrong.

I see at least two easy solutions, even without touching point 1 above. They are both simple changes:
(A) We move the `for df in bbox_dataframe_list` block outside the `i_ROI` loop, and only run it once at the end of the loop (a minimal sketch of this option is included after the patch in (B) below).
(B) We apply a patch like:

```diff
--- a/fractal_tasks_core/tasks/cellpose_segmentation.py
+++ b/fractal_tasks_core/tasks/cellpose_segmentation.py
@@ -640,11 +640,7 @@ def cellpose_segmentation(
         bbox_dataframe_list.append(bbox_df)
-        overlap_list = []
-        for df in bbox_dataframe_list:
-            overlap_list.extend(
-                get_overlapping_pairs_3D(df, full_res_pxl_sizes_zyx)
-            )
+        overlap_list = get_overlapping_pairs_3D(bbox_df, full_res_pxl_sizes_zyx)
         if len(overlap_list) > 0:
             logger.warning(
                 f"{len(overlap_list)} bounding-box pairs overlap"
             )
```
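For completeness, here is a minimal sketch of what option (A) could look like. It simply rearranges the snippet quoted above and reuses its variable names; the exact placement inside cellpose_segmentation.py is not checked here, so treat it as an illustration rather than a ready-made patch:

```python
# Inside the i_ROI loop: only build and collect the per-ROI bounding-box table.
for i_ROI, indices in enumerate(list_indices):
    # ...
    if output_ROI_table:
        bbox_df = array_to_bounding_box_table(
            new_label_img,
            actual_res_pxl_sizes_zyx,
            origin_zyx=(s_z, s_y, s_x),
        )
        bbox_dataframe_list.append(bbox_df)

# After the loop: run the overlap check exactly once over all collected tables.
overlap_list = []
for df in bbox_dataframe_list:
    overlap_list.extend(get_overlapping_pairs_3D(df, full_res_pxl_sizes_zyx))
if len(overlap_list) > 0:
    logger.warning(f"{len(overlap_list)} bounding-box pairs overlap")
```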
A couple more details:

It would be useful to measure the runtime of `get_overlapping_pairs_3D` for one of those many-labels cases (say 1000 labels for a given `i_ROI`). Since this function does nothing other than print a warning, I would be very strict about how long a runtime we can accept.

If it turns out that it is slow, we can easily improve it in a trivial way (simply stop after the first overlap is found) or also in more systematic ways (the current approach is close to being the definition of a slow Python function: a quadratic-scaling `for` loop, which calls a pure-Python function and then even appends to a list).
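To make the "stop after the first overlap is found" idea concrete, here is a hedged sketch of an early-exit check. It does not reuse the actual `get_overlapping_pairs_3D` implementation or the bounding-box dataframe layout (neither is checked here); it only shows the overlap test on plain arrays of box corners:

```python
import numpy as np


def has_any_overlap_3d(mins: np.ndarray, maxs: np.ndarray) -> bool:
    """Return True as soon as any pair of 3D bounding boxes overlaps.

    mins/maxs: float arrays of shape (n, 3) holding the lower/upper corners
    of each box, already converted to physical units.
    """
    n = len(mins)
    for i in range(n):
        # Boxes i and j overlap iff they overlap along every axis:
        # min_i < max_j and min_j < max_i, for each of z/y/x.
        pairwise = (mins[i] < maxs[i + 1 :]) & (mins[i + 1 :] < maxs[i])
        if np.all(pairwise, axis=1).any():
            return True  # early exit: no need to enumerate all pairs
    return False
```

A fully vectorized variant (broadcasting all pairs at once) would remove the Python loop entirely, at the cost of O(n²) memory for the pairwise mask.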
I'm observing in a user experiment at FMI that the creation of output ROI tables seems to slow down the task the more ROIs are processed in an image.

The user has a big well with 138 organoid objects and is running Cellpose per organoid object. The early objects took a few minutes to process, while later objects were much slower (~20 min).

The ROI sizes appear to have a similar order of magnitude. But for later organoids, `num_labels_tot=71118` is much higher, and it looks like we get very many overlap warnings in 3D: `WARNING: 410926 bounding-box pairs overlap`.

=> Does ROI table creation rerun for all labels when an organoid's processing is finished? Anything else that would explain this? I'll need to look closer into it; just wanted to report the logs here for the time being.
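As a rough sanity check of the explanation above (assuming the slowdown really does come from the overlap check sitting inside the per-ROI loop): at each iteration the current code re-runs `get_overlapping_pairs_3D` on all tables collected so far, so the number of calls grows quadratically with the number of ROIs, and overlaps found for early organoids are reported again in every later warning. A back-of-the-envelope illustration with the 138 organoids mentioned above:

```python
# Illustrative numbers only, based on the 138 organoid objects reported above.
n_rois = 138

# Current placement: at ROI i (0-based) we re-check all i + 1 tables collected so far.
calls_inside_loop = sum(i + 1 for i in range(n_rois))  # 1 + 2 + ... + 138 = 9591

# With fix (A) or (B): each table is checked exactly once.
calls_once = n_rois  # 138

print(calls_inside_loop, calls_once)
```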