Create core feature measurements

fractal-analytics-platform / fractal-client

Command-line client for Fractal

https://fractal-analytics-platform.github.io/fractal-client

BSD 3-Clause "New" or "Revised" License

45 stars 1 forks source link

Create core feature measurements #69

Closed jluethi closed 2 years ago

jluethi commented 2 years ago

We want to build some core feature measurements into Fractal. As a start, I'd suggest we build the regionprops measurements into Fractal, as an easy implementation of relevant feature measurements.

For feature measurements, especially things like measuring volume and morphologies, we need to keep in mind that our Z dimension is different from our X & Y dimension (quite drastically, e.g. the UZH test set is 0.1625 um in x & y, 1um in Z). This is metadata we will need to pass through the whole pipeline (OME-Zarr has a spot where these scale parameters can be saved, will document this further. Also, could either be an input in the initial Zarr parsing or something we read from the metadata file, see here: https://github.com/fractal-analytics-platform/mwe_fractal/issues/46). This needs to be taken into account for some measurements. Regionprops doesn't have core support for this. We'll need to check what workarounds have been used in the Liberali lab so far.

Later, we can start implementing more feature measurements that we built into the abbott library in the Pelkmans lab (you should all have received invitations to the library just now), see here: https://github.com/pelkmanslab/abbott/blob/master/lib/src/abbott/feature_extraction.py These are build around ITK, a more powerful but also more complex measurement library

=> let's start building some measurements with regionprops and then extend later to the abbott measurements

jluethi commented 2 years ago

As a good first step, let's implement these measurements: https://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.regionprops

They take a label image and an optional intensity image.

Based on the label image, things like area, centroid, eccentricity, perimeter and others can be measured for the given objects

Based on a label image + an intensity image, we can get intensity statistics per object, like mean intensity, min, max, sum, standard deviation etc.

A user may need to decide which channels need to be measured for which objects (=> which label & intensity image combinations we run measurements for). All measurements for a given label image should then be saved to the same table that points to that label image

tcompa commented 2 years ago

For feature measurements, especially things like measuring volume and morphologies, we need to keep in mind that our Z dimension is different from our X & Y dimension (quite drastically, e.g. the UZH test set is 0.1625 um in x & y, 1um in Z). This is metadata we will need to pass through the whole pipeline (OME-Zarr has a spot where these scale parameters can be saved, will document this further. Also, could either be an input in the initial Zarr parsing or something we read from the metadata file, see here: fractal-analytics-platform/fractal-tasks-core#25). This needs to be taken into account for some measurements. Regionprops doesn't have core support for this. We'll need to check what workarounds have been used in the Liberali lab so far.

I think that this should be doable with the spacing argument of regionprops.

This got merged in April 2022 (https://github.com/scikit-image/scikit-image/pull/6296), thus we should check whether some recent version of scikit-image is needed. The last release (version 0.19.3) doesn't have it yet:

Traceback (most recent call last):
  File "prototype_measurement_regionprops.py", line 15, in <module>
    props = regionprops(label_image, intensity_image=intensity_image, spacing=(1,1,1))
TypeError: regionprops() got an unexpected keyword argument 'spacing'

tcompa commented 2 years ago

As a future reference, the one below is a trivial piece of code that loads a mask and the corresponding image and just applies regionprops. Turning into a task requires some more logic about where to read data/labels and how/where to store the measurements (which ones?).

import numpy as np
import dask.array as da
from skimage.measure import regionprops

rootdir = ("/data/active/fractal/tests/Temporary_data_UZH_1_well_2x2_sites/" +
          "20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/")
ind_channel = 0
ind_level = 0

intensity_image = da.from_zarr(rootdir + f"{ind_level}")[ind_channel]
label_image = da.from_zarr(rootdir + f"labels/label_DAPI/{ind_level}")
assert label_image.shape == intensity_image.shape
print(label_image.shape)

props = regionprops(label_image, intensity_image=intensity_image)
print(len(props))

For the record: image size is (10, 4320, 5120), number of labels is 13k, and this snippet runs in a few seconds.

jluethi commented 2 years ago

Neat! Let's first see whether we can do the measurements as part of the napari assistant workflows (see here: https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/32). If not, then let's implement a simple regionprops for some of the standard intensity & morphology features ourselves :)

jluethi commented 2 years ago

We can use the napari workflows framework to calculate feature measurements and return dataframes.

The napari-skimage-regionprops library wraps regionprops and makes it nicely callable from the napari assistant. It then creates a table of feature measurements in napari. Unfortunately, the assistant doesn't yet have support to add measurements to the napari workflow when they are made, but that sounds like something that can be fixed soon (see here: https://github.com/haesleinhuepf/napari-assistant/issues/27).

In the meantime, I will manually create some napari-workflows pipelines that return feature measurements tables, because the napari-workflows library supports dataframes as an output :)

=> we don't need to create our own feature measurement wrappers, we can build upon existing wrappers like napari-skimage-regionprops and contribute additional features to them where needed (e.g. support for anisotropy if it's not in there yet)

See dummy case here of how a napari-workflow with dataframes as an output can look (full examples using regionprops are in https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/31):

from napari_workflows import Workflow
from napari_workflows._io_yaml_v1 import load_workflow, save_workflow

from skimage.io import imread, imshow
from skimage.filters import gaussian
import numpy as np
import time
import pandas as pd

def pseudo_measurements(img):
    df = pd.DataFrame([1, 2, 3, 4])
    return df

w = Workflow()

# define denoising
w.set("denoised", gaussian, "input", sigma=2)

w.set("output_1", gaussian, "output", sigma=3)

w.set("output_2", pseudo_measurements, "output_1")

w.set("input", np.random.random((3, 3)))

print(str(w))

df = w.get("output_2")

print(df)

jluethi commented 2 years ago

For August, let's focus on getting the core measurements with regionprops and saving them to an AnnData table in the OME-Zarr file. I think we should aim for having a second AnnData table for each well: nuclear_measurements_table. This is an addition to the existing ROI table.

To get this measurement, we can use the napari workflows in a very simple manner (needs to be looped over ROIs):

from napari_workflows import Workflow
from napari_workflows._io_yaml_v1 import load_workflow, save_workflow
from napari_skimage_regionprops._regionprops import regionprops_table

# Get the workflow
napari_workflow = load_workflow('regionprops_from_existing_labels.yaml')

# Set the input images: DAPI channel image & the label image for the current ROI
napari_workflow.set("dapi_img", dapi_img) # dapi_img is the dapi channel image for the current ROI
napari_workflow.set("label_img", label_img) # label_img is the nuclear label image (generated by cellpose) for the current ROI

# Run the workflow
df = napari_workflow.get('regionprops_DAPI')

Workflow (as .txt file, rename to .yaml): regionprops_from_existing_labels.txt

The task will need to loop over (or run in parallel) all ROIs per well. The resulting dataframes per ROI can then be concatenated & saved as an AnnData table per well.

This tasks brings 2 new dependencies: napari_workflows and napari_skimage_regionprops.

tcompa commented 2 years ago

I started from our typical test dataset (single well, 2x2 FOVs), with labels that look reasonable: Screenshot from 2022-08-29 11-18-31

I then run a test script:

import dask.array as da
import numpy as np
from napari_workflows._io_yaml_v1 import load_workflow

zarrdir = "/data/active/fractal/tests/Temporary_data_UZH_1_well_2x2_sites/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0"

level = 0
dapi_img = da.from_zarr(f"{zarrdir}/{level}")[0]
label_img = da.from_zarr(f"{zarrdir}/labels/label_DAPI/{level}")

dapi_shape = dapi_img.shape
label_shape = label_img.shape

# Find upscale_factor for labels array
assert dapi_shape[0] == label_shape[0]
upscale_factor = dapi_shape[1] // label_shape[1]
assert label_shape[1] * upscale_factor == dapi_shape[1]
assert upscale_factor == dapi_shape[2] // label_shape[2]

# Upscale labels array (https://stackoverflow.com/a/7525345/19085332)
label_img_up_x = np.repeat(label_img, upscale_factor, axis=2)
label_img_up = np.repeat(label_img_up_x, upscale_factor, axis=1)
assert label_img_up.shape == dapi_shape

# Select a single ROI
ROI = (slice(0, 10), slice(0, 2160 // 2 ** level), slice(0,2560 // 2 ** level))

print(f"Resolution level:      {level}")
print(f"Labels upscale factor: {upscale_factor}") 
print(f"Whole-array shape:     {dapi_shape}")
print(f"Single-ROI shape:      {dapi_img[ROI].shape}")
print()

# Get the workflow
napari_workflow = load_workflow("regionprops_from_existing_labels.yaml")

# Set the input images: DAPI channel image & the label image for the current ROI
napari_workflow.set("dapi_img", dapi_img[ROI])
napari_workflow.set("label_img", label_img_up[ROI])

# Run the workflow
df = napari_workflow.get('regionprops_DAPI')

which seems to run through. Run time (with --cpus-per-task=32, leading to a ~3200% CPU usage) is around 8 minutes, for processing a single 10x2160x2560 ROI. This is not great for rapid prototyping/testing, but we can live with it.

The output pandas dataframe looks like

     Unnamed: 0  label   area  bbox_area  convex_area  equivalent_diameter  ...  moments_normalized-3-3-2  moments_normalized-3-3-3  standard_deviation_intensity  minor_axis_length  intermediate_axis_length  major_axis_length
0             0      1   5872      12992         7100            22.383559  ...              7.614688e-06              2.366689e-06                     90.392161           7.326875                 28.845661          57.788916
1             1      2  33432      61952        40385            39.968813  ...              2.390541e-06              3.064581e-06                     79.293774           8.113023                 80.009449         102.389186
2             2      3  29396      50688        34103            38.290985  ...              3.424896e-06             -3.696872e-07                     94.828690           8.339921                 72.956306          94.801798
3             3      4  27844      43520        31071            37.604891  ...              9.490300e-08              9.519223e-07                     98.998050           9.000653                 72.720977          85.465106
4             4      5  18008      25872        19344            32.520381  ...              1.029923e-05              4.430523e-06                     76.368138           8.417954                 49.728946          88.750665
..          ...    ...    ...        ...          ...                  ...  ...                       ...                       ...                           ...                ...                       ...                ...
940         940    941    444        792          588             9.465163  ...             -3.684569e-05             -1.275243e-05                     49.523391           3.001395                 12.439926          25.285901
941         941    942    984       2520         1583            12.340483  ...              3.285643e-05             -7.208546e-05                     43.710506           3.594016                 24.240025          37.316303
942         942    943   2040       4560         3164            15.735378  ...              1.543890e-05             -4.116825e-06                     44.078668           3.491744                 31.604142          41.318568
943         943    944    116        240          173             6.050897  ...              1.146538e-04             -5.834331e-05                     87.446301           3.145374                  9.157059          10.032667
944         944    945    276        648          417             8.077993  ...              2.091570e-05             -5.764227e-06                    102.178972           2.092620                 13.333968          23.199819

[945 rows x 223 columns]

Iterating over ROIs will produce 2*2=4 dataframes like this. What then? Should we just concatenate them into a single one and dump into a new AnnData table next to tables/FOV_ROI_table? I guess that's the goal, but this is just to make sure.

tcompa commented 2 years ago

Useful gotcha: I tried working at a coarser pyramid level (i.e., setting a value of level larger than 0), to reduce run times. This leads to some "corrupt" labels (I guess it is for those which were extremely small at the original resolution, and then become 1D at a coarser level). The resulting error looks like


QH6013 qhull input error: input is less than 3-dimensional since all points have the same x coordinate    0

While executing:  | qhull i Qt
Options selected for Qhull 2019.1.r 2019/06/21:
  run-id 1256097137  incidence  Qtriangulate  _pre-merge  _zero-centrum
  _max-width  1  Error-roundoff 1.4e-15  _one-merge 9.7e-15
  _near-inside 4.9e-14  Visible-distance 2.8e-15  U-max-coplanar 2.8e-15
  Width-outside 5.5e-15  _wide-facet 1.7e-14  _maxoutside 1.1e-14

  return convex_hull_image(self.image)
/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_regionprops.py:395: UserWarning: Failed to get convex hull image. Returning empty image, see error message below:
QH6013 qhull input error: input is less than 3-dimensional since all points have the same x coordinate    0

While executing:  | qhull i Qt
Options selected for Qhull 2019.1.r 2019/06/21:
  run-id 1256113944  incidence  Qtriangulate  _pre-merge  _zero-centrum
  _max-width  1  Error-roundoff 1.4e-15  _one-merge 9.7e-15
  _near-inside 4.9e-14  Visible-distance 2.8e-15  U-max-coplanar 2.8e-15
  Width-outside 5.5e-15  _wide-facet 1.7e-14  _maxoutside 1.1e-14

  return convex_hull_image(self.image)
/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_regionprops.py:577: RuntimeWarning: divide by zero encountered in long_scalars
  return self.area / self.area_convex
Traceback (most recent call last):
  File "measurement.py", line 41, in <module>
    df = napari_workflow.get('regionprops_DAPI')
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/napari_workflows/_workflow.py", line 83, in get
    return dask_get(self._tasks, name)
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/napari_skimage_regionprops/_regionprops.py", line 103, in regionprops_table
    table = sk_regionprops_table(np.asarray(labels).astype(int), intensity_image=np.asarray(image),
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_regionprops.py", line 996, in regionprops_table
    return _props_to_dict(
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_regionprops.py", line 807, in _props_to_dict
    column_buffer[i] = regions[i][prop]
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_regionprops.py", line 675, in __getitem__
    value = getattr(self, key, None)
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_regionprops.py", line 434, in feret_diameter_max
    coordinates, _, _, _ = marching_cubes(identity_convex_hull,
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_marching_cubes_lewiner.py", line 133, in marching_cubes
    return _marching_cubes_lewiner(volume, level, spacing,
  File "/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_marching_cubes_lewiner.py", line 177, in _marching_cubes_lewiner
    raise ValueError("Surface level must be within volume data range.")
ValueError: Surface level must be within volume data range.

jluethi commented 2 years ago

Great to see, thanks @tcompa !

Run time (with --cpus-per-task=32, leading to a ~3200% CPU usage) is around 8 minutes, for processing a single 10x2160x2560 ROI. This is not great for rapid prototyping/testing, but we can live with it.

Runtime of 8 min is just for the measurement pipeline? For a single FOV or all 4? Or including running segmentation? I'd be a bit surprised if the measurement themselves take 8 min, but if they do, that's a lower priority thing to profile and optimize. As long as it runs, that's already nice :)

Iterating over ROIs will produce 2*2=4 dataframes like this. What then? Should we just concatenate them into a single one and dump into a new AnnData table next to tables/FOV_ROI_table? I guess that's the goal, but this is just to make sure.

Yes, that would also be my thinking. Concatenate per well and save a single table per object type. e.g. a tables/nuclei and at some other point a tables/organoids or tables/cells for each well.

Useful gotcha

Hmm, interesting to know! Makes sense for this one. We could think about how to handle such things in general later as part of the general napari workflows part. This looks like something where one could implement a try except in the measurement library and return NaNs in this edge case, if this happens often.

tcompa commented 2 years ago

Runtime of 8 min is just for the measurement pipeline? For a single FOV or all 4? Or including running segmentation? I'd be a bit surprised if the measurement themselves take 8 min, but if they do, that's a lower priority thing to profile and optimize. As long as it runs, that's already nice :)

It's for the script I included in my comment: just measurement, for a single FOV. I may have something wrong somewhere, of course, but the current status is 8 minutes per FOV (with a lot of CPU usage).

jluethi commented 2 years ago

Ok, may also be some suboptimal implementation of something in that measurement library. Anyway, 8min is not amazing, but let's worry about the runtime optimization for this later :)

tcompa commented 2 years ago

Just as a future reference: In a 72-FOVs well, 8 minutes per ROI (with a sequential scan of ROIs) means ~10 hours for the measurement task. And even more, if there are more than 10 Z planes.

My little experience about expected run times suggests this is too much, calling for either:

Some optimization in the measurement library;
The parallelization of the current best candidate for ROIs (https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/27);
Possibly both ;)

jluethi commented 2 years ago

One thing we could try to make it run faster for our tests is try this version that only calculates a subset of the features (putting 4 of the 6 feature options from true to false). I think this should run faster regionprops_from_existing_labels_feature_subset.txt

tcompa commented 2 years ago

One thing we could try to make it run faster for our tests is try this version that only calculates a subset of the features (putting 4 of the 6 feature options from true to false). I think this should run faster regionprops_from_existing_labels_feature_subset.txt

This leads to a ~3 minutes run time.

tcompa commented 2 years ago

One thing we could try to make it run faster for our tests is try this version that only calculates a subset of the features (putting 4 of the 6 feature options from true to false). I think this should run faster regionprops_from_existing_labels_feature_subset.txt

This leads to a ~3 minutes run time.

With this output:

     Unnamed: 0  label   area  bbox_area  convex_area  equivalent_diameter  max_intensity  mean_intensity  min_intensity  standard_deviation_intensity
0             0      1   5872      12992         7100            22.383559          547.0      251.193290           36.0                     90.392161
1             1      2  33432      61952        40385            39.968813          555.0      218.826274           23.0                     79.293774
2             2      3  29396      50688        34103            38.290985          691.0      271.551368           41.0                     94.828690
3             3      4  27844      43520        31071            37.604891          632.0      282.951982           38.0                     98.998050
4             4      5  18008      25872        19344            32.520381          476.0      224.268159           30.0                     76.368138
..          ...    ...    ...        ...          ...                  ...            ...             ...            ...                           ...
940         940    941    444        792          588             9.465163          358.0      227.630631           89.0                     49.523391
941         941    942    984       2520         1583            12.340483          300.0      155.902439           47.0                     43.710506
942         942    943   2040       4560         3164            15.735378          325.0      154.262745           42.0                     44.078668
943         943    944    116        240          173             6.050897          551.0      299.758621           69.0                     87.446301
944         944    945    276        648          417             8.077993          605.0      317.826087          127.0                    102.178972

[945 rows x 10 columns]

jluethi commented 2 years ago

Let's work with that subset then for the moment. Area, mean intensity & standard_deviation_intensity are the core measurements we'd like to see. The others could be interesting depending on use case :)

3 min still isn't an amazing runtime, of course. But that would likely be way faster for 2D measurements. 3D measurements can be a bit slow. We have implementations of measurements through ITK instead of scikit image, so we can compare whether the other library speed things up. Plus, as you mentioned above, we can look into parallelization within the well. But that's a broader discussion than just the measurement task :)

jluethi commented 2 years ago

Also, if we want to run faster tests, let's run them on the MIP dataset. I tested it mainly on those during the development of the different napari workflows and just validated that they still run on 3D. It should run < 1min for the 2D ROIs in the MIPs :)

tcompa commented 2 years ago

Here is one more step towards an actual fractal task implementation. The main to-do list now requires turning this script into a function, then into a fractal core task, then writing a test. Starting from here, we will later move towards a more general wrapper of napari workflows.

As we are still prototyping, here are a couple of comments on the current implementation shown below:

This is very long/verbose code, even if the actual "running" part only consists in a dozen LOCs in the middle of the code. We already received a similar criticism for other tasks. I'm open to suggestions on how to make this simpler, but there are several parts (some safety checks, the upscaling part, the pandas/anndata conversion) where I cannot see an immediate simplification as easy to achieve.
Re: upscaling, I'm using numpy.repeat. @jluethi, is this the same that you use for this task? If there are better ideas, let me know.
In my understanding, anndata tables should have a single dtype for all columns (am I right?). This means that we have to convert some of the int columns (e.g. area) to floats.
I tried to keep the inputs as close as possible to the structure we'll need later when porting to the new-architecture.
I think this would work also on 2D data, but I've not tested it yet.

Here is the code:

import json
import dask.array as da
import numpy as np
from napari_workflows._io_yaml_v1 import load_workflow
import anndata as ad
import pandas as pd
from anndata.experimental import write_elem
import zarr

from fractal.tasks.lib_regions_of_interest import convert_ROI_table_to_indices
from fractal.tasks.lib_zattrs_utils import extract_zyx_pixel_sizes

### THESE WILL BE THE FUNCTION INPUTS
# Inputs
in_path = "/data/active/fractal/tests/Temporary_data_UZH_1_well_2x2_sites"
component = "20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0"
labeling_channel = "A01_C01"
level = 0
metadata = dict(coarsening_xy=2, channel_list = ["A01_C01"])
workflow_file = "regionprops_from_existing_labels.yaml"
table_name = "nuclei"
######################################

# Load some variables from metadata
coarsening_xy = metadata["coarsening_xy"]
chl_list  = metadata["channel_list"]

# Find channel index
if labeling_channel not in chl_list:
    raise Exception(f"ERROR: {labeling_channel} not in {chl_list}")
ind_channel = chl_list.index(labeling_channel)

# Load zattrs file
zattrs_file = f"{in_path}/{component}/.zattrs"
with open(zattrs_file, "r") as jsonfile:
    zattrs = json.load(jsonfile)

# Try to read channel label from OMERO metadata
try:
    omero_label = zattrs["omero"]["channels"][ind_channel]["label"]
    label_name = f"label_{omero_label}"
except (KeyError, IndexError):
    label_name = f"label_{ind_channel}"

# Set level=0, to avoid possible errors, see
# https://github.com/fractal-analytics-platform/fractal/issues/69#issuecomment-1230074703
if level > 0:
    raise Exception("Measurement should be for level=0")

# Load arrays
img = da.from_zarr(f"{in_path}/{component}/{level}")[ind_channel]
label_img = da.from_zarr(f"{in_path}/{component}/labels/{label_name}/{level}")

# Find upscale_factor for labels array
img_shape = img.shape
label_shape = label_img.shape
upscale_factor = img_shape[1] // label_shape[1]
if img_shape[0] != label_shape[0]:
    raise Exception("Error in dapi/label array shapes",
                     img_shape, label_shape)

if (label_shape[1] * upscale_factor != img_shape[1] or
    upscale_factor != img_shape[2] // label_shape[2]):
    raise Exception("Error in dapi/label array shapes",
                     img_shape, label_shape,
                     f"with {upscale_factor=}")

# Upscale labels array - see https://stackoverflow.com/a/7525345/19085332
label_img_up_x = np.repeat(label_img, upscale_factor, axis=2)
label_img_up = np.repeat(label_img_up_x, upscale_factor, axis=1)
assert label_img_up.shape == img_shape

# Read FOV ROIs
FOV_ROI_table = ad.read_zarr(f"{in_path}/{component}/tables/FOV_ROI_table")

# Read pixel sizes from zattrs file
full_res_pxl_sizes_zyx = extract_zyx_pixel_sizes(
    f"{in_path}/{component}/.zattrs", level=0
)

# Create list of indices for 3D FOVs spanning the entire Z direction
list_indices = convert_ROI_table_to_indices(
    FOV_ROI_table,
    level=level,
    coarsening_xy=coarsening_xy,
    full_res_pxl_sizes_zyx=full_res_pxl_sizes_zyx,
)

# Get the workflow
napari_workflow = load_workflow(workflow_file)

print(f"Workflow file:         {workflow_file}")
print(f"Resolution level:      {level}")
print(f"Labels upscale factor: {upscale_factor}")
print(f"Whole-array shape:     {img_shape}")

# Loop over FOV ROIs
list_dfs = []
for indices in list_indices:
    s_z, e_z, s_y, e_y, s_x, e_x = indices[:]
    ROI = (slice(s_z, e_z), slice(s_y, e_y), slice(s_x, e_x))
    print(f"Single-ROI shape:      {img[ROI].shape}")

    # Set the input images: DAPI channel image & the label image for the current ROI
    napari_workflow.set("dapi_img", img[ROI])
    napari_workflow.set("label_img", label_img_up[ROI])

    # Run the workflow
    df = napari_workflow.get('regionprops_DAPI')
    list_dfs.append(df)

# Concatenate all FOV dataframes
df_well = pd.concat(list_dfs, axis=0)

# Convert all to float (warning: some would be int, in principle)
df_well = df_well.astype(np.float32)

# Drop unnecessary columns
df_well.drop(labels=["Unnamed: 0", "label"], axis=1, inplace=True)

# Reset index (to avoid non-unique labels), and set it to str
# (to avoid ImplicitModificationWarning in anndata)
df_well.index = np.arange(0, len(df_well.index)).astype(str)

# Convert to anndata
measurement_table = ad.AnnData(df_well, dtype=df_well.dtypes)

group_tables = zarr.group(f"{in_path}/{component}/tables/")  # noqa: F841
write_elem(group_tables, table_name, measurement_table)

tcompa commented 2 years ago

Question: what is a reasonable consistency check on the output of this task? Is there a simple way to visualize e.g. centroids in napari, or even better some expected property that we could check programmatically in a test?

I'm asking because we still need to design/refine our integration tests (they are currently mostly working on fake data, but later on we could have a very small dataset to run on).

jluethi commented 2 years ago

Nice, cool to see this!

This is very long/verbose code, even if the actual "running" part only consists in a dozen LOCs in the middle of the code. We already received a similar criticism for other tasks. I'm open to suggestions on how to make this simpler, but there are several parts (some safety checks, the upscaling part, the pandas/anndata conversion) where I cannot see an immediate simplification as easy to achieve.

Good discussion topic. If our wrapper for napari workflows needs to be somewhat verbose, so be it. People should be providing workflows and not have to worry about the wrapper. But in general, I think the more tasks we build, the more we will need the same functionality in different tasks (e.g. Find upscale_factor for labels array or such) and make a library function for those that the task just calls. But let's make sure it works first, before we generalize :)

Re: upscaling, I'm using numpy.repeat. @jluethi, is this the same that you use for this task? If there are better ideas, let me know.

Sounds good to me. Not sure if there are better upscaling methods for label images, but I would also go with this

In my understanding, anndata tables should have a single dtype for all columns (am I right?). This means that we have to convert some of the int columns (e.g. area) to floats.

That's fine. Yes, all values in X should probably be floats for what we do. Converting area to float is totally fine. For other things, e.g. if we store labels, it would make more sense to store them in the the .obs (which is a pandas table => column can be int, str etc).

I tried to keep the inputs as close as possible to the structure we'll need later when porting to the new-architecture.

Neat!

I think this would work also on 2D data, but I've not tested it yet.

Do we load 2D data as 3D with shape (1, len_y, len_x) or (len_y, len_x)? If we do the second, some of the axis logic for upscaling would not work (hard-set axis 1 & 2 for x & y). Otherwise, looks good to me!

Question: what is a reasonable consistency check on the output of this task? Is there a simple way to visualize e.g. centroids in napari, or even better some expected property that we could check programmatically in a test?

Very good question! We could save a small dataframe with actual measurements and use a np.assert_almost_equal check on them or something similar (e.g. https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_frame_equal.html?) For initial checks of the dataframe, we'd have to sample a few example measurements. If we have some test data, I have a workflow to visualize measurements back on label images that allows us to cross-check whether they make sense. I can manually check and then we can just ensure that those measurements don't change. Maybe the 2x2 test case is reasonably small? Or would we want to go for something smaller (e.g. smaller images maybe? Just one FOV?)

I will look through the rest of the code later today and give feedback then :)

tcompa commented 2 years ago

I will look through the rest of the code later today and give feedback then :)

Just don't run the current version (https://github.com/fractal-analytics-platform/fractal/commit/5ad77900fd65e7760b3e07ecb3f6d746bb6f3607), since there's still some glitch when passing from pandas to anndata and then writing to zarr.

jluethi commented 2 years ago

@tcompa I just read through the code more carefully, didn't run anything myself yet. I think the convert_ROI_table_to_indices function is a great example of how we can slowly generalize some of the preparation code into library code & simplify the tasks :) For the indexing: have you checked the slice sizes? Not sure how the input indices are generated, just checking that we don't run into an off-by-one error there

You're dropping the label column: It will be very important that we have those saved to the AnnData object. Probably best as part of obs. I have some example code of saving it there somewhere. Basically, we can set the obs to a dataframe, here probably a dataframe containing the label column. We use the label column to match measurements to the actual objects in the images. They are not a measurement, but critical metadata. Using something like df_well.index = np.arange(0, len(df_well.index)).astype(str) can be tricky (if this was meant to be the labels). Resetting the index is a good idea. But if we use it to match measurement to objects, we have two assumptions that may not hold: 1) That we will always get the dataframes ordered the same as the labels 2) That the labels are always consecutive, i.e. that no label is missing (which they are in many workflows).

Other than that, looks great! Looking forward to testing it once we have it Fractal (or earlier, if you want to ping me once you have a version that will run without the pandas to anndata issue)

tcompa commented 2 years ago

For the indexing: have you checked the slice sizes? Not sure how the input indices are generated, just checking that we don't run into an off-by-one error there

The FOV-ROI behavior is documented in https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/18. The four FOVs (in this script) are:

(slice(0, 10, None), slice(0, 2160, None), slice(0, 2560, None))
(slice(0, 10, None), slice(0, 2160, None), slice(2560, 5120, None))
(slice(0, 10, None), slice(2160, 4320, None), slice(0, 2560, None))
(slice(0, 10, None), slice(2160, 4320, None), slice(2560, 5120, None))

which looks correct to me. To make sure there's no missing pixel, you can run

x = ["a", "b", "c", "d"]
assert x[slice(0, 2)] == ["a", "b"]
assert x[slice(2, 4)] == ["c", "d"]

You're dropping the label column: It will be very important that we have those saved to the AnnData object. Probably best as part of obs. I have some example code of saving it there somewhere. Basically, we can set the obs to a dataframe, here probably a dataframe containing the label column. We use the label column to match measurements to the actual objects in the images. They are not a measurement, but critical metadata. Using something like df_well.index = np.arange(0, len(df_well.index)).astype(str) can be tricky (if this was meant to be the labels). Resetting the index is a good idea. But if we use it to match measurement to objects, we have two assumptions that may not hold: 1) That we will always get the dataframes ordered the same as the labels 2) That the labels are always consecutive, i.e. that no label is missing (which they are in many workflows).

I mixed up labels and indices. And since for indices I was getting a "duplicated entry" (or similar) error upon concatenating dataframes (because they restart from the beginning, for each FOV), one solution was to simply reset them.

I'll put back labels (as indices of the df) and update this issue.

Looking forward to testing it once we have it Fractal (or earlier, if you want to ping me once you have a version that will run without the pandas to anndata issue)

This was a trivial dtype issue, solved with https://github.com/fractal-analytics-platform/fractal/commit/68e5d4ee72df023d5a60368b9c1b590dc740e857. If you'd like to test it outside fractal, go ahead.

tcompa commented 2 years ago

Naively applying this same script to a MIP dataset fails, with errors including

  return convex_hull_image(self.image)
/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/skimage/measure/_regionprops.py:395: UserWarning: Failed to get convex hull image. Returning empty image, see error message below:
QH6013 qhull input error: input is less than 3-dimensional since all points have the same x coordinate    0

This is no surprise, as written in https://github.com/scikit-image/scikit-image/blob/c95a8fd46a49a38de1f6032c84d9d61160b78283/skimage/measure/_regionprops.py#L1034-L1039:

     .. versionchanged:: 0.14.1
            Previously, ``label_image`` was processed by ``numpy.squeeze`` and
            so any number of singleton dimensions was allowed. This resulted in
            inconsistent handling of images with singleton dimensions. To
            recover the old behaviour, use
            ``regionprops(np.squeeze(label_image), ...)``.

I've fixed it by hand (locally), by adding an extra check of whether data are 2D or 3D.

Small gotcha: our Z axis is regionprops' X axis, which made the error message confusing at a first look.

tcompa commented 2 years ago

The fix in https://github.com/fractal-analytics-platform/fractal/commit/0fe5d6c2f29f0031b0c3216865794b8d61e0dd42, to correctly handle labels, isn't the right one yet. After I save some nuclei ROIs, if I then look at them with

import anndata as ad
adata = ad.read_zarr("/data/active/fractal/tests/Temporary_data_UZH_1_well_2x2_sites/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/tables/nuclei")

I receive a warning

/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/anndata/_core/anndata.py:1828: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")

And in fact if I look at the number of labels,

num_labels_unique = len(set(list(adata.obs_names)))
num_labels = len(list(adata.obs_names))

I find (for a specific MIP example)

num_labels=2993
num_labels_unique=2914

More digging is needed (also to see whether this is common to 2D-whole-well labeling and 3D-per-FOV labeling).

tcompa commented 2 years ago

Re: last comment (issue with duplicate labels)

This issue does not appear for the example with 3D data (segmented per-FOV, and then relabeled). Thus debugging is needed for the 2D case.

jluethi commented 2 years ago

Hmm, that's very suspicious indeed. I have one idea what it could be here: Is the MIP segmented as a whole well, but then processed ROI by ROI (=> FOV by FOV)? In that case, we would get measurements for labels that cross the border from each ROI, thus duplicates. While if we run 3D segmentation, that is typically run per FOV, thus has unique labels per FOV (& maybe cells that are cut in half at the border between FOVs, but they have unique labels).

Conclusion: We should be consistent in the processing approach/parallelisation scheme from segmentation onward. If we segment per well, we should also make measurements per well.

tcompa commented 2 years ago

Gotcha, the problem with 2D is the following:

If we segment a whole-well image, and then split labels into four FOVs, all labels that overlap with FOV boundaries will replicated at least twice. A quick fix (https://github.com/fractal-analytics-platform/fractal/commit/51d4b43b2318d3064c569e2c71d61b0815843125) is to add a whole_well=True/False flag:

If True, we only consider a single ROI (for the entire well), after checking that this is 2D;
If False, we consider the usual FOV-ROIs.

Comments?

jluethi commented 2 years ago

Yes, I think that should work (see comments above, we got there at the same time it seems). If segmentation has been performed for the whole well, so should the measurements.

We can discuss whether we should actually define an explicit ROI for the whole well, then this would just be a question of choosing the correct ROI (/parallelization level). It's not necessarily required to have a ROI per well, but it would generalize this kind of logic. Plus, once we get into multiplexing registration, it would simplify the logic of defining the shared area between all cycles.

There would be some overhead to this though of having to update these ROIs if any dimensions change. But I do think dimension changes are rare and when they happen, a well ROI would actually be useful to have.

tcompa commented 2 years ago

We can discuss whether we should actually define an explicit ROI for the whole well, then this would just be a question of choosing the correct ROI (/parallelization level). It's not necessarily required to have a ROI per well, but it would generalize this kind of logic. Plus, once we get into multiplexing registration, it would simplify the logic of defining the shared area between all cycles.

There would be some overhead to this though of having to update these ROIs if any dimensions change. But I do think dimension changes are rare and when they happen, a well ROI would actually be useful to have.

Let's move this discussion to a new issue?

jluethi commented 2 years ago

Sounds good, I'll create one for this discussion. For the moment, a whole_well flag should do the trick for the measurements :)

jluethi commented 2 years ago

=> let's have the well ROI discussion here: https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/16

jluethi commented 2 years ago

I've fixed it by hand (locally), by adding an extra check of whether data are 2D or 3D.

Does this only occur if we load the image data and the label data differently? Or do both img & label_img need to be 2D if the z axis is only of size 1?

Small gotcha: our Z axis is regionprops' X axis, which made the error message confusing at a first look.

Ah, that's an annoying one! Good catch! May have some downstream implications regarding getting measurements like x-coordinate etc. that would be wrong then. Would need to look into the measurement library though to see how general this is.

tcompa commented 2 years ago

With https://github.com/fractal-analytics-platform/fractal/commit/3af3362e441d44816002f299c55b608e96c44533, I moved the measurement task to a function, in tasks/measurement.py. By now tests can be run as in tasks/Sandbox_measurement/call_measurement_{2,3}D.py (this folder will be removed at some point).

A code review of measurement.py and/or some other tests would be useful, at this point, before trying to add this to fractal.

tcompa commented 2 years ago

I've fixed it by hand (locally), by adding an extra check of whether data are 2D or 3D.

Does this only occur if we load the image data and the label data differently? Or do both img & label_img need to be 2D if the z axis is only of size 1?

At the moment, I require the image and label arrays to have the same size in the Z dimension, and throw an Exception otherwise. As for the XY axes, their sizes can be different between label/image arrays (of course, because upscaling may be needed), but the upscale factor must be the same for X and Y.

EDIT To better answer your question: since the image/label arrays have the same number of dimensions, it doesn't matter which one we check.

jluethi commented 2 years ago

A code review of measurement.py and/or some other tests would be useful, at this point, before trying to add this to fractal.

Sounds good. I will run it myself with your example today, check whether the measurements make sense and do a code review.

Let's have the discussion here fractal-analytics-platform/fractal-tasks-core#88 on what tests we should add.

At the moment, I require the image and label arrays to have the same size in the Z dimension, and throw an Exception otherwise. As for the XY axes, their sizes can be different between label/image arrays (of course, because upscaling may be needed), but the upscale factor must be the same for X and Y.

Sounds good

To better answer your question: since the image/label arrays have the same number of dimensions, it doesn't matter which one we check.

Ok, then we need this additional logic for 2D :) Just wanted to check it wasn't a label vs img dimension mismatch.

tcompa commented 2 years ago

Minor issue: measurement.py appears as a very general name to me. Do you find it OK or can we set it to something slightly more specific?

jluethi commented 2 years ago

Hmm... I'd imagine this part of the code eventually generalizes to the napari workflow wrapper. Measurements is one type of output, other outputs may be label images. Because what the function actually does isn't measurement specific, it's more wrapping everything to pass the correct inputs to the napari workflow and then handles its output.

=> I'd say measurement.py is fine for the moment, and eventually it may be napari_workflows.py or something like that?

jluethi commented 2 years ago

Both the 2D & 3D pipeline ran through for me. The 3D is a bit slow, as you pointed out above, 2D runs very fast. I suspect that's just the measurement library not being optimized.

I did a simple sanity check of the measurements in 2D by checking some areas and confirming that all the large cells have large areas, medium cells have medium areas and small cells have small areas. See this example visualization:

Screenshot 2022-08-30 at 17 45 27

(also, tested a workflow to apply feature visualization on single wells in a jupyter notebook, works well after some kinks are worked out)

Same tests for the mean intensity measurements of DAPI, looks very reasonable: Screenshot 2022-08-30 at 17 49 39 Screenshot 2022-08-30 at 17 50 08

All using viridis colormap, yellow = high value: Screenshot 2022-08-30 at 17 49 29

The mitotic cells (the very small, elongated ones) have the highest mean intensities as we would expect. Plus, the DAPI bias in the image (cells on the right are brighter, I think because we're not using all the Z planes in the example) is clearly visible in the quantification.

Conclusion: The mapping of measurements to labels is correct. I think testing whether the exact correct value is measured is more of a concern of the actual measurement libraries. As far as I'm concerned, we can take the 2D measurements as a baseline that should not change.

One thing I would still improve (that lead to quite some issues in building the visualiztion workflow until I realized): We should save label values as int in AnnData.obs, not just as the index. Many workflows, e.g. the colormapping, expect int indices, but the index is a string. Once we generalize this to wrap any napari workflow, we'll have to think whether there will always be a 'label' column (some measurement libraries call it Label, maybe some library comes up with yet another name). If saving the label as int in obs isn't trivial, then let's skip this for the moment and just take note we'll eventually need to handle that.

I'll have a few more observations later on how we access channels, what goes into .zattrs. But they aren't specific to the measurement task, so shouldn't stop us from integrating this into Fractal.

jluethi commented 2 years ago

Small code review feedback. Looks good to me as an initial implementation (besides the label as str issue above, if we can still fix this, though maybe we need to discuss if there is actually a generic fix to that or if this is somewhat of a special case).

Here's a few places where we'll need to think of generalizations for fractal-analytics-platform/fractal-tasks-core#31

https://github.com/fractal-analytics-platform/fractal/blob/8ea75453c0b4f66f7d9d4d942fbf6c40c47e607e/fractal/tasks/measurement.py#L107-L110

2D & whole-well don't necessarily need to be an either or. One could process a whole (small) well in 3D at e.g. pyramid level 4.

https://github.com/fractal-analytics-platform/fractal/blob/8ea75453c0b4f66f7d9d4d942fbf6c40c47e607e/fractal/tasks/measurement.py#L113-L115

Once we generalize this, we should let the user input the ROIs that are relevant (e.g. FOV ROIs, Organoid ROI, Zebrafish ROI, maybe well ROI).

https://github.com/fractal-analytics-platform/fractal/blob/8ea75453c0b4f66f7d9d4d942fbf6c40c47e607e/fractal/tasks/measurement.py#L156

We'll need to define a generic way of how we define input/outputs for napari workflows, see ideas of a dictionary for it here https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/31. Alternatively, there are ways to read what inputs & outputs are from the workflow file (if it has been set up that way, see: https://github.com/haesleinhuepf/napari-workflows/issues/27), but not what that would match to (=> would be a way to check that we don't forget a required input and trigger a warning if we don't save an output that was generated).

tcompa commented 2 years ago

I moved the general comments to fractal-analytics-platform/fractal-tasks-core#31, and fixed the issue "2D & whole-well don't necessarily need to be an either or. One could process a whole (small) well in 3D at e.g. pyramid level 4."

What is missing is only the labels-as-int issue (I first need to verify how to do in AnnData).

jluethi commented 2 years ago

Great, thanks @tcompa

One way I've used AnnData for this in the past is to set anndata_obj.obs = df['label'].astype(int) Maybe that's a good way to go for the moment and, when we generalize the workflows in fractal-analytics-platform/fractal-tasks-core#31 , we evaluate whether this works well generally

tcompa commented 2 years ago

One way I've used AnnData for this in the past is to set anndata_obj.obs = df['label'].astype(int) Maybe that's a good way to go for the moment and, when we generalize the workflows in fractal-analytics-platform/fractal-tasks-core#31 , we evaluate whether this works well generally

I've added this change in 693ea207e0aaf926b06183ea48e153d4f675f2cb. This leads to this behavior:

>>> import anndata as ad
>>> table_path = "/data/active/fractal/tests/Temporary_data_UZH_1_well_2x2_sites/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/tables/nuclei"
>>> adata = ad.read_zarr(table_path)
/data/homes/fractal/.conda/envs/fractal/lib/python3.8/site-packages/anndata/_core/anndata.py:121: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
>>> adata.obs
       label
label       
1          1
2          2
3          3
4          4
5          5
...      ...
2910    2910
2911    2911
2912    2912
2913    2913
2914    2914

[2914 rows x 1 columns]
>>> adata.var
Empty DataFrame
Columns: []
Index: [area, bbox_area, convex_area, equivalent_diameter, max_intensity, mean_intensity, min_intensity, standard_deviation_intensity]
>>> adata
AnnData object with n_obs × n_vars = 2914 × 8
    obs: 'label'

@jluethi, does it match with your expectations?

Also notice the ImplicitModificationWarning, which is the reason why I had switched to int's in the first place. Related resource: https://github.com/scverse/anndata/issues/311

tcompa commented 2 years ago

After our chat with @jluethi leading to https://github.com/fractal-analytics-platform/fractal/commit/83b3dfc1d63c4b979dd0bc1556f7a94be3d028a2, this version of task is complete. I'm closing this one, and opening what should be a very simple issue at https://github.com/fractal-analytics-platform/fractal/issues/164