fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Update feature tables as per table specifications, in napari-workflows-wrapper task #593

Closed tcompa closed 9 months ago

tcompa commented 11 months ago

When producing feature tables from the napari-workflows-wrapper task, we are currently not complying with our own feature-table specifications, which are being defined as part of #582.

We currently write a measurement table into a Zarr group as:

        write_table(
            image_group,
            table_name,
            measurement_table,
            overwrite=overwrite,
            logger=logger,
        )

without including any table_attrs. We should start to also use the attributes that were proposed in https://github.com/ome/ngff/pull/64 (e.g. type, region, instance_key).
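A minimal sketch of what that could look like. The attribute values follow the spec draft; the `table_attrs` argument name is an assumption for illustration, so the call itself is left commented out:

```python
# Attributes proposed in ome/ngff#64, assembled as a plain dict.
table_attrs = {
    "type": "feature_table",
    "region": {"path": "../labels/label_DAPI"},  # label image the features refer to
    "instance_key": "label",
}
# Assumed future call (argument name `table_attrs` is an assumption):
# write_table(
#     image_group,
#     table_name,
#     measurement_table,
#     overwrite=overwrite,
#     logger=logger,
#     table_attrs=table_attrs,
# )
```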

tcompa commented 10 months ago

I realized that it's not so easy to comply with our table specs V1 for feature tables created with the napari-workflows task.

In the specs, we require attributes of this kind:

    "type": "feature_table",
    "region": { "path": "../labels/label_DAPI" },
    "instance_key": "label",

where region/path points to the label image where measurements are being performed. This information is not directly available in the napari-workflows-wrapper task, because it's defined as part of the workflow file.

A concrete example

As an example we may have this workflow

!!python/object:napari_workflows._workflow.Workflow
_tasks:
  regionprops_DAPI: !!python/tuple
  - !!python/name:napari_skimage_regionprops._regionprops.regionprops_table ''
  - dapi_img
  - dapi_label_img
  - true
  - true
  - false
  - false
  - false
  - false

combined with the following arguments of the fractal task

    # Prepare parameters for second napari-workflows task (measurement)
    workflow_file = str(testdata_path / "napari_workflows/wf_4.yaml")
    input_specs = {
        "dapi_img": {"type": "image", "channel": {"wavelength_id": "A01_C01"}},  # type: ignore # noqa
        "dapi_label_img": {"type": "label", "label_name": "label_DAPI"},  # type: ignore # noqa
    }
    output_specs = {
        "regionprops_DAPI": {  # type: ignore # noqa
            "type": "dataframe",
            "table_name": "regionprops_DAPI",
        },
    }

Extract expected value of region["path"] by direct (AKA human) inspection

By looking at these two snippets, and with some prior knowledge of the context, we know that the relevant workflow input is dapi_label_img, while dapi_img is the intensity image. We can then (by hand) connect the workflow output regionprops_DAPI to the workflow input dapi_label_img. Finally, we can use the input_specs attribute and learn that the correct value for region["path"] is "../labels/label_DAPI".
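Written out as code, the chain looks as follows; the hard-coded link between regionprops_DAPI and dapi_label_img is exactly the step that currently requires human knowledge:

```python
input_specs = {
    "dapi_img": {"type": "image", "channel": {"wavelength_id": "A01_C01"}},
    "dapi_label_img": {"type": "label", "label_name": "label_DAPI"},
}

# This is the "by hand" step: nothing in the task arguments says that
# the regionprops_DAPI output is computed on dapi_label_img.
relevant_input = "dapi_label_img"

label_name = input_specs[relevant_input]["label_name"]
region_path = f"../labels/{label_name}"
# region_path == "../labels/label_DAPI"
```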

Automatic extraction of region["path"]

We may want to try to mimic the direct inspection in an automated way (even if we may have to handle some complex edge cases), but there is one step where I don't know how to proceed, namely this one:

we know that the relevant workflow input is dapi_label_img

How a task could infer this information is unclear to me.

Options

cc @jluethi

(A) We add one more argument

The only reliable way I can see is to ask for the value of region["path"] as part of the output_specs, when type="dataframe". The new task arguments would then be

    # Prepare parameters for second napari-workflows task (measurement)
    workflow_file = str(testdata_path / "napari_workflows/wf_4.yaml")
    input_specs = {
        "dapi_img": {"type": "image", "channel": {"wavelength_id": "A01_C01"}},  # type: ignore # noqa
        "dapi_label_img": {"type": "label", "label_name": "label_DAPI"},  # type: ignore # noqa
    }
    output_specs = {
        "regionprops_DAPI": {  # type: ignore # noqa
            "type": "dataframe",
            "table_name": "regionprops_DAPI",
            "region_path": "label_DAPI",    # <-------------- new attribute
        },
    }

and it would be up to the user to provide the correct value.

(B) We don't comply with specs

The other option is that this task does not comply with our specs, at least for the moment. This is a bit unfortunate, because it's our only task that generates feature tables, and then the feature-table specs are not really relevant and we could likely drop them.

(C) We relax the specs

We can also modify the specs so that region is not required; the task would then comply with them. The downside is that the new feature tables would lack the information required to link back to another table (the one with labels), so they would end up being little more than a standard table (e.g. they'd just have an additional instance_key="label" attribute and an additional obs attribute listing all labels).

(D) Better ideas?

jluethi commented 10 months ago

This gets even more complicated for the napari workflow case, because the workflow itself may be creating the label image that measurements are made on.

To be pragmatic, I'd say we go with Option A for the moment and include help text for region_path: "name of the label image that the feature measurements are based on"

That just gets added to the metadata, so our example workflows would create example OME-Zarrs that are compliant with our spec.


Just for the record, some inference would be possible:

we know that the relevant workflow input is dapi_label_img

Our inputs & outputs have types, so we could try some inference on "type": "label" in the input & output specs. But there is the risk that e.g. the workflow loads a label image, modifies it and then makes measurements. Or that a label image is created by the workflow, but not stored as an output (I'd not recommend either, but it's possible in this flexible workflow setup).

There are some fancier scenarios we could consider: we can optionally ask the user to add a region_path to the output specs and otherwise try inference based on "type": "label" in input & output spec (with the priority order being: 1) user-specified, 2) output spec, 3) input_spec). I don't think it's currently worth adding that complexity.
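A sketch of that optional-inference idea; the function name and spec layout are illustrative only, and this was not implemented:

```python
def infer_region_label(df_output_spec, output_specs, input_specs):
    """Hypothetical inference of the label name behind region["path"].

    Priority order, as discussed: 1) user-specified region_path in the
    dataframe output spec, 2) a label in the output specs, 3) a label
    in the input specs. Returns None if nothing can be inferred.
    """
    if "region_path" in df_output_spec:
        return df_output_spec["region_path"]
    for spec in output_specs.values():
        if spec.get("type") == "label":
            return spec["label_name"]
    for spec in input_specs.values():
        if spec.get("type") == "label":
            return spec["label_name"]
    return None
```

With the example specs from above, only rule 3 would fire (via the dapi_label_img input); the caveats about workflow-internal or modified label images still apply.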

For me, the main question is how much time we want to invest into our napari workflow wrapper. Given some uncertainty over the stability and continued investment into napari workflows, I'd limit it for the time being => just go with Option A


then the feature-table specs are not really relevant and we could likely drop them.

I still think the spec is an overall good idea. And it should get used by the scmultiplex measurements, which are much more straightforward: Based on a label image, create measurements and save them to a table. Those are what I currently use as a default measurement task and we can make them spec compliant.

Plus, this spec will enable interesting downstream functionality in napari plugins to automatically associate measurements to the correct label layer, which would be very useful.

tcompa commented 10 months ago

To be pragmatic, I'd say we go with Option A for the moment and include help text for the region_path: "name of the label image that the feature measurements are based on"

Any preference between the following two options?

  1. introducing a new region_path attribute, with the role we just described. This has the advantage of offering flexibility, rather than always enforcing the structure "../labels/{label_name}".
  2. re-using the label_name attribute, which already exists both in NapariWorkflowsInput and NapariWorkflowsOutput

I'd rather go with 2, because it offers an intuitive way of setting that parameter in the most common cases:

Examples (within option 2):

# Label already exists
input_specs = {
    "dapi_img": {
        "type": "image",
        "channel": {
            "wavelength_id": "A01_C01"
        }
    },
    "dapi_label_img": {
        "type": "label",
        "label_name": "label_DAPI"
    },
}
output_specs = {
    "regionprops_DAPI": {
        "type": "dataframe",
        "table_name": "regionprops_DAPI",
        "label_name": "label_DAPI",
    },
}

# Label is computed within the same workflow (warning: I did not test this)
input_specs = {
    "input": {
        "type": "image",
        "channel": {
            "wavelength_id": "A01_C01"
        }
    }
}
output_specs = {
    "Result of Expand labels (scikit-image, nsbatwm)": {
        "type": "label",
        "label_name": "label_DAPI",
    },
    "regionprops_DAPI": {
        "type": "dataframe",
        "table_name": "regionprops_DAPI",
        "label_name": "label_DAPI",
    },
}
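With option 2, the task could derive the spec-compliant attributes directly from the dataframe output spec. A minimal sketch (the attribute layout comes from the specs quoted above; the helper function itself is hypothetical):

```python
def attrs_from_output_spec(output_spec: dict) -> dict:
    """Hypothetical helper: turn a dataframe output spec (option 2)
    into the table attributes required by the feature-table specs."""
    return {
        "type": "feature_table",
        "region": {"path": f"../labels/{output_spec['label_name']}"},
        "instance_key": "label",
    }

attrs = attrs_from_output_spec(
    {
        "type": "dataframe",
        "table_name": "regionprops_DAPI",
        "label_name": "label_DAPI",
    }
)
# attrs["region"]["path"] == "../labels/label_DAPI"
```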
tcompa commented 10 months ago

Option 2 is currently implemented through https://github.com/fractal-analytics-platform/fractal-tasks-core/commit/7043d64a9c0bce799ddf237e4bb42828acd866ce. The relevant bit of the task is shown in a screenshot attached to the original comment.

jluethi commented 10 months ago

label_name sounds good to me :)