fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Introduce `import-ome-zarr` task #521

Closed tcompa closed 11 months ago

tcompa commented 12 months ago

This non-parallel task (or parallel at the plate level, provided we implement https://github.com/fractal-analytics-platform/fractal-server/issues/802) would take a single ome-zarr plate.zarr as an input.

Its goals (sorted by increasing complexity) are to:

  1. Generate the appropriate component lists (plate/image/well), and include them in its metadata_update output, which fractal-server then parses and places in Dataset.meta (soon to become something like Dataset.components - TBD). NOTE: This functionality requires browsing through groups according to the OME-NGFF structure, and constructing the lists based on the zarr attributes. Ideally we should never use glob or anything similar, but only proceed through zarr groups/subgroups/attributes.
  2. Generate a ROI table (with a single entry, analogous to the current well_ROI_table) for each ome-ngff image in plate.zarr, by combining the pixel sizes and array size into a physical-unit ROI. NOTE 1: This functionality needs to happen per-image (even if it belongs to a higher-level task). NOTE 2: The well_ROI_table name would not be fully appropriate, because it assumes the well=image correspondence (which is true for Fractal, but not in general). Maybe we can find a more generic name (image_ROI_table, full_ROI_table, main_ROI_table, ...).
  3. Generate a second ROI table, which splits the main ROI into several other ROIs. The default here could be to generate a grid of ROIs (possibly with a user-provided size, e.g. 7x9). The goal is to allow, e.g., 3D processing of smaller dataset regions in the cellpose task.
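A minimal sketch of goal 1, walking the OME-NGFF attribute hierarchy only (no glob). Plain dicts stand in for the attributes of each zarr group here; in the real task these would come from opening the groups with the zarr library. The `discover_components` name and the attribute tree below are illustrative assumptions, not existing fractal-tasks-core API:

```python
# Sketch: build plate/well/image component lists by browsing OME-NGFF
# attributes only. `attrs_of(path)` returns the attributes dict of the
# zarr group at `path` (in the real task: zarr group .attrs).

def discover_components(plate_name: str, attrs_of) -> dict:
    """Return {"plate": [...], "well": [...], "image": [...]} lists."""
    components = {"plate": [plate_name], "well": [], "image": []}
    plate_attrs = attrs_of(plate_name)
    # NGFF 0.4: plate attrs list wells as [{"path": "B/03", ...}, ...]
    for well in plate_attrs["plate"]["wells"]:
        well_path = f"{plate_name}/{well['path']}"
        components["well"].append(well_path)
        well_attrs = attrs_of(well_path)
        # NGFF 0.4: well attrs list images as [{"path": "0"}, ...]
        for image in well_attrs["well"]["images"]:
            components["image"].append(f"{well_path}/{image['path']}")
    return components


# Hypothetical attribute tree for a plate with one well and one image
_ATTRS = {
    "plate.zarr": {"plate": {"wells": [{"path": "B/03"}]}},
    "plate.zarr/B/03": {"well": {"images": [{"path": "0"}]}},
}

components = discover_components("plate.zarr", _ATTRS.__getitem__)
# components["image"] == ["plate.zarr/B/03/0"]
```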

Side-note: we currently have a fixture like:

import pytest

@pytest.fixture(scope="session")
def zenodo_zarr_metadata(testdata_path):
    metadata_3D = {
        "plate": ["plate.zarr"],
        "well": ["plate.zarr/B/03"],
        "image": ["plate.zarr/B/03/0/"],
        "num_levels": 6,
        "coarsening_xy": 2,
        "original_paths": [str(testdata_path / "10_5281_zenodo_7059515/")],
        "image_extension": "png",
    }

    metadata_2D = ...

for our tests. Together with #212, the current issue should aim at making it fully obsolete, since the relevant information would be gathered elsewhere.

jluethi commented 12 months ago

Great summary of the new copy task!

or parallel at the plate level

True, and running at the plate level is sensible. But we couldn't use the plate level in the current server implementation (because it would look for the plate key in metadata/components). Thus, this task may run without a parallelization level and e.g. take the name of the Zarr file (which is in the input_paths folder) as an argument.
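Concretely, the signature could look something like this (a hypothetical stub following the usual fractal-tasks-core task conventions; the parameter names, in particular `zarr_name`, are assumptions):

```python
from typing import Any, Sequence


def import_ome_zarr(
    *,
    input_paths: Sequence[str],  # folder(s) containing the Zarr file
    output_path: str,
    metadata: dict[str, Any],
    zarr_name: str = "plate.zarr",  # name of the Zarr file to import
) -> dict[str, Any]:
    """Non-parallel task: no `component` argument, no parallelization
    level; the Zarr name is passed explicitly instead."""
    # Stub only: the real task would browse the Zarr and return a
    # metadata_update dict with the plate/well/image component lists.
    return {}
```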

NOTE 1: This functionality needs to happen per-image (even if it belongs to a higher-level task).

True. But I think everything can be done based on zarr metadata (both the pixel sizes and the dimensions in pixels are contained in the metadata). Thus, I would say a task that runs on the plate level and just loops over all images is fine.
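A sketch of what this per-image computation could look like, deriving the whole-image ROI from the level-0 array shape and the pixel sizes (in NGFF, the "scale" coordinateTransformation), plus the grid split from goal 3. Function names and the ROI dict layout are illustrative assumptions:

```python
# Sketch: build a single whole-image ROI purely from metadata, then
# split it into a grid of sub-ROIs. Units are assumed micrometers.

def image_roi(shape_zyx, pixel_sizes_zyx):
    """Whole-image ROI in physical units, anchored at the origin."""
    len_z, len_y, len_x = (
        size * pixel for size, pixel in zip(shape_zyx, pixel_sizes_zyx)
    )
    return {
        "x_micrometer": 0.0, "y_micrometer": 0.0, "z_micrometer": 0.0,
        "len_x_micrometer": len_x, "len_y_micrometer": len_y,
        "len_z_micrometer": len_z,
    }


def grid_rois(roi, rows, cols):
    """Split a ROI into rows x cols sub-ROIs (full z extent each)."""
    dy = roi["len_y_micrometer"] / rows
    dx = roi["len_x_micrometer"] / cols
    rois = []
    for i in range(rows):
        for j in range(cols):
            sub = dict(roi)
            sub.update(
                y_micrometer=roi["y_micrometer"] + i * dy,
                x_micrometer=roi["x_micrometer"] + j * dx,
                len_y_micrometer=dy, len_x_micrometer=dx,
            )
            rois.append(sub)
    return rois


# Hypothetical image: 10 z-planes of 2160x2560 px, 1 x 0.325 x 0.325 um
full = image_roi((10, 2160, 2560), (1.0, 0.325, 0.325))
grid = grid_rois(full, rows=7, cols=9)  # 63 sub-ROIs
```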

NOTE 2: The well_ROI_table name would not be fully appropriate, because it assumes the well=image correspondence (which is true for Fractal, but not in general). Maybe we can find a more generic name (image_ROI_table, full_ROI_table, main_ROI_table, ...).

Agreed! Let's go with image_ROI_table.

tcompa commented 11 months ago

When acting on an existing OME-Zarr, is the import-ome-zarr task expected to modify it? That is, is it OK to add ROI tables to an existing OME-Zarr?

jluethi commented 11 months ago

Is it OK to add ROI tables to an existing OME-Zarr?

I would say yes, at least for that first implementation. ROI tables are core to the further processing in Fractal, so there needs to be a good way to add them.

Should this be optional? Maybe, but it should default to on => create OME-Zarr ROI tables by default.


Thinking through some scenarios: in which cases would this not work? If there is a read-only OME-Zarr somewhere. But then: where would the Fractal workflow save any output, given that it needs to go into the OME-Zarr? => There may be workflows that require some type of Copy OME-Zarr (or a subset of it) before the data is on storage that Fractal can write to.

In that case: import the OME-Zarr without writing ROI tables, then copy the OME-Zarr (a subset?), then add ROI tables (e.g. by running import OME-Zarr again, this time on the newly created copy). Or should the copy task gain that functionality?


Are there other reasons to modify an OME-Zarr during import?

  1. Modifying the image data => that should be the job of tasks afterwards
  2. Modifying the label data => that should be the job of tasks afterwards
  3. Modifying metadata: Maybe. But we could eventually do e.g. a Pydantic NGFF model check on the OME-Zarr we're importing. And even further down the road, maybe there's a set of things where Fractal pushes beyond the current spec => import would add that relevant metadata if it's required.

Modifying metadata may actually be something we'd want to allow, e.g. to add the wavelength_id parameter to the Omero channel metadata. Not sure whether we'll work well with OME-Zarrs without it.
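A Pydantic NGFF model check as mentioned above could eventually enforce exactly this kind of requirement; here is a dependency-free sketch of the idea (the attribute layout follows the Omero channel metadata, the `check_omero_channels` name is made up, and a Pydantic model would raise a validation error instead of returning a list):

```python
# Sketch: a validation pass over the omero channel metadata during
# import, flagging channels that lack wavelength_id (which Fractal
# tasks rely on, beyond the base NGFF spec).

def check_omero_channels(image_attrs: dict) -> list[str]:
    """Return a list of human-readable problems (empty = OK)."""
    problems = []
    channels = image_attrs.get("omero", {}).get("channels", [])
    if not channels:
        problems.append("no omero channel metadata found")
    for index, channel in enumerate(channels):
        if "wavelength_id" not in channel:
            problems.append(f"channel {index} has no wavelength_id")
    return problems


# A channel list where only the first entry carries wavelength_id
attrs = {"omero": {"channels": [
    {"label": "DAPI", "wavelength_id": "A01_C01"},
    {"label": "GFP"},
]}}
issues = check_omero_channels(attrs)
# issues == ["channel 1 has no wavelength_id"]
```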


My conclusion:

  1. The "Create OME-Zarr ROI table" functionality should probably be a library function that could be used in different places
  2. The Import OME-Zarr task should create OME-Zarr ROI tables by default
  3. There may be additional metadata modifications that could become necessary in the future.