fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Introduce import-ome-zarr task #557

Closed: tcompa closed this 10 months ago

tcompa commented 10 months ago

close #521 close #558 close #559 close #564

Checklist before merging

github-actions[bot] commented 10 months ago

Coverage report

The coverage rate went from 90.23% to 90.34% (↑). The branch rate is 83%.

95.86% of new lines are covered.

Diff Coverage details

- `fractal_tasks_core/lib_ngff.py`: 100% of new lines are covered (94.66% of the complete file).
- `fractal_tasks_core/lib_pyramid_creation.py`: 100% of new lines are covered (100% of the complete file).
- `fractal_tasks_core/lib_regions_of_interest.py`: 100% of new lines are covered (99.23% of the complete file).
- `fractal_tasks_core/tasks/import_ome_zarr.py`: 92.53% of new lines are covered (86.31% of the complete file). Missing lines: 64, 66, 150, 222, 224.
- `fractal_tasks_core/tasks/maximum_intensity_projection.py`: 100% of new lines are covered (89.28% of the complete file).

tcompa commented 10 months ago

Here are a few points for the upcoming discussion.

I/O structure

We should take a decision on input_paths. Given the current I/O structure, the most consistent option would be to have

    input_paths = ["/some/parent/folder"]
    zarr_name = "plate.zarr"  (or "plate.zarr/B/03/0", or "image.zarr")

PRO: This is immediately compatible with the other tasks (i.e. we only need to define a single Fractal dataset, and keep using it both as input and output for job execution).

CON: Cumbersome (but this is not strictly related to this task).
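For concreteness, a minimal sketch of how these two values could be combined inside the task (zarr_name is the tentative parameter discussed below, not an existing argument):

    # Hypothetical sketch of how the proposed parameters would compose
    from pathlib import Path

    input_paths = ["/some/parent/folder"]
    zarr_name = "plate.zarr"  # or "plate.zarr/B/03/0", or "image.zarr"

    # The task would resolve the actual OME-Zarr location as
    zarr_path = Path(input_paths[0]) / zarr_name
    # i.e. /some/parent/folder/plate.zarr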

Parameters

The current list of parameters is

    input_paths: Sequence[str],
    output_path: str,    # NOT USED
    metadata: dict[str, Any],  # NOT USED
    scope: Literal["plate", "image"] = "plate",
    add_image_ROI_table: bool = True,
    add_grid_ROI_table: bool = True,
    grid_ROI_shape: Optional[tuple[int, ...]] = None,

with the potential addition of zarr_name (or something like that). Is this OK, or do we want to simplify it?

Side note: importing an external OME-Zarr will likely always come with some friction in the more "complex" cases (see e.g. the discussion in #558). It's not surprising if the task has "many" parameters, provided that they have reasonable defaults that work for the "non-complex" cases.

Type of grid shape

The type of grid_ROI_shape affects how it's displayed/validated in fractal-web. Some options:

  1. If we use a list/tuple/array, this shows as an array in the web client. It's not great, because it suggests that you can modify its length, but it has the nice advantage that such an error (trying to provide an array of length 3, for instance) will fail immediately, at the form-validation stage. Note that this would be fixed by integrating more JSON Schema features in fractal-web (ref https://github.com/fractal-analytics-platform/fractal-web/issues/183).
  2. We can use a string with a specific format, like 3x2. This is much nicer to see in the web form, but there is no immediate way of validating it:
    • We can perform validation within the task, but this postpones it to a much later stage (i.e. a failed job)
    • We can rely on some slightly more complex typing features (e.g. the use of Field, or Annotated), which adds a bit of overhead; see the sketch right after this list.
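As an illustration of this last option, here is a rough sketch (not the PR code) of how a 3x2-style string could be validated with Pydantic, so that the constraint also ends up in the exported JSON Schema. The model and field names are only indicative, and pattern= is Pydantic-v2 syntax (v1 uses regex=):

    # Rough sketch, not the actual task code: validate a "3x2"-style string.
    from typing import Annotated, Optional

    from pydantic import BaseModel, Field

    # The pattern is exported to the JSON Schema, so in principle fractal-web
    # could also reject malformed values at the form-validation stage.
    GridShapeStr = Annotated[str, Field(pattern=r"^[1-9][0-9]*x[1-9][0-9]*$")]

    class ImportOmeZarrArguments(BaseModel):
        grid_ROI_shape: Optional[GridShapeStr] = None

    ImportOmeZarrArguments(grid_ROI_shape="3x2")       # passes validation
    # ImportOmeZarrArguments(grid_ROI_shape="3 by 2")  # would raise a ValidationError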

Scope parameter

"scope" is a provisional name, and other suggestions are welcome.

My idea is that we support plates by default, but someone may want to try some Fractal tasks (e.g. cellpose) on a single-image OME-Zarr, instead of preparing a full plate.

To discuss:

tcompa commented 10 months ago

I'd say this is mostly ready for a review.

A brief recap of some choices:

  1. The MIP task was simplified a lot, as discussed in #558
  2. For the issue with the tuple task parameter (ref #564), I opted for the simplest version (two integer arguments)
  3. The NGFF type (plate/well/image) is detected automatically; a rough sketch of the detection logic is shown right after this list. If it differs from "plate", we print a warning about partial support.
  4. Advanced uses (e.g. providing a pre-made ROI table) can be added in the future.
  5. The internal ROI-generating functions are not fully flexible (as in "ZYX vs YX"), as described here: https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/150#issuecomment-1759342654.
  6. A preliminary test for importing BIA data does work, but is currently skipped - see https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/565.
  7. Right now I have overwrite=False when writing the new tables. In principle we can also expose it as an overwrite task parameter.
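As a side note on point 3, the detection essentially boils down to inspecting the Zarr group attributes. A minimal sketch (not the actual task implementation; the attribute keys follow the NGFF 0.4 spec):

    # Minimal sketch, not the actual task code: guess whether an OME-Zarr
    # group is a plate, a well or an image from its top-level attributes.
    import zarr

    def detect_ngff_type(zarr_url: str) -> str:
        attrs = zarr.open_group(zarr_url, mode="r").attrs.asdict()
        if "plate" in attrs:
            return "plate"
        if "well" in attrs:
            return "well"
        if "multiscales" in attrs:
            return "image"
        raise ValueError(f"Cannot determine the NGFF type of {zarr_url}")
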
tcompa commented 10 months ago

MIP task was simplified a lot

This is already tested in the CI, but I also tested it in fractal-containers (as part of the new PR which includes both examples 01 and 02 from fractal-demos). More specifically, I used these branches:

    FRACTAL_SERVER_GIT=674-remove-automatic-creation-of-default-dataset-when-creating-a-project
    FRACTAL_CLIENT_GIT=565-align-with-fractal-server-new-feature-project-creation-does-not-create-a-dataset
    FRACTAL_TASKS_CORE_GIT=521-introduce-import-ome-zarr-task
    FRACTAL_WEB_GIT=fix-discard-changes-glitch
    FRACTAL_DEMOS_GIT=67-align-with-new-server-feature-project-creation-not-creating-a-dataset-by-default

and both examples 01 and 02 went through.

tcompa commented 10 months ago

For my understanding: The actual processing in the MIP task did not change, right? Just a bunch of initial validation and checks on indices? But it's still run the same as before, relying on dask for the memory management? Looks much simpler now :)

I was also quite surprised to learn that the ROI-related block looks useless (it seems to just mirror a check that made sense for the illumination-correction task only).

I now realize that it's a bit more subtle: we were using all this FOV-related logic to extract the image size (and that's why we needed to verify that all images have the same size - otherwise it's meaningless). This image size was then used as the chunk size:

    # old code
    # (img_size was extracted via the FOV-ROI-related logic mentioned above)
    chunk_size_y, chunk_size_x = img_size[:]
    chunksize = (1, 1, chunk_size_y, chunk_size_x)
    ...
    build_pyramid(
        zarrurl=zarrurl_new,
        overwrite=overwrite,
        num_levels=num_levels,
        coarsening_xy=coarsening_xy,
        chunksize=chunksize,
    )

The idea was to always enforce the assumption that each zarr chunk has the same size as a single image, which could be needed if we ever switch to parallel processing of multiple ROIs in other tasks (e.g. illumination correction).

I'm still in favor of simplifying the MIP task as in the current PR, with one minor addition and a scope comment:

  1. If the chunk=FOV assumption is true for the 3D array, we can make it true for the 2D one by simply re-using the same YX chunk size. In fact, I think this is a better choice than the current one in the PR, where we are leaving dask some freedom on chunk sizes (I'm confident there is a deterministic choice under the hood, and possibly it's exactly the one we would enforce, but it's currently not transparent when one reads the task code).
  2. I think that checking/modifying chunksize of an existing zarr is out of scope for import-ome-zarr. If we want to enforce strict rules on chunksizes, those should be explicitly verified at the beginning of each task.
jluethi commented 10 months ago

If the chunk=FOV assumption is true for the 3D array, we can make it true for the 2D one by simply re-using the same YX chunk size. In fact, I think this is a better choice than the current one in the PR, where we are leaving dask some freedom on chunk sizes (I'm confident there is a deterministic choice under the hood, and possibly it's exactly the one we would enforce, but it's currently not transparent when one reads the task code).

Sounds good to me. We currently don't expose chunking to the user. I think staying with the same chunking as the input data is very reasonable. Aren't we already achieving this with the chunksize=accumulated_array.chunksize call? If not, I'd be in favor of specifying this. Once Zarr v3 & sharding come into play, we'll want to explore chunking and its trade-offs more :)

I think that checking/modifying chunksize of an existing zarr is out of scope for import-ome-zarr. If we want to enforce strict rules on chunksizes, those should be explicitly verified at the beginning of each task.

Agreed. Our tasks should work independently of the chunking approach (especially in light of Zarr v3 & sharding). If there are optimal ways of using them, we should test the performance trade-offs and potentially emit warnings if there are issues. But I'd want our tasks to still run even if the chunking is suboptimal.

tcompa commented 10 months ago

Aren't we already achieving this with the chunksize=accumulated_array.chunksize call? If not, I'd be in favor of specifying this.

It depends on whether operations in

    mip_yx = da.stack([da.max(data_czyx[ind_ch], axis=0)], axis=0)
    accumulate_chl.append(mip_yx)
    accumulated_array = da.stack(accumulate_chl, axis=0)

are all chunksize-preserving.
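A quick way to check this, rather than guessing, is to inspect the chunk sizes of a small throwaway array (the shapes below are arbitrary and not taken from the task):

    # Throwaway check, not task code: does the per-channel max projection
    # preserve the YX chunking of the input array?
    import dask.array as da

    data_zyx = da.zeros((10, 4320, 5120), chunks=(10, 2160, 2560))
    mip_yx = da.stack([da.max(data_zyx, axis=0)], axis=0)
    print(data_zyx.chunksize)  # (10, 2160, 2560)
    print(mip_yx.chunksize)    # the YX entries should match the input's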

I guess so (and a quick single-channel test had already shown that accumulated_array satisfies the chunk=FOV assumption for the YX axes), but having lines like

    chunksize_y = data_czyx.chunksize[-2]
    chunksize_x = data_czyx.chunksize[-1]
    ...
    chunksize=(1, 1, chunksize_y, chunksize_x)

makes it considerably more transparent, without adding too much clutter.

Another benefit of the new version: with the chunksize=accumulated_array.chunksize choice, I cannot easily guess what chunk size will be chosen along the channel axis, which is the outcome of a da.stack operation.
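One way to remove the guesswork, again, is to inspect a toy example. A rough sketch (shapes are arbitrary; this only shows where to look, it does not assert what dask will pick):

    # Throwaway check, not task code: what chunking does da.stack choose along
    # the new (channel) axis when stacking several per-channel projections?
    import dask.array as da

    accumulate_chl = [
        da.zeros((1, 2160, 2560), chunks=(1, 2160, 2560)) for _ in range(3)
    ]
    accumulated_array = da.stack(accumulate_chl, axis=0)
    print(accumulated_array.shape)      # (3, 1, 2160, 2560)
    print(accumulated_array.chunksize)  # first entry = chunk size along the channel axis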