Open tcompa opened 1 year ago
Very good overview @tcompa
> The perspective is that we will handle arrays with mixed dimensions, which can be up to 5D (TCZYX) but may also lack some of the intermediate axes (like TCYX)
Actually, arrays can be n-dimensional. We always expect YX to be there. Anything else is optional. There will often be Z (though not always, we'll need to make the 2D only case work as well, see https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/124). There often will be multiple channels (those typically can just be looped over) and there may be time information (sometimes to be looped over, i.e. process timepoint by timepoint, e.g. for segmentation. Some other times we'll need to process whole time series at once, e.g. to do tracking). And users may come up with extra dimensions at some point. We don't need to support processing those as long as we don't have clear use cases for them, but in an optimal case, we should fail when we get such OME-Zarr files / it should be easy to adapt a task to them.
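The "loop over optional leading axes" idea could be sketched roughly like this (a minimal numpy illustration, not the fractal-tasks-core API; `iter_yx_planes` is a hypothetical helper), where YX are always the trailing axes and anything in front of them is looped over:

```python
# Hypothetical sketch (not an existing fractal-tasks-core function):
# iterate over the optional leading axes (t, c, z, ...) of an array whose
# trailing axes are always (y, x).
import numpy as np

def iter_yx_planes(array, axis_names):
    """Yield (index, plane) pairs for each YX plane, looping over any
    leading axes (e.g. t, c, z) that happen to be present."""
    assert axis_names[-2:] == ["y", "x"], "y and x must be the last two axes"
    n_leading = array.ndim - 2
    for index in np.ndindex(*array.shape[:n_leading]):
        yield dict(zip(axis_names[:n_leading], index)), array[index]

# A 4D CZYX array and a plain 2D YX array are handled uniformly:
czyx = np.zeros((2, 3, 4, 5))
for index, plane in iter_yx_planes(czyx, ["c", "z", "y", "x"]):
    assert plane.shape == (4, 5)

yx = np.zeros((4, 5))
assert len(list(iter_yx_planes(yx, ["y", "x"]))) == 1
```

For the tracking-style use case mentioned above, one would instead keep the T axis intact and only loop over the remaining leading axes, but the same "named trailing axes" convention applies.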
> - Have some custom handling of the dimensionality in the zarr-creation tasks.
That seems good to me. We can be somewhat conservative in adding dimensions. Let's make sure 2D only (https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/124) and time data (https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/169) can be parsed, but hold off on more complex logic.
> `create_zarr_structure` and `yokogawa_to_zarr` would include more logic
=> Sounds good to me. Let's add complexity where needed for the two issues above. I'll work on small test sets. The 2D is ready, the time one I will need to look into.
> Consistently use named axes in all other tasks.
That seems like a very good approach to stay stable when users start introducing dimensions beyond the specific ones we currently support.
> Make sure that the relevant functions/tasks are capable of handling arrays of different shapes
Let's: a) find a good way to define what input a task can handle, e.g. in its docstring, and b) make sure the tasks then actually run on the different shapes they are supposed to support, with tests that explicitly load them.
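Point (a) could look something like this hedged sketch (names like `check_axes` and `SUPPORTED_LAYOUTS` are purely illustrative, not an existing API): each task declares the axis layouts it supports and fails fast on anything else, which also covers the "fail cleanly on exotic dimensions" case mentioned above:

```python
# Hypothetical sketch: a task declares which axis layouts it supports and
# rejects anything else up front, instead of failing mid-processing.
SUPPORTED_LAYOUTS = {("z", "y", "x"), ("c", "z", "y", "x")}

def check_axes(axis_names):
    """Raise a clear error if this task does not support the given layout."""
    layout = tuple(axis_names)
    if layout not in SUPPORTED_LAYOUTS:
        raise ValueError(
            f"Task supports {sorted(SUPPORTED_LAYOUTS)}, got {layout}"
        )

check_axes(["c", "z", "y", "x"])  # OK
try:
    check_axes(["t", "c", "z", "y", "x"])
except ValueError:
    pass  # 5D input is rejected explicitly instead of failing later
```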
> It could be a bit trickier with dask arrays
Good point. But our current approach should scale for quite a while, I hope. Let's re-assess this if it becomes necessary.
A lot of discussion is ongoing in:
Adding to this issue: work in https://github.com/fractal-analytics-platform/fractal-tasks-core/pull/557/files introduces the functions `get_single_image_ROI` and `get_image_grid_ROIs`, which (in their current versions) require a set of ZYX pixel sizes. These are obtained through the `NgffImageMeta.pixel_sizes_zyx` property, which sets the Z pixel size to 1 if the corresponding axis is missing; for this reason the import-ome-zarr task remains flexible.
In the future, these new functions will also need to be made more flexible (that is, they should not always require the Z pixel size).
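A minimal sketch of the defaulting behaviour described above (simplified on purpose: the real `NgffImageMeta.pixel_sizes_zyx` property parses the NGFF multiscales metadata, while here axis names and scales are passed in directly as plain lists):

```python
# Simplified sketch of the "default Z pixel size to 1" behaviour: read the
# per-axis scale and fall back to 1.0 for Z when the dataset has no z axis.
def pixel_sizes_zyx(axes, scale):
    """axes: list of axis names; scale: matching scale transformation."""
    sizes = dict(zip(axes, scale))
    return [sizes.get("z", 1.0), sizes["y"], sizes["x"]]

# A 2D (CYX) image still yields a usable ZYX triple:
assert pixel_sizes_zyx(["c", "y", "x"], [1.0, 0.325, 0.325]) == [1.0, 0.325, 0.325]
# A 4D (CZYX) image uses the actual Z scale:
assert pixel_sizes_zyx(["c", "z", "y", "x"], [1.0, 2.0, 0.325, 0.325]) == [2.0, 0.325, 0.325]
```

This is what keeps downstream ROI code working on 2D data for now, at the cost of a dummy Z entry; making the ROI functions axis-aware would remove the need for that fallback.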
At the moment all our image arrays are 4D (CZYX) and each one of our label arrays is 3D (ZYX). This property is visible in the `.zarray` files, and in the folder structure. When the dimension along Z is dummy (a single Z plane), we still use the 4D/3D structure, with shape like `(num_channels, 1, num_y, num_x)` or `(1, num_y, num_x)`. Also ROIs are defined in the same way: they are always 3D shapes (defined by 6 numbers), and in some cases the Z part is dummy (starting at 0 and ending at `pixel_size_z`, corresponding to a single pixel).

The perspective is that we will handle arrays with mixed dimensions, which can be up to 5D (TCZYX) but may also lack some of the intermediate axes (like TCYX), see https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/149#issuecomment-1289379988:
Broadly speaking, a possible (preliminary!) plan to support this general case would be to:

1. Have some custom handling of the dimensionality in the zarr-creation tasks.
2. Consistently use named axes in all other tasks.
3. Make sure that the relevant functions/tasks are capable of handling arrays of different shapes.
Re: point 1
This means that `create_zarr_structure` and `yokogawa_to_zarr` would include more logic, to choose the right structure of the target zarr array. This may include something like explicit user-provided parameters on the structure one should expect, or inference from the metadata if that's sufficiently robust. As always, the simplest is to have a couple of small test folders with different cases (e.g. CZYX, TCZYX, TCYX, and YX?).

Re: point 2
This may be a bit complex, but the nice advantage is that we would be moving even closer to the OME-NGFF specs. Note that sometimes we already have to specify named axes in the OME-NGFF metadata, e.g. in https://github.com/fractal-analytics-platform/fractal-tasks-core/blob/f85f88032701f06df3ee7ac3ddcf6941540a005f/fractal_tasks_core/napari_workflows_wrapper.py#L204-L215
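For reference, the axes metadata in question follows the NGFF 0.4 conventions (t has type "time", c has type "channel", z/y/x have type "space"). A small sketch of building that structure for an arbitrary set of named axes (`build_axes_metadata` is a hypothetical helper, not the code in the linked file):

```python
# Hypothetical sketch: build the OME-NGFF "axes" metadata entries for a
# given list of axis names, following the NGFF 0.4 axis types.
AXIS_TYPES = {"t": "time", "c": "channel", "z": "space", "y": "space", "x": "space"}

def build_axes_metadata(axis_names, space_unit="micrometer"):
    axes = []
    for name in axis_names:
        axis = {"name": name, "type": AXIS_TYPES[name]}
        if axis["type"] == "space":
            # Spatial axes carry a unit; time/channel axes are left bare here.
            axis["unit"] = space_unit
        axes.append(axis)
    return axes

# CZYX and TCZYX layouts just differ in which entries are present:
assert build_axes_metadata(["c", "z", "y", "x"])[0] == {"name": "c", "type": "channel"}
```

Generating this list from a single source of truth would keep the metadata consistent across tasks, instead of each task hard-coding its own axes block.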
Re: point 3
It should not be too challenging for functions with numpy arrays as inputs/outputs (thanks to broadcasting rules). It could be a bit trickier with dask arrays, but my feeling is that we are currently moving in a direction where dask is mostly used to lazily load arrays and organize the processing of several small parts (note that this could change, e.g. if we push towards in-task ROI parallelization, and then we may need to depend more heavily on dask arrays; to be assessed).
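A tiny example of why broadcasting makes the numpy case easy: an illumination-correction-style division on the trailing (y, x) axes works unchanged for 2D, 3D, or 4D inputs, since a 2D correction profile broadcasts over any leading axes (dummy profile, purely illustrative):

```python
# Sketch: a per-(y, x) correction profile broadcasts over any leading
# t/c/z axes, so the same function handles YX, ZYX, and CZYX inputs.
import numpy as np

profile = np.linspace(1.0, 2.0, 5).reshape(1, 5)  # dummy (1, x) profile

def correct(img):
    # NumPy aligns trailing axes, so this works for any number of leading axes.
    return img / profile

assert correct(np.ones((4, 5))).shape == (4, 5)              # YX
assert correct(np.ones((3, 4, 5))).shape == (3, 4, 5)        # ZYX
assert correct(np.ones((2, 3, 4, 5))).shape == (2, 3, 4, 5)  # CZYX
```

With dask arrays the same expression often works too, but chunk boundaries and rechunking costs are what can make it trickier in practice.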