fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Field of view parallelization via OME-NGFF ROI tables #24

Closed jluethi closed 2 years ago

jluethi commented 2 years ago

The problem

We need to have a way to process information per original field of view. Our initial attempt to do so, saving each field of view to a separate folder within the well (following the HCS spec), failed due to performance issues when browsing such data at low resolution. See https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/12 for the details.

Some of our tasks still rely on processing based on the original FOVs. For example, illumination correction is run per FOV and currently needs the image sizes as an input: https://github.com/fractal-analytics-platform/fractal/blob/529345b2d2f855a2b9cae7b237965fe35ee314bb/fractal/tasks/illumination_correction.py#L30
Labeling currently hard-codes the FOV sizes: https://github.com/fractal-analytics-platform/fractal/blob/529345b2d2f855a2b9cae7b237965fe35ee314bb/fractal/tasks/image_labeling.py#L112

We should generalize this and enable labeling to be done at different pyramid levels (e.g. labeling may be faster and perform well enough at pyramid level 1). For this, we need information about where the individual FOVs sit within the Zarr file. Moreover, if we want to distribute the processing (e.g. of labeling) across multiple jobs, we need to know ahead of time what those jobs will be.

Proposal

I suggest that we use the proposed OME-NGFF Tables for this. https://forum.image.sc/t/proposal-for-ome-ngff-table-specification/68908

The current OME-NGFF Tables proposal is for a table that is used to save features (see https://github.com/fractal-analytics-platform/fractal/issues/27 for discussions on how we can use it for our feature saving). Saving regions of interest (ROIs) is mentioned, but only as an outlook on a potential type of table to be supported. The reason for this is that ROIs can get quite complicated if one wants to save arbitrary 3D polygons. See some early work on this here: https://github.com/openssbd/bdz

For our use case though, I think we can get away with a very simple ROI definition. It could be a table like the following:

| FOV | z_corner | y_corner | x_corner | z_len | y_len | x_len |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 19 | 351 | 416 |
| 1 | 0 | 0 | 416 | 19 | 351 | 416 |
| 2 | 0 | 0 | 832 | 19 | 351 | 416 |
| 3 | 0 | 0 | 1248 | 19 | 351 | 416 |
| 9 | 0 | 351 | 0 | 19 | 351 | 416 |
| 10 | 0 | 351 | 416 | 19 | 351 | 416 |

The corner coordinates & lengths are in physical units, so one can map them to pixels by combining the scale information with the coordinates. We could populate this table based on the metadata parsed from the Yokogawa output, or generate it separately (see https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/25).
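For illustration, a minimal sketch of how such a table could be assembled from parsed metadata (the column names follow the example above; the per-FOV metadata layout and the use of pandas are illustrative assumptions, not the actual parsing code):

```python
import pandas as pd

# Hypothetical per-FOV positions in physical units (e.g. micrometers),
# as they could come out of the Yokogawa metadata parsing
fov_positions = [
    {"FOV": 0, "x": 0.0, "y": 0.0, "z": 0.0},
    {"FOV": 1, "x": 416.0, "y": 0.0, "z": 0.0},
    {"FOV": 2, "x": 832.0, "y": 0.0, "z": 0.0},
]
# Physical size of a single FOV (identical for all FOVs here)
fov_size = {"z_len": 19.0, "y_len": 351.0, "x_len": 416.0}

roi_table = pd.DataFrame(
    [
        {
            "FOV": pos["FOV"],
            "z_corner": pos["z"],
            "y_corner": pos["y"],
            "x_corner": pos["x"],
            **fov_size,
        }
        for pos in fov_positions
    ]
).set_index("FOV")
print(roi_table)
```

Such a table could then be written into the well's zarr group, e.g. as an AnnData table following the ome-ngff-tables-prototype linked above.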

Using such a table would allow us to do two things:

  1. We can check the table to know which FOVs we parallelize over
  2. We can load the spatial coordinates for each FOV and thus load the data at any pyramid level we are interested in (see the sketch below)
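A minimal sketch of point 2, assuming the physical pixel sizes of the chosen pyramid level are read from the coordinateTransformations metadata (the pixel sizes, the zarr path, and the helper name are illustrative):

```python
import dask.array as da

def roi_to_slices(roi, pixel_sizes_zyx):
    """Convert a physical-space ROI (corners + lengths) into integer slices."""
    starts = (roi["z_corner"], roi["y_corner"], roi["x_corner"])
    lengths = (roi["z_len"], roi["y_len"], roi["x_len"])
    return tuple(
        slice(round(start / pix), round((start + length) / pix))
        for start, length, pix in zip(starts, lengths, pixel_sizes_zyx)
    )

roi = {"z_corner": 0.0, "y_corner": 0.0, "x_corner": 416.0,
       "z_len": 19.0, "y_len": 351.0, "x_len": 416.0}

# Illustrative (z, y, x) pixel sizes at pyramid level 1 (2x XY coarsening)
pixel_sizes_level1 = (1.0, 0.325, 0.325)
slices = roi_to_slices(roi, pixel_sizes_level1)   # x -> slice(1280, 2560)

# Load only this region from the level-1 image of a well (path is hypothetical)
# data = da.from_zarr("some_plate.zarr/B/03/0/1")
# fov_region = data[(0,) + slices]                # channel 0, ROI only
```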

If this works well, such an approach could eventually be generalized and applied to, e.g., ROIs based on organoid segmentation, i.e. ROIs that no longer follow the original FOVs.

Limitations

There is not yet a defined way to save ROI information to the OME-NGFF Tables (the tables themselves are still very new, but work well using this prototype: https://github.com/kevinyamauchi/ome-ngff-tables-prototype). Given that a generalized way of saving ROIs could get quite a bit more complex (arbitrary shapes etc.), I suggest we start with a simple ROI table that just handles rectangular cuboids (3D boxes, i.e. all angles are 90 degrees, but side lengths aren't necessarily equal). If a standard for saving ROIs emerges later that is both general & still easy to use, we can still switch to it.

tcompa commented 2 years ago

Here are two points to consider, based on discussions with @mfranzon and @jluethi.

Parallel writing of ROIs

Two ROIs can overlap in different ways, which makes parallel writing of these regions unsafe.

The first case is when two ROIs have an actual overlap in physical space. This is probably rare (think of ROIs=FOVs, or ROIs=organoids), and in some cases we could mitigate the problem by switching to complex 3D shapes. If needed, we can probably also check for this issue explicitly. It is not considered a serious problem at the moment.

The second case is overlap in the underlying array structure, i.e. two ROIs may belong to the same chunk (a single ROI can also touch several chunks, but this doesn't change the current issue). The only case in which this problem is absent is when ROIs=FOVs and we work at the highest-resolution level (where chunks=FOVs). In all other cases the problem is present. For this reason, it is not safe to parallelize over ROIs, since processing two different ROIs could lead to simultaneous writing of the same chunk.
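For illustration, a small sketch of how one could check explicitly whether two ROIs touch the same chunks (the pixel indices and chunk shape are made up for the example):

```python
from itertools import product

def chunks_touched(starts, stops, chunk_shape):
    """Return the set of chunk indices overlapped by a pixel-space region."""
    ranges = [
        range(start // size, -(-stop // size))   # ceil division for the stop
        for start, stop, size in zip(starts, stops, chunk_shape)
    ]
    return set(product(*ranges))

chunk_shape = (1, 2160, 2560)   # e.g. chunks = single z-planes of one FOV
roi_a = chunks_touched((0, 0, 0), (10, 2160, 2560), chunk_shape)
roi_b = chunks_touched((0, 1080, 0), (10, 3240, 2560), chunk_shape)
print(roi_a & roi_b)   # non-empty => these two ROIs cannot be written in parallel
```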

Current (planned) solution: parallelization happens over wells, and within each well we scan through the ROIs sequentially. Note that this requires knowing which ROIs belong to a given well (which shouldn't be very hard to obtain).

Storing data at lower-resolution levels

One of the reasons for defining ROIs is that some processing (e.g. labeling) can happen for FOVs at pyramid levels higher than 0 (the highest-resolution one), say level 1. The output (e.g. a mask) then has a shape (in pixels) which differs from level 0 of the image array, but it will still be saved as level 0 in the new (labels) folder.

There are at least two options for how to proceed, and we are now aiming at the second one:

  1. We upscale the mask upon saving, so that the level 0 in the labels folder has exactly the same shape as the level 0 of the data.
  2. We store the mask at the original resolution, ending up with a shape mismatch between level 0 in the labels folder and level 0 in the data folder. For visualization, the mismatch is fixed by adding coordinateTransformations scaling in the metadata. For further processing (e.g. regionprops measurements), the mismatch is fixed by upscaling the mask on the fly before processing. Note that this means all downstream processing needs to be aware that upscaling is needed, and of its parameters (a minimal sketch follows below).
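A minimal sketch of such on-the-fly upscaling, assuming integer scaling factors between the label level and the image level (shapes are illustrative):

```python
import numpy as np

def upscale_labels(mask, factors):
    """Repeat each pixel along every axis to reach the higher-resolution grid."""
    for axis, factor in enumerate(factors):
        mask = np.repeat(mask, factor, axis=axis)
    return mask

# Labels computed at level 1 (half XY resolution in this example)
level1_mask = np.zeros((19, 1080, 1280), dtype=np.uint32)
level0_mask = upscale_labels(level1_mask, factors=(1, 2, 2))
assert level0_mask.shape == (19, 2160, 2560)   # matches the level-0 image shape
```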
tcompa commented 2 years ago

Question (regarding parallel writing): has someone else already thought about parallel writing of arbitrary, irregular regions of an array? To be verified.

jluethi commented 2 years ago

> The second case is overlap in the underlying array structure, i.e. two ROIs may belong to the same chunk (a single ROI can also touch several chunks, but this doesn't change the current issue). The only case in which this problem is absent is when ROIs=FOVs and we work at the highest-resolution level (where chunks=FOVs). In all other cases the problem is present. For this reason, it is not safe to parallelize over ROIs, since processing two different ROIs could lead to simultaneous writing of the same chunk.

It could be worth looking into the advanced indexing for Zarr. I haven't fully understood it yet, but this part of the documentation does sound promising:

> As of version 2.2, Zarr arrays support several methods for advanced or “fancy” indexing, which enable a subset of data items to be extracted or updated in an array without loading the entire array into memory.

Extracting & updating subsets of data is what we are after here. I'm not sure whether this would still fail when the same chunk is written to in parallel, but let's test this :) And let's keep looking for an existing example or implementation that gets around this issue. If all else fails, the whole-well workaround you describe seems feasible, @tcompa. I'd imagine we'll have an ROI table per well, so it should be trivial to know what needs to be processed together.
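As a starting point for that test, here is a quick sketch of reading and updating only an ROI-sized region of an on-disk zarr array (shape, chunks and the ROI are made up; this uses plain region slicing, with oindex/vindex covering the fancier cases):

```python
import numpy as np
import zarr

# Small on-disk test array, chunked per z-plane of a (mock) FOV
z = zarr.open(
    "test_roi.zarr", mode="w",
    shape=(19, 702, 832), chunks=(1, 351, 416), dtype="uint16",
)

# Write only the region corresponding to one ROI
roi = (slice(0, 19), slice(0, 351), slice(416, 832))
z[roi] = 42                      # touches only the chunks under this ROI

# Read the same region back without loading the full array
subset = z[roi]
assert np.all(subset == 42)
```

Whether this stays safe when two such writes hit the same chunk from different processes is the part that still needs testing.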

jluethi commented 2 years ago

Regarding Storing data at lower-resolution levels: yes, we could go either way. Let's briefly discuss this at next week's call as well.

Personally, I think option 2 is preferable. It is the general case and would allow us to store multi-resolution data if we ever need to support that. If we can use the same coordinateTransformations metadata to do this, we'll have a very general solution at hand. We could then abstract it away with a loading function that can be called by different tasks; we would need to brainstorm the best interface for this loading function, though.

tcompa commented 2 years ago

> It could be worth looking into the advanced indexing for Zarr. I haven't fully understood it yet, but this part of the documentation does sound promising:

Interesting, thanks for the link. The relevant section is https://zarr.readthedocs.io/en/stable/tutorial.html#parallel-computing-and-synchronization. Some quotes from a first quick read:

> Zarr arrays have been designed for use as the source or sink for data in parallel computations. [..] By data sink we mean that multiple concurrent write operations may occur, with each writer updating a different region of the array. Zarr arrays have not been designed for situations where multiple readers and writers are concurrently operating on the same array.

> When using a Zarr array as a data sink, some synchronization (locking) may be required to avoid data loss, depending on how data are being updated. If each worker in a parallel computation is writing to a separate region of the array, and if region boundaries are perfectly aligned with chunk boundaries, then no synchronization is required. However, if region and chunk boundaries are not perfectly aligned, then synchronization is required to avoid two workers attempting to modify the same chunk at the same time, which could result in data loss.

It's unclear to me whether zarr takes care of the synchronization (with something like synchronizer=zarr.ThreadSynchronizer()) only when the parallel writing comes from a single operation (i.e. a single line in a single Python script), or whether this also works when the parallel writing comes from arbitrary sources (i.e. two independent scripts that load the zarr and write to it). We'll need to look into this further.
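One thing worth testing here: zarr also provides a file-based ProcessSynchronizer, whose chunk locks live on the filesystem rather than inside a single Python process, so in principle it is not tied to a single operation. A minimal sketch of what each of the independent scripts could do (paths, shapes and the region are illustrative):

```python
import numpy as np
import zarr

# File-based chunk locks; all independent writers must point at the same path
synchronizer = zarr.ProcessSynchronizer("example.sync")

# mode="a" creates the array on first use and re-opens it afterwards
z = zarr.open_array(
    "example.zarr", mode="a",
    shape=(19, 702, 832), chunks=(1, 351, 416), dtype="uint16",
    synchronizer=synchronizer,
)

# Each script writes its own region; locks are acquired per chunk, so two
# writers that happen to touch the same chunk should serialize instead of
# silently losing data.
z[:, 0:351, 416:832] = np.full((19, 351, 416), 42, dtype="uint16")
```

Whether these file locks hold up when the writers are fully independent jobs (e.g. on different cluster nodes sharing a filesystem) is exactly what we'd need to verify.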

tcompa commented 2 years ago

Closing this, moving to