fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Extend/improve copy-zarr task #279

Open · tcompa opened 1 year ago

tcompa commented 1 year ago

EDIT: I'm revamping this somewhat old discussion, based on last week's meetings. The new comments start from https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/279#issuecomment-1637507613.


As per discussion with @gusqgm this morning. It's a task that copies a subset of a zarr, something like:

```python
def copy_zarr_subset(input_zarr, output_zarr, a_list_of_filters):
    pass
```
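
For concreteness, here is a minimal sketch of what this stub could grow into, assuming the filters are simply well sub-paths (e.g. "B/03") of an OME-Zarr plate; all names and the plain file-copy approach are illustrative, not a settled design:

```python
import shutil
from pathlib import Path


def copy_zarr_subset(input_zarr: str, output_zarr: str, a_list_of_filters: list[str]) -> None:
    """Copy selected wells of an OME-Zarr plate into a new OME-Zarr (sketch only).

    Plate-level metadata would still need to be rewritten so that it only lists
    the copied wells.
    """
    in_path, out_path = Path(input_zarr), Path(output_zarr)
    out_path.mkdir(parents=True, exist_ok=True)
    # Keep the top-level zarr group files and plate attributes
    for meta in (".zgroup", ".zattrs"):
        if (in_path / meta).exists():
            shutil.copy(in_path / meta, out_path / meta)
    # Copy each requested well, including all images/labels below it
    for well in a_list_of_filters:
        shutil.copytree(in_path / well, out_path / well)
```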
gusqgm commented 1 year ago

Thank you @tcompa for adding this!

The main use case for this, in my opinion, is the scenario where a user is creating a new workflow from scratch and needs to test several parameters of one or more tasks. Instead of doing this over the entire .zarr data, the user can, at the desired point of the workflow, generate a .zarr file with a subset of the data and use it for all required tests.

This .zarr file can be short-lived, i.e. created and used only for testing purposes and discarded once the workflow is finalized and ready to run on full datasets. It could also be used to share data among collaborators, along with workflows, for example. However, we would need to enforce that some information is added to this zarr file so that it is not confused with its parent dataset; maybe adding '_subset' as a suffix of the name?

Also, the largest drawback is that it requires the user to be vigilant and avoid keeping multiple unnecessary partial copies of the same data. The copy_zarr_subset task could be used either at the beginning, i.e. creating the partial copy of the data before running a task for testing, or at the end for the output, once the task being tested has run over a part of the main dataset. I assume the second option is safer for avoiding multiple identical copies of the data, but it could be more cumbersome to implement, since the partial input of the task would not be a separate .zarr in itself. What do you think?

I will think of more points as well.

jluethi commented 1 year ago

To quickly summarize my comments on this from the call: I think it's a good idea to have this as part of "allow users to experiment with parameters". And it's an area of improved flexibility we can work on before figuring out the whole "history of a dataset" question.

Regarding cleanup, number of copies etc.: I suggest we make a tmp folder in the output folder for such intermediate OME-Zarr files. Let's not get fancy about sharing or cleanup at the start; just have them in their own space. The major goal: allow users to test some parameters, check them on a small subset of the output and then adapt their workflows accordingly.


Another question: Should this be a task?

Technically, maybe. But it's something very different in a typical user story and a typical flow. A user may have a workflow of existing tasks but, e.g., want to try a few different parameters for the cellpose task on a single FOV. Now, this user could define an additional workflow that goes "copy OME-Zarr subset, then run a cellpose task". Maybe that's how we build it under the hood. But the user should eventually be able to say: I have this workflow, let me only run it on FOV 7 as a parameter test => they get that output to check (writing it to a separate file is a good idea).

jluethi commented 1 year ago

Just to note this down before I forget: When we get to the topic of running on subsets, it's certainly nice to be able to run on a subset of an existing Zarr file. A big use case is only processing a subset of the available data though, i.e. I have a folder with 100k images, I want to test on 2-3 FOVs whether my processing pipeline makes sense, before I convert images to OME-Zarr for the first time. This couldn't be achieved with a copy of a subset of the Zarr file, because the Zarr file doesn't exist yet and parsing data into OME-Zarr typically is the slowest part.

(we can decide to only cover this later, having this flexibility for all later steps is great. Let's just be aware that we probably also want to support this user flow above)

tcompa commented 1 year ago

> Just to note this down before I forget: When we get to the topic of running on subsets, it's certainly nice to be able to run on a subset of an existing Zarr file. A big use case is only processing a subset of the available data though, i.e. I have a folder with 100k images, I want to test on 2-3 FOVs whether my processing pipeline makes sense, before I convert images to OME-Zarr for the first time. This couldn't be achieved with a copy of a subset of the Zarr file, because the Zarr file doesn't exist yet and parsing data into OME-Zarr typically is the slowest part.
>
> (we can decide to only cover this later, having this flexibility for all later steps is great. Let's just be aware that we probably also want to support this user flow above)

This is already covered by the image-glob-pattern argument of the zarr-creation tasks, so the current issue only concerns the situation where we already have an OME-Zarr.
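
For reference, a toy illustration (not the actual task code; file names and patterns are made up) of how such a glob-pattern filter restricts which raw images get parsed into OME-Zarr in the first place:

```python
from fnmatch import fnmatch

# Made-up image names and patterns, just to show the filtering idea
all_images = ["B03_F001_C01.tif", "B03_F002_C01.tif", "C05_F001_C01.tif"]
patterns = ["*B03_F001*", "*B03_F002*"]  # e.g. parse only two FOVs of well B03

selected = [name for name in all_images if any(fnmatch(name, p) for p in patterns)]
print(selected)  # ['B03_F001_C01.tif', 'B03_F002_C01.tif']
```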

tcompa commented 1 year ago

Based on last week's meetings, it seems that an improved version of the copy-ome-zarr task could be a nice starting point for the "let me work on an experimental branch of my workflow" use case, even if this use case is not yet fully defined on the server/web side.

Some of the proposed new features:

  1. The task should offer the option to copy data as well, on top of the OME-Zarr structure and metadata.
  2. For the moment we should also maintain the option of not copying any data, since it's how MIP works (hopefully this will change in the future).
  3. The task should offer the option to only select a subset of the OME-Zarr components - see below.
  4. Copying data should all happen in the main task, even if in principle it could be parallelized over wells. There are multiple reasons for this:
    • Compound prepare&fill tasks are not intuitive when building a workflow -> let's reduce their use as much as possible.
    • When selecting a subset of the OME-Zarr data, it would be complex to let the server build the appropriate component list.
    • Copying a small array should still be a reasonably fast operation, and we can verify that it gets a bit faster by increasing the CPU requirements. Copying a large array is not something we should ever encourage, so we don't need to optimize that use case.
  5. The writing of updated metadata will then need to be aligned with https://github.com/fractal-analytics-platform/fractal-server/issues/792.

Concerning the subset-filter, here are some possibilities (sorted by increasing complexity):

  • V0: select a single well, or a list of wells
  • V1: select the same ROI from all wells (TBD what to do if it does not exist)
  • V2: same as V1, but handling edge cases
  • V3: select a specific ROI from each well
  • V4: select N ROIs from M wells => into individual OME-Zarrs
  • V5: select N ROIs from M wells => into the same OME-Zarr
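
To make points 1-4 and the V0 filter more concrete, here is a hedged sketch of how the options could surface in the task signature; `copy_data` and `well_subset` are illustrative names and the file-level copy is a simplification, not the actual task API:

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def copy_ome_zarr(
    input_zarr: str,
    output_zarr: str,
    copy_data: bool = False,                  # points 1-2: copy arrays too, or structure/metadata only
    well_subset: Optional[list[str]] = None,  # point 3, V0: e.g. ["B/03", "C/05"]; None = all wells
) -> None:
    """Illustrative sketch of an extended copy-ome-zarr task (not the real API)."""
    in_path, out_path = Path(input_zarr), Path(output_zarr)
    if well_subset is None:
        # Fall back to all wells listed in the OME-NGFF plate metadata
        plate_attrs = json.loads((in_path / ".zattrs").read_text())
        well_subset = [well["path"] for well in plate_attrs["plate"]["wells"]]
    for well in well_subset:
        if copy_data:
            # Point 4: data copying happens here in the main task, not parallelized over wells
            shutil.copytree(in_path / well, out_path / well)
        else:
            # Replicate only zarr metadata files (.zgroup, .zattrs, .zarray), no chunks
            for meta in (in_path / well).rglob(".z*"):
                target = out_path / well / meta.relative_to(in_path / well)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy(meta, target)
    # Point 5: plate-level metadata would still need updating so that it only lists
    # the copied wells (cf. fractal-analytics-platform/fractal-server#792).
```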