fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License

When should tasks write metadata files? Should they write them to disk at all? #621

Closed jluethi closed 1 year ago

jluethi commented 1 year ago

I think we should tackle another round of refactoring of the metadata handling and of how things are passed between tasks. Specifically: do tasks always need to write metadata, and can tasks start without metadata? We already have a mix today: some things are saved to the metadata, while others are read from the `.zattrs` files when needed.


Relevant issues: it would be useful if parallel tasks did not write metadata at all (we are not using it anyway), see https://github.com/fractal-analytics-platform/fractal-server/issues/474#issuecomment-1506621022. It is also an open question how to best start from an existing OME-Zarr file: https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/351

Also, an important principle: metadata should be something we can recover again from the OME-Zarr file itself: https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/212
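As a sketch of that principle, the list of image components can be rebuilt purely from the OME-NGFF plate/well metadata (plate => wells => images). The `.zattrs` contents below are simplified, in-memory stand-ins; in a real store they would be read from disk (e.g. with zarr or by opening the `.zattrs` JSON files directly):

```python
# Hypothetical, simplified .zattrs contents following the OME-NGFF
# plate/well layout; hard-coded here instead of being read from a Zarr store.
plate_zattrs = {
    "plate": {"wells": [{"path": "B/03"}, {"path": "B/04"}]}
}
well_zattrs = {
    "B/03": {"well": {"images": [{"path": "0"}]}},
    "B/04": {"well": {"images": [{"path": "0"}, {"path": "1"}]}},
}

def list_components(plate_meta, well_meta):
    """Recover image components (e.g. 'B/03/0') purely from OME-Zarr metadata."""
    components = []
    for well in plate_meta["plate"]["wells"]:
        well_path = well["path"]
        for image in well_meta[well_path]["well"]["images"]:
            components.append(f"{well_path}/{image['path']}")
    return components

print(list_components(plate_zattrs, well_zattrs))
# → ['B/03/0', 'B/04/0', 'B/04/1']
```

If this always works, a plate-level task never strictly needs metadata as input; it can re-derive the components on demand.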


Let's differentiate between the inputs to plate-level and parallel tasks:

  1. Input to a plate-level task => could just be read from the OME-Zarr file, so those tasks wouldn't need metadata input, right? If we can recreate all the metadata from the OME-Zarr file, we don't need to know it before the task starts.
  2. Input used to set up the tasks that parallelize over wells and to know which components exist => needs to be known beforehand. But we wouldn't need to know everything about the components, just where the actual OME-Zarr images are (e.g. the subfolder B/03/0) and how many images there are.

To me this suggests: plate-level tasks should produce some (potentially minimal) metadata, e.g. the components, which is then used by all downstream tasks. But plate-level tasks don't need to take metadata as an input. In that setting, metadata is not so much something that is passed from task to task, but rather something the first task creates and downstream tasks consume. Parallel tasks shouldn't write any new components (right?)
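A minimal sketch of that flow, with illustrative function names (`create_ome_zarr`, `process_image` are stand-ins, not the real task interfaces): the first task creates the metadata, parallel tasks only consume it and return no updates.

```python
def create_ome_zarr(zarr_url: str) -> dict:
    # Plate-level task: needs no metadata input; emits only the minimal
    # metadata downstream tasks require in order to parallelize.
    components = ["B/03/0", "B/03/1"]  # discovered while building the plate
    return {"image": components}

def process_image(zarr_url: str, component: str, metadata: dict) -> dict:
    # Parallel task: runs once per component; writes no new components,
    # so it returns no metadata updates.
    return {}

metadata = create_ome_zarr("plate.zarr")
for component in metadata["image"]:
    process_image("plate.zarr", component, metadata)
```

The design choice here is that metadata flows strictly downstream from the task that created the plate, rather than being threaded through and mutated by every task.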


Open questions:


Let's keep in mind that we may soon move the MIP projections into the same Zarr file (see https://github.com/ome/ngff/issues/187), thus potentially having different components. That could actually make the logic above a bit more complex: what if we don't need the plate-level copy-ome-zarr task anymore, because each projection can just be run within the image and produces a new projection? I think we could have the projections as subgroups within the existing components, but we should think through how this interacts with metadata handling and with loading specific data.

jluethi commented 1 year ago

Let's also keep in mind multiplexing scenarios when we refactor things here, i.e. cases where there are multiple images (components) per well: sometimes we need to process them separately, but in other scenarios we will need to process them together in a single task execution.
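The "process together" case above can be sketched as a simple grouping of components by their well; this is an illustrative helper, not an existing fractal-server function:

```python
from collections import defaultdict

def group_by_well(components):
    """Group image components by well, so multiplexed acquisitions
    (multiple images per well) can be handled in one task execution."""
    wells = defaultdict(list)
    for component in components:
        well, _image = component.rsplit("/", 1)
        wells[well].append(component)
    return dict(wells)

components = ["B/03/0", "B/03/1", "B/04/0"]  # two acquisitions in well B/03
print(group_by_well(components))
# → {'B/03': ['B/03/0', 'B/03/1'], 'B/04': ['B/04/0']}
```

Whether parallelization happens per component or per well could then be a property of the task, decided when the server builds the parallelization list.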

Given that there are a few things to consider more deeply here, let's collect them in this issue and come up with the requirements for the metadata, rather than starting a refactor right now :)

jluethi commented 1 year ago

Closing as redundant, given https://github.com/fractal-analytics-platform/fractal-server/issues/802 and the discussions on more flexible component handling in https://github.com/fractal-analytics-platform/fractal-server/issues/792.