fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License

When should tasks write metadata files? Should they write them to disk at all? #621

Closed jluethi closed 1 year ago

jluethi commented 1 year ago

I think we should tackle another round of refactoring of the metadata handling and of how things are passed between tasks. Specifically: do tasks always need to write metadata, and can tasks start without metadata? We already have a mix today: some things are saved to the metadata, while others are read from the `.zattrs` files when needed.


Relevant issues: it would be useful if parallel tasks did not write metadata at all (we are not using it anyway), see https://github.com/fractal-analytics-platform/fractal-server/issues/474#issuecomment-1506621022. It is also an open question how to best start from an existing OME-Zarr file: https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/351

Also, an important principle: metadata should be something we can recover again from the OME-Zarr file itself: https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/212
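As a sketch of that principle, the list of image components can be rebuilt purely from the OME-NGFF plate/well metadata (plate => wells => images). The `.zattrs` contents below are simplified, in-memory stand-ins; in a real store they would be read from disk (e.g. with zarr or by opening the `.zattrs` JSON files directly):

```python
# Hypothetical, simplified .zattrs contents following the OME-NGFF
# plate/well layout; hard-coded here instead of being read from a Zarr store.
plate_zattrs = {
    "plate": {"wells": [{"path": "B/03"}, {"path": "B/04"}]}
}
well_zattrs = {
    "B/03": {"well": {"images": [{"path": "0"}]}},
    "B/04": {"well": {"images": [{"path": "0"}, {"path": "1"}]}},
}

def list_components(plate_meta, well_meta):
    """Recover image components (e.g. 'B/03/0') purely from OME-Zarr metadata."""
    components = []
    for well in plate_meta["plate"]["wells"]:
        well_path = well["path"]
        for image in well_meta[well_path]["well"]["images"]:
            components.append(f"{well_path}/{image['path']}")
    return components

print(list_components(plate_zattrs, well_zattrs))
# → ['B/03/0', 'B/04/0', 'B/04/1']
```

If this always works, a plate-level task never strictly needs metadata as input; it can re-derive the components on demand.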


Let's differentiate between the inputs to plate-level and parallel tasks:

  1. Input to a plate-level task => could just be read from the OME-Zarr file, so those tasks wouldn't need metadata input, right? If we can recreate all the metadata from the OME-Zarr file, we don't need to know it before the task starts.
  2. Input used to set up the tasks that parallelize over wells and to know which components exist => needs to be known beforehand. But we wouldn't need to know everything about the components, just where the actual OME-Zarr images are (e.g. the subfolder B/03/0) and how many images there are.

To me this suggests: plate-level tasks should produce some (potentially minimal) metadata, e.g. the components, which is then used by all downstream tasks. But plate-level tasks don't need to take metadata as an input. In that setting, metadata is not so much something that is passed from task to task, but rather something the first task creates and downstream tasks consume. Parallel tasks shouldn't write any new components (right?)
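A minimal sketch of that flow, with illustrative function names (`create_ome_zarr`, `process_image` are stand-ins, not the real task interfaces): the first task creates the metadata, parallel tasks only consume it and return no updates.

```python
def create_ome_zarr(zarr_url: str) -> dict:
    # Plate-level task: needs no metadata input; emits only the minimal
    # metadata downstream tasks require in order to parallelize.
    components = ["B/03/0", "B/03/1"]  # discovered while building the plate
    return {"image": components}

def process_image(zarr_url: str, component: str, metadata: dict) -> dict:
    # Parallel task: runs once per component; writes no new components,
    # so it returns no metadata updates.
    return {}

metadata = create_ome_zarr("plate.zarr")
for component in metadata["image"]:
    process_image("plate.zarr", component, metadata)
```

The design choice here is that metadata flows strictly downstream from the task that created the plate, rather than being threaded through and mutated by every task.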


Open questions:


Let's keep in mind that we may soon move the MIP projections into the same Zarr file (see https://github.com/ome/ngff/issues/187), thus potentially having different components. That could actually make the logic above a bit more complex: what if we don't need the plate-level copy-ome-zarr task anymore, because each projection can just be run within the image and produces a new projection? I think we could have the projections as subgroups within the existing components, but we should think through how this interacts with metadata handling and with loading specific data.

jluethi commented 1 year ago

Let's also keep in mind multiplexing scenarios when we refactor things here, i.e. cases where there are multiple images (components) per well: sometimes we need to process them separately, but in other scenarios we will need to process them together in a single task execution.
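The "process together" case above can be sketched as a simple grouping of components by their well; this is an illustrative helper, not an existing fractal-server function:

```python
from collections import defaultdict

def group_by_well(components):
    """Group image components by well, so multiplexed acquisitions
    (multiple images per well) can be handled in one task execution."""
    wells = defaultdict(list)
    for component in components:
        well, _image = component.rsplit("/", 1)
        wells[well].append(component)
    return dict(wells)

components = ["B/03/0", "B/03/1", "B/04/0"]  # two acquisitions in well B/03
print(group_by_well(components))
# → {'B/03': ['B/03/0', 'B/03/1'], 'B/04': ['B/04/0']}
```

Whether parallelization happens per component or per well could then be a property of the task, decided when the server builds the parallelization list.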

Given that there are a few things to consider more deeply here, let's collect them in this issue and come up with the requirements for the metadata, rather than starting a refactor right now :)

jluethi commented 1 year ago

Closing as redundant, given https://github.com/fractal-analytics-platform/fractal-server/issues/802 and the discussions on more flexible component handling in https://github.com/fractal-analytics-platform/fractal-server/issues/792.