fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License

Propagate history from db `Dataset` to ome-zarr #508

Open · jluethi opened this issue 1 year ago

jluethi commented 1 year ago

See overview here: https://github.com/fractal-analytics-platform/fractal-server/issues/506

Where is history written to?

We have a very short version of the history as part of the metadata that is passed between tasks; it lists the name of each task and the components it was run on. This is later saved to the database (as part of the dataset?). It is useful when looking at the status of a job, to see what processing has finished.
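For illustration, a minimal sketch of what such a history list could look like (the task names, components, and exact schema are made up; the actual structure in fractal-server may differ):

```python
# Hypothetical shape of the short history carried in the metadata dictionary
# that is passed between tasks: one entry per task, recording the task name
# and the components it was run on.
history = [
    {
        "task": "yokogawa_to_ome_zarr",
        "components": ["plate.zarr/B/03/0", "plate.zarr/B/03/1"],
    },
    {
        "task": "illumination_correction",
        "components": ["plate.zarr/B/03/0", "plate.zarr/B/03/1"],
    },
]
```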

The downside of this is that the actual OME-Zarr dataset does not carry its own history. So if, e.g., the file is shared with someone, they cannot see how it was processed. If the database is ever lost, or if we lose the connection between the on-disk OME-Zarr file and the database, that history would also disappear.

Thus, a better place to store this history information would be the OME-Zarr file itself. I think we should have such metadata for each OME-Zarr image (e.g. each image in a well). That would be generalizable even if OME-Zarr collections other than HCS plates are processed, and it gives us a clear way to structure the granularity. We could store it in the image .zattrs file, similar to how multiscales and omero metadata are saved (see https://ngff.openmicroscopy.org/latest/#multiscale-md). The only limitation we need to tackle there eventually: what if some processing combined multiple images, e.g. if multiplexing processing combines images of different cycles? At a first level, I still think we can probably just write the corresponding history into the output OME-Zarr image .zattrs, but this needs to be investigated further.
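For concreteness, a minimal sketch (using zarr-python, with a made-up image path, key name, and entry fields; this is not an NGFF-specified schema) of what appending such a history entry to an image's .zattrs could look like:

```python
import zarr

# Open an existing OME-Zarr image group (path is hypothetical).
image_group = zarr.open_group("plate.zarr/B/03/0", mode="r+")

# A hypothetical history entry; the fields are illustrative only.
entry = {
    "task": "illumination_correction",
    "workflow": "example-workflow",
    "timestamp": "2023-05-10T12:00:00Z",
}

# Append to a "history" list stored alongside "multiscales" / "omero"
# in the image's .zattrs; assigning to attrs writes the file back.
history = list(image_group.attrs.get("history", []))
history.append(entry)
image_group.attrs["history"] = history
```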

The database could potentially still have some representation of that history, for easy querying, or because it is written there first and then put into the OME-Zarr file. Or maybe the history starts (as now) in the metadata dictionary, is then written into the OME-Zarr file, and some of it also goes into the database?

tcompa commented 1 year ago

> Thus, a better place to store this history information would be the OME-Zarr file itself. I think we should have such metadata for each OME-Zarr image (e.g. each image in a well). That would be generalizable even if OME-Zarr collections other than HCS plates are processed, and it gives us a clear way to structure the granularity. We could store it in the image .zattrs file, similar to how multiscales and omero metadata are saved (see https://ngff.openmicroscopy.org/latest/#multiscale-md). The only limitation we need to tackle there eventually: what if some processing combined multiple images, e.g. if multiplexing processing combines images of different cycles? At a first level, I still think we can probably just write the corresponding history into the output OME-Zarr image .zattrs, but this needs to be investigated further.

For the record, something related is discussed in https://github.com/ome/ngff/issues/174 (see http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#description-of-file-contents, where "Well-behaved generic netCDF filters will automatically append their name and the parameters with which they were invoked to the global history attribute of an input netCDF file.").
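As a sketch of that CF-conventions pattern applied to an OME-Zarr group (the path, tool name, and parameters below are made up, and the "history" attribute name is an assumption, not part of the NGFF spec):

```python
from datetime import datetime, timezone

import zarr

# Append "<timestamp>: <tool and parameters>" to a string-valued "history"
# attribute, mimicking the netCDF/CF "global history attribute" behaviour.
group = zarr.open_group("plate.zarr/B/03/0", mode="r+")
stamp = datetime.now(timezone.utc).isoformat()
new_line = f"{stamp}: illumination_correction --overwrite --background 110"
previous = group.attrs.get("history", "")
group.attrs["history"] = f"{previous}\n{new_line}" if previous else new_line
```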

tcompa commented 11 months ago

I renamed the issue to reflect different aspects: