fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Rerunning a workflow with copy OME-Zarr task leads to _mip_mip.zarr files #510

Closed: jluethi closed this issue 1 year ago

jluethi commented 1 year ago

I have observed unexpected behavior when rerunning workflows that contain the copy OME-Zarr task: a rerun creates an additional 20200812-CardiomyocyteDifferentiation14-Cycle1_mip_mip.zarr, and a third run then creates 20200812-CardiomyocyteDifferentiation14-Cycle1_mip_mip_mip.zarr.

Expected behavior: I rerun the workflow, the content of the 20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr gets replaced with the new content.

Observed behavior: Every rerun creates a new _mip addition to the OME-Zarr.

I'm not sure whether this has to do with the dataset somehow, i.e. with the way we update the components in the copy task.

[Screenshot 2023-09-05 at 19:12:44]

Example workflow (the registration tasks can also be skipped, just the 4 tasks are relevant): workflow-export-Multiplexing_registration-1693933995989.txt

tcompa commented 1 year ago

This behavior is consistent with how the copy task is defined. The timeline (in the observed behavior) should look something like this:

  1. I prepare a valid ome-zarr. In the component list (within the dataset metadata), I have the list of plates (say it's a single one, plate.zarr), and of wells/images.
  2. I run a (non-parallel) copy-ome-zarr task with suffix mip. This copies plate.zarr into plate_mip.zarr, and modifies in-place all entries of the component list with the replacement plate.zarr -> plate_mip.zarr.
  3. From now on, the plate.zarr string is not any more present in the dataset metadata, and the input for the next task is plate_mip.zarr. If I re-run a copy-ome-zarr task with suffix abc, this will copy plate_mip.zarr into plate_mip_abc.zarr, and so on.
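
The suffix accumulation in step 3 can be sketched as a plain string replacement on the component list (a minimal illustration with hypothetical names, not the actual task code):

```python
# Minimal illustration of the suffix-appending behavior described above.
# Names are hypothetical; the real task operates on the dataset metadata.

def apply_copy_suffix(components: list[str], suffix: str) -> list[str]:
    """Replace each 'plate.zarr' entry with 'plate_<suffix>.zarr'."""
    return [c.replace(".zarr", f"_{suffix}.zarr") for c in components]

components = ["plate.zarr/B/03/0"]
components = apply_copy_suffix(components, "mip")
print(components)  # ['plate_mip.zarr/B/03/0']

# A second run only sees the already-suffixed entries, so the suffix piles up:
components = apply_copy_suffix(components, "mip")
print(components)  # ['plate_mip_mip.zarr/B/03/0']
```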

In other words, we do not have a rich structure for component lists (like 2D vs 3D labels), and at the same time we are not currently able to "select" which components to work on. Ref https://github.com/fractal-analytics-platform/fractal-server/issues/792, for instance, but this touches upon multiple issues related to the definition of task I/O, the constraints on workflow structure, the reserved arguments, ...

I think this is exactly what Joel's two helper tasks (the ones for 3D->2D->3D workflows) were doing, right? Having some "fake" task that exists only to update the dataset metadata in the desired way. Note that this may also be possible by modifying the dataset properties (*), but that would require a user's action (and it would then split workflow execution into two parts).

(*) I'm not sure of whether this is currently supported, but it'd be easy to add to fractal-server and to the clients if needed.

jluethi commented 1 year ago

Hmm, yes, that makes sense. I guess it shows the limitation of our current component handling system and how overwriting components to get from 3D to 2D was always a bit of a hack.

Unfortunately, this breaks the core flow for rerunning workflows once they contain a copy-ome-zarr anywhere. We clearly have to address the general handling of components eventually.

What surprised me here is that I'm only reusing the same output dataset and rerunning the full workflow. I would have expected that to work, as the initial task does create a fresh, correct component list. I don't fully understand why it gets the component list of the output dataset instead of the one created by the Create OME Zarr task.

Let's briefly discuss afterwards

tcompa commented 1 year ago

I'm trying this:

  1. Prepare I/O dataset
  2. Prepare a workflow with create_ome_zarr+yokogawa_to_zarr+copy_ome_zarr+MIP
  3. Run with the appropriate I/O datasets. After this run is complete, I find that:
    • The output folder includes plate.zarr and plate_mip.zarr
    • The dataset components/history look correct (without any mip_mip).
  4. Re-run the same workflow once again, with the same I/O datasets

At this point, I also observe that plate_mip_mip.zarr was created (by the copy-ome-zarr task; in fact it contains no actual image data). The copy-ome-zarr logs read

2023-09-06 11:40:35,393; INFO; START copy_ome_zarr task
2023-09-06 11:40:35,393; INFO; list_plates=['/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr', '/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr']
2023-09-06 11:40:35,393; INFO; zarrurl_old='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr'
2023-09-06 11:40:35,393; INFO; zarrurl_new='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr'
2023-09-06 11:40:35,393; INFO; meta_update={'copy_ome_zarr': {'suffix': 'mip', 'sources': {'20200812-CardiomyocyteDifferentiation14-Cycle1_mip': '/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr'}}}
2023-09-06 11:40:35,403; INFO; Start open_zarr_group_with_overwrite (overwrite=True).
2023-09-06 11:40:35,403; INFO; Zarr group /home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr already exists, with keys=[]
2023-09-06 11:40:35,407; INFO; well_paths=['B/03']
2023-09-06 11:40:35,407; INFO; image_paths=['0']
2023-09-06 11:40:35,408; INFO; I will now read FOV_ROI_table from zarrurl_old='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr', convert it to 2D, and write it back to the new zarr file.
2023-09-06 11:40:35,419; INFO; I will now read well_ROI_table from zarrurl_old='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr', convert it to 2D, and write it back to the new zarr file.
2023-09-06 11:40:35,430; INFO; zarrurl_old='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr'
2023-09-06 11:40:35,430; INFO; zarrurl_new='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip_mip.zarr'
2023-09-06 11:40:35,430; INFO; meta_update={'copy_ome_zarr': {'suffix': 'mip', 'sources': {'20200812-CardiomyocyteDifferentiation14-Cycle1_mip': '/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1.zarr', '20200812-CardiomyocyteDifferentiation14-Cycle1_mip_mip': '/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr'}}}
2023-09-06 11:40:35,430; INFO; Start open_zarr_group_with_overwrite (overwrite=True).
2023-09-06 11:40:35,430; INFO; Zarr group /home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip_mip.zarr does not exist yet.
2023-09-06 11:40:35,430; INFO; well_paths=['B/03']
2023-09-06 11:40:35,431; INFO; image_paths=['0']
2023-09-06 11:40:35,431; INFO; I will now read FOV_ROI_table from zarrurl_old='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr', convert it to 2D, and write it back to the new zarr file.
2023-09-06 11:40:35,442; INFO; I will now read well_ROI_table from zarrurl_old='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr', convert it to 2D, and write it back to the new zarr file.
2023-09-06 11:40:35,452; INFO; END copy_ome_zarr task

and the line which doesn't look right is

2023-09-06 11:40:35,430; INFO; zarrurl_old='/home/fractal-share/data/output2/20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr'

More on this soon.

tcompa commented 1 year ago

Here is the explanation (in copy-ome-zarr):

    # List all plates
    in_path = Path(input_paths[0])
    list_plates = [p.as_posix() for p in Path(in_path).glob("*.zarr")]   # <--- THIS IS THE PROBLEM
    logger.info(f"{list_plates=}")

    meta_update: dict[str, Any] = {"copy_ome_zarr": {}}
    meta_update["copy_ome_zarr"]["suffix"] = suffix
    meta_update["copy_ome_zarr"]["sources"] = {}

    # Loop over all plates
    for zarrurl_old in list_plates:
        zarrfile = zarrurl_old.split("/")[-1]
        old_plate_name = zarrfile.split(".zarr")[0]
        new_plate_name = f"{old_plate_name}_{suffix}"
        new_plate_dir = Path(output_path).resolve()
        zarrurl_new = f"{(new_plate_dir / new_plate_name).as_posix()}.zarr"
        meta_update["copy_ome_zarr"]["sources"][new_plate_name] = zarrurl_old
        ... # copy zarrurl_old to zarrurl_new

We are currently copying all zarr files in the input folder.
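
A minimal reproduction of that over-broad listing, using a temporary folder in place of the real input path (paths are hypothetical):

```python
# Reproduce the over-broad glob: after the first run, the input folder
# contains both the original plate and its _mip copy, and both match "*.zarr".
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    in_path = Path(tmp)
    (in_path / "plate.zarr").mkdir()
    (in_path / "plate_mip.zarr").mkdir()  # created by the first run

    # The same glob used by copy_ome_zarr picks up both zarr groups,
    # so a second run also copies plate_mip.zarr -> plate_mip_mip.zarr.
    list_plates = sorted(p.name for p in in_path.glob("*.zarr"))
    print(list_plates)  # ['plate.zarr', 'plate_mip.zarr']
```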

tcompa commented 1 year ago

Here is a first solution:

The metadata input parameter already includes a list of plates (the value corresponding to the plate key). At the very least, we should use that one rather than just listing the existing folders. We should still verify that each plate actually exists in the input folder, but existence alone is not a sufficient criterion for including a plate in the copying procedure.
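
A sketch of that first solution, assuming the metadata dict carries a plate key listing the current plates (function and variable names here are illustrative, not the actual task code):

```python
# Sketch of the proposed fix: take the plate list from the task's metadata
# instead of globbing the input folder, keeping only plates that exist on disk.
import tempfile
from pathlib import Path

def list_plates_from_metadata(metadata: dict, in_path: Path) -> list[str]:
    return [
        (in_path / plate).as_posix()
        for plate in metadata.get("plate", [])
        if (in_path / plate).is_dir()
    ]

with tempfile.TemporaryDirectory() as tmp:
    in_path = Path(tmp)
    (in_path / "plate.zarr").mkdir()
    (in_path / "plate_mip.zarr").mkdir()  # left over from a previous run
    # The metadata lists only the current plate, so the leftover is ignored:
    metadata = {"plate": ["plate.zarr"]}
    plates = list_plates_from_metadata(metadata, in_path)
    print([Path(p).name for p in plates])  # ['plate.zarr']
```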

Any better idea?

tcompa commented 1 year ago

Of course we can do better, e.g. we could make this one a plate-level task, but this would require some not-yet-available features.

Sticking with the current non-parallel version, we could make it more flexible with some glob patterns, but this would mostly be a temporary fix while waiting for a more flexible architecture and task definitions. Therefore I'd rather stick with the default behavior (and no new input parameter) and improve it as part of the planned future refactoring.
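
For the record, one way such a pattern-based temporary fix could look (a hypothetical sketch, not adopted here): skip any plate whose name already ends with the requested suffix, so a rerun does not copy plate_mip.zarr a second time.

```python
# Hypothetical temporary fix (not adopted): filter out plates whose name
# already carries the requested suffix before the copy loop runs.
def filter_already_suffixed(list_plates: list[str], suffix: str) -> list[str]:
    return [p for p in list_plates if not p.endswith(f"_{suffix}.zarr")]

plates = ["/data/plate.zarr", "/data/plate_mip.zarr"]
print(filter_already_suffixed(plates, "mip"))  # ['/data/plate.zarr']
```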