fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License
9 stars 3 forks source link

Enforce that `zarr_url` must be relative to `zarr_dir` #1319

Closed tcompa closed 2 months ago

tcompa commented 3 months ago
  1. Within the runner:
    • Every time an image is created, perform this check. Note: this also includes checking places where we .append to the image list
    • Forbid edits of an image path/zarr_url attribute
  2. Within the API:
    • Every time an image is created, perform this check.
    • Forbid edits of an image path/zarr_url attribute
    • Forbid edits a dataset zarr_dir
jluethi commented 3 months ago

Just for my understanding: zarr_url is still the full path to the zarr file. But it should always contain the zarr_dir base?

e.g. a valid zarr_url is: /path/to/zarr_dir/myzarr.zarr/B/03/0 in the zarr_dir: /path/to/zarr_dir/

Or does this mean something different?

Big picture, let's not put this constraint too broadly. I see it as useful, but there may be a future (especially with OME-Zarr collections) where some zarr files (e.g. labels) live in different folders than the other zarr files.

tcompa commented 3 months ago

Just for my understanding: zarr_url is still the full path to the zarr file. But it should always contain the zarr_dir base?

e.g. a valid zarr_url is: /path/to/zarr_dir/myzarr.zarr/B/03/0 in the zarr_dir: /path/to/zarr_dir/

Yes, this is correct.

More details:

  1. We opted for absolute image paths, rather than base+relative, to streamline the task API (a single absolute-path argument rather than two partially-informative arguments). This has a cost in terms of workflow portability and re-use, since the information on where to write zarrs is now part of the workflow (rather than dataset); in this case, re-submitting the same workflow with a different dataset is not so well defined any more.
  2. Early version of converters had a zarr_dir (absolute-path) input argument, to determine where new images would be created.
  3. Such zarr_dir argument would not fit very well with portability, and it would also introduce a small overwrite risk (if I export a worfklow and then re-import it, the zarr_dir value is the same as in the original workflow).
  4. To improve the use case where the same workflow (with or without the export/import intermediate steps) is re-used on a different dataset, we moved the task arguments zarr_dir to the dataset attribtues DatasetV2.zarr_dir.
  5. The duplication of this information (paths are absolute, but the basedir is also stored elsewhere) leads to the additional constraints described in the current issue.
  6. Only a few tasks would need to use zarr_dir (e.g. the init tasks of converters and the init task of MIP). Right now, we are passing this argument only to non-parallel tasks, while parallel tasks do not get it. If we want to be even more granular, we would need to add a task attribute such as task.requires_zarr_dir = True/False.
  7. We are only introducing the constraint server-side, and not within the tasks. Once the task receives a given zarr_url(s) (which must belong to zarr_dir), then it can still load/write data from any arbitrary path.

Big picture, let's not put this constraint too broadly. I see it as useful, but there may be a future (especially with OME-Zarr collections) where some zarr files (e.g. labels) live in different folders than the other zarr files.

I'd say that the scope threshold here is set as in point 7 above - does it sound right?

jluethi commented 3 months ago

I'd say that the scope threshold here is set as in point 7 above - does it sound right?

Sounds good. Yes, let's go with it this way then. That scope is narrow enough that we can change it again if in 6-12 months, collections with more complex behaviors come up :)

My main point here: For what we're currently planning, zarr_dir will always be the base of the zarr_url. And it makes sense to only show the second part in some interfaces (e.g. in the image list): myzarr.zarr/B/03/0, given that we have a common base_dir.

But we shouldn't have the assumption that entries of the image list share a zarr_dir too deep in our architecture, because that could change with future OME-Zarr versions.