fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License
11 stars 5 forks source link

Task output API #679

Closed tcompa closed 2 months ago

tcompa commented 3 months ago

Originally posted by @jluethi in https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/671#issuecomment-2037752213

One thing I'd like to review is whether this is the intended & simplest image_list_update we can provide:

    image_list_update_dict = dict(
        image_list_updates=[
            dict(
                zarr_url=zarr_url,
                origin=init_args.origin_url,
                types=dict(
                    is_3D=False,
                ),
            )
        ]
    )

Having to provide a dict of a list of a dict seems kind of cumbersome. I understand why it needs to be a list of dicts, as some tasks will provide updates to multiple images. Having this list in a dict with the only key image_list_updates is there so we know whether a task provided an image_list_updates or a parallelization_list I assume?

tcompa commented 3 months ago

Having this list in a dict with the only key image_list_updates is there so we know whether a task provided an image_list_updates or a parallelization_list I assume?

Rather than the difference between image-list-updates and parallelization-list, returning a top-level dictionary is necessary to distinguish between image_list_updates, image_list_removals and filters. Agreed: when only one of those is set, the rich structure appears cumbersome.

As a by-product, returning a top-level dictionary also fits well with validation of the task metadata output through Pydantic models, but this is not a hard constraint - we can customize this validation logic at will to avoid Pydantic if we have a reason. An example to make this clear i that e.g. for init tasks we could in principle move from

return dict(parallelization_list=[...])

to

return [...]

and this would not be too hard to validate server-side, even if it's not vanilla Pydantic.

What we don't have an alternative for is the case of a parallel task (or a non-parallel standalone one), where we would need to replace

return dict(
    image_list_updates=[...],
    image_list_removals=[...],
    filters={...},
)

with something else.

jluethi commented 3 months ago

Ah right, thanks for the details on this!

Conclusion: the image_list_updates stay as they are for good reason. They are also the ones typically provided by task developers.

Whether we should simplify the parallelization_lists: May be nice. Can we still robustly detect the output and validate it server-side? e.g. avoid confusion between an init_task and a non-parallel task output. In general, the parallelization list felt less cumbersome to build in tasks so far, because we're actually putting many items in there. So the dictionary didn't feel like too much red tape there

jluethi commented 3 months ago

One thought that came up while working on the Import OME-Zarr task: What's the name of return object that can contain image_list_updates, image_list_removals and filters? I have been thinking about it as image_list_updates, but that's very confusing ;)

What is the overarching term for this dict that a typical task returns? image_list_changes? Any good ideas for naming here?

tcompa commented 2 months ago

One thought that came up while working on the Import OME-Zarr task: What's the name of return object that can contain image_list_updates, image_list_removals and filters? I have been thinking about it as image_list_updates, but that's very confusing ;)

What is the overarching term for this dict that a typical task returns? image_list_changes? Any good ideas for naming here?

Within fractal-server, we typically call it task_output, and the corresponding Pydantic models are named TaskOutput and InitTaskOutput. On the server side, there is no ambiguity between the actual task data output (on-disk) and the task metadata output (the one that goes back to Fractal).

On the tasks side, "metadata" is a bit ambiguous as in "fractal metadata" vs "NGFF metadata". Therefore we could maybe use names that includes "fractal". Verbose options include fractal_metadata, fractal_metadata_output, fractal_output, ..

tcompa commented 2 months ago

Whether we should simplify the parallelization_lists: May be nice. Can we still robustly detect the output and validate it server-side? e.g. avoid confusion between an init_task and a non-parallel task output. In general, the parallelization list felt less cumbersome to build in tasks so far, because we're actually putting many items in there. So the dictionary didn't feel like too much red tape there

For the record: this switch (from return {"parallelization_list": [...]} to return [...]) is not implemented server-side. Within #671 I am reverting the output metadata of init tasks to the currently-supported structure. We can later decide whether to change the server&tasks behavior to avoid the need of writing "parallelization_list".

jluethi commented 2 months ago

Let's go with task_output as the convention for how we call this