fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License
11 stars 3 forks source link

Review import/export of workflows in v2.7.0 #1835

Open tcompa opened 2 weeks ago

tcompa commented 2 weeks ago

https://github.com/fractal-analytics-platform/fractal-server/issues/1829 should offer a way to move from the task source to the taskgroup attributes (plus a single task-specific attribute, e.g. its name).

I guess we should maintain

tcompa commented 1 week ago

Now mostly defined in #1831 and #1829.

tcompa commented 6 days ago

Now mostly defined in #1831 and #1829.

In view of major updates defined in these issues (and especially in https://github.com/fractal-analytics-platform/fractal-server/issues/1831#issuecomment-2393700802), the preliminary plan is the following:

The precise meaning of the new identity matching is based on https://github.com/fractal-analytics-platform/fractal-server/issues/1831#issuecomment-2393700802 and on https://github.com/fractal-analytics-platform/fractal-server/issues/1859. I'll later propose a precise priority-based definition in this issue - since we initially ignored the fact that we also may have to disambiguate similar tasks (ref https://github.com/fractal-analytics-platform/fractal-web/issues/582).

tcompa commented 2 days ago

Database tasks

  1. For any instance created afresh with v2.7, tasks have no source.
  2. For any instance at a version lower than v2.7, all tasks have (unique) sources
  3. For an instance upgraded from <2.7 to >=2.7, there will be a mix of tasks with source (the ones created before the migration) and without source (the newly-created ones).

Exported workflows

  1. All workflows exported to JSON with a version <2.7 have tasks identified by their full source.
  2. All workflows exported to JSON with a version >=2.7 have tasks identified by their "new" identity, which is a broader definition.
  3. The previous point holds independently on whether the tasks were created before or after the version upgrade.
  4. The new identity is made of a task-group identity (required pkg_name + optional version field) and a individual-task identity (name).

Imported workflows

  1. The JSON of a workflow to be imported into a v2.7 instance may include a mix of tasks defined by source and tasks defined by their new identity.
  2. Each task can be identified in a single way: it either defines a source or a pkg_name+version+name.
  3. Version is an optional field - see below.

Examples of tasks to be imported

Invalid examples

The following task specification is invalid, since it has both old and new identity information:

{
 "source": "fractal-tasks-core-1.2.3-cellpose",
 "pkg_name": "fractal-tasks-core",
 "version": "1.2.3",
 "name": "cellpose_segmentation"
}

The following task specification is invalid, since it lacks the task name:

{
 "pkg_name": "fractal-tasks-core",
 "version": "1.2.3"
}

Valid examples

Source-based identity

The following spec is OK

{
 "source": "fractal-tasks-core-1.2.3-cellpose"
}

and the backend will look for a task with this exact source (which, for the record, can only have been created before the v2.7 update).

New identity, with version

The following task is OK:

{
 "pkg_name": "fractal-tasks-core",
 "version": "1.2.3",
 "name": "cellpose_segmentation"
}

and the backend will look for a task which matches these three properties exactly. There may be more than one such task - see "disambiguation" section below.

New identity, without version

The following task is OK:

{
 "pkg_name": "fractal-tasks-core",
 "name": "cellpose_segmentation"
}

and the backend will look for a task which matches these three properties exactly (more on this below). There may be more than one such task - see "disambiguation" section below.

Disambiguation for non-unique tasks

When given a source, there may be either zero or one task matching. When given pkg_name+version+name, or pkg_name+name, there may be several matching tasks. Here is a description of how to proceed:

More than one result, with a given version set

A query for tasks with pkg_name+version+name may return up to 1+N results, where N is the number of user groups such that: (1) I am part of that group, and (2) the group owns one such task. In order to pick a specific task, the priority logic is as follows:

  1. A task that belong to the current user (task_group.user_id = my_user_id) has highest priority.
  2. A task that belongs to the All user group follows.
  3. Other results are listed based on the timestamp of the user/usergroup association (older -> higher priority).

ALTERNATIVE OPTION: we could also switch 1 and 2, and give "All" the highest priority.

More than one result, with no version set

When exporting a task that was collected automatically by fractal-server, it will always look like

{
 "pkg_name": "fractal-tasks-core",
 "version": "1.2.3",
 "name": "cellpose_segmentation"
}

but what should the backend do when requested to import a task like

{
 "pkg_name": "fractal-tasks-core",
 "name": "cellpose_segmentation"
}

?

Some options:

  1. We keep looking for an exact match and consider version=None, which means that we won't find any matching task in this case.
  2. We consider version=None as a dummy filter, and rather return the latest available version of the task. This means that by editing the JSON of an exported workflow we can make its import less deterministic but give it more chances of success on different instances (or on instances where some old tasks were later deleted or deactivated).

Side comment: non-accessible tasks

The import-workflow operation works on a per-user basis. It can only succeed if all tasks that are identified are also accessible to the user. If not, it should fail (rather than providing a partially-non-accessible workflow).

Side comment: non-active tasks

The import-workflow operation does not include a check of whether imported tasks are active. If I am importing an old workflow where I pinned all versions, and in the meanwhile those versions were deactivated on the fractal-server instance, then the imported workflow will appear with multiple warnings and I won't be able to run it as a job.

ALTERNATIVE OPTION: the import fails when no active matching tasks are found.

Side comment: task name

Up to now, we discussed task-group identity in multiple v2.7 threads. In order to specify a single task, we also need something more granular, and the obvious candidate here is the task name. This introduces some additional requirements for the backend:

Deprecation of backwards-compatibility

Moving forward, there will be a point in the future where we should discontinue the "import-task-based-on-source" feature. This will invalidate all exported-workflow JSON files created with versions <2.7, thus clearly being a breaking change.

jluethi commented 2 days ago

ALTERNATIVE OPTION: we could also switch 1 and 2, and give "All" the highest priority.

I think it's useful that "my personal task" has highest priority. It allows a user to recollect a task for themselves that would have issues in the ALL collection, thus working around some issues without needing admin support.

We consider version=None as a dummy filter, and rather return the latest available version of the task. This means that by editing the JSON of an exported workflow we can make its import less deterministic but give it more chances of success on different instances (or on instances where some old tasks were later deleted or deactivated).

I'd be in favor of that. If one imports a workflow with tasks that don't have their version specified, one gets the newest available versions of those tasks. That may or may not work (depending on whether parameters have changed). But given that we never export workflows that way, I think it's a good trade-off

The import-workflow operation does not include a check of whether imported tasks are active. If I am importing an old workflow where I pinned all versions, and in the meanwhile those versions were deactivated on the fractal-server instance, then the imported workflow will appear with multiple warnings and I won't be able to run it as a job.

As long as we can make those warnings visible to the user (e.g. also show feedback during import that some tasks were deactivated, they may need to update to a newer version), I think that's fine

Up to now, we discussed task-group identity in multiple v2.7 threads. In order to specify a single task, we also need something more granular, and the obvious candidate here is the task name. This introduces some additional requirements for the backend:

A task name should be immutable Upon task-group creation, we must verify that no two tasks with the same name exist in the same group.

Sounds good to me

Moving forward, there will be a point in the future where we should discontinue the "import-task-based-on-source" feature. This will invalidate all exported-workflow JSON files created with versions <2.7, thus clearly being a breaking change.

I'd be quite hesitant of dropping this, even once we don't have any of those tasks active anymore on the server. This breaks importing old workflows, so I'd much prefer to keep this active

tcompa commented 1 day ago

Upon task-group creation, we must verify that no two tasks with the same name exist in the same group.

Unique constraint on TaskV2 taskgroup_id and name columns.