fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Add Pydantic model for `Channel`s #386

Closed tcompa closed 1 year ago

tcompa commented 1 year ago

Branching from https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/377.

For the moment I could only obtain something useful via Pydantic; I got nowhere with dataclasses or plain classes. A basic example is this script

from typing import Optional
import json

from pydantic import BaseModel
from pydantic.decorator import ValidatedFunction

# Helpers from fractal-tasks-core that strip Pydantic-internal entries
# from the generated schema
from lib_args_schemas import _remove_args_kwargs_properties
from lib_args_schemas import _remove_pydantic_internals

class Channel(BaseModel):
    wavelength_id: str
    label: str
    start: Optional[int]
    end: Optional[int]

def function(channels: list[Channel]):
    pass

# Build the JSON Schema of the function arguments via Pydantic internals
schema = ValidatedFunction(function, config=None).model.schema()
schema = _remove_args_kwargs_properties(schema)
schema = _remove_pydantic_internals(schema)
print(json.dumps(schema, indent=2, sort_keys=True))

that prints this schema

{
  "additionalProperties": false,
  "definitions": {
    "Channel": {
      "properties": {
        "end": {
          "title": "End",
          "type": "integer"
        },
        "label": {
          "title": "Label",
          "type": "string"
        },
        "start": {
          "title": "Start",
          "type": "integer"
        },
        "wavelength_id": {
          "title": "Wavelength Id",
          "type": "string"
        }
      },
      "required": [
        "wavelength_id",
        "label"
      ],
      "title": "Channel",
      "type": "object"
    }
  },
  "properties": {
    "channels": {
      "items": {
        "$ref": "#/definitions/Channel"
      },
      "title": "Channels",
      "type": "array"
    }
  },
  "required": [
    "channels"
  ],
  "title": "Function",
  "type": "object"
}

Notes:

  1. We should make sure that this schema can be supported in fractal-web.
  2. This approach has side effects:
    • Pydantic becomes a primary dependency for the tasks (adding to the discussion in #375);
    • We will address attributes via channel.label, rather than channel["label"] (not a big deal, but propagating this into the fractal_tasks_core.lib subpackage could be annoying, because suddenly any other package using our lib will have to depend on pydantic at runtime).
  3. We won't be able to fully exploit the flexibility of JSON Schemas (e.g. adding mutually exclusive class attributes, or other more complex definitions), or at least not easily. This does not actually depend on the Pydantic option discussed here, but it is related to our automatic creation of schemas. There may be a point in the future where we can move (or at least copy) some logical checks from the task Python function to the JSON Schemas of its arguments, but we'll have to decide where/how this operation is made (is it done by hand? is the relevant info stored somewhere?)
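As a partial workaround for note 3, Pydantic v1 does offer one escape hatch: `Config.schema_extra` merges a hand-written dict into the generated JSON Schema, which can express constraints (like mutually exclusive or at-least-one-of attributes) that plain model fields cannot. A minimal sketch, assuming hypothetical fields; the try/except import lets it run under both Pydantic v1 and the v1 shim shipped with Pydantic v2:

```python
# Sketch: injecting hand-written JSON Schema constraints that Pydantic fields
# cannot express natively (e.g. "at least one of these attributes").
from typing import Optional

try:
    from pydantic.v1 import BaseModel  # Pydantic v2 installed (v1 shim)
except ImportError:
    from pydantic import BaseModel  # Pydantic v1

class Channel(BaseModel):
    wavelength_id: Optional[str]
    label: Optional[str]

    class Config:
        # Merged verbatim into the generated schema: require at least one
        # of the two identifying attributes.
        schema_extra = {
            "anyOf": [
                {"required": ["wavelength_id"]},
                {"required": ["label"]},
            ]
        }

schema = Channel.schema()
```

This only covers constraints we write by hand; it does not answer where such logic should live, which is the open question above.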
tcompa commented 1 year ago

For the record, and assuming we proceed as described above:

If we add a default value to one of the internal variables (e.g. `wavelength_id: str = "something"`), then this is correctly written into the JSON Schema, meaning that it could be available in fractal-web.

Note, however, that this will not be used by fractal-server when populating WorkflowTask.args with the defaults in the JSON Schemas, since only the "first-level" defaults will be read and assigned to args (ref https://github.com/fractal-analytics-platform/fractal-server/issues/639). This looks like a very intuitive behavior to me, but let's make sure we all agree.
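A minimal sketch of this behavior (field names follow the example above, the wrapper model name is illustrative): a default on a Channel attribute shows up inside the schema `definitions`, not as a first-level argument default, so it would not be picked up when populating args.

```python
# A default on a nested-model attribute lives under `definitions`, while the
# first-level `channels` argument itself has no default and stays required.
from typing import List, Optional

try:
    from pydantic.v1 import BaseModel  # Pydantic v2 installed (v1 shim)
except ImportError:
    from pydantic import BaseModel  # Pydantic v1

class Channel(BaseModel):
    wavelength_id: str = "A01_C01"  # attribute-level ("second-level") default
    label: Optional[str] = None

class FunctionArgs(BaseModel):
    channels: List[Channel]  # first-level argument: required, no default

schema = FunctionArgs.schema()
channel_props = schema["definitions"]["Channel"]["properties"]
```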

jluethi commented 1 year ago

This approach has side effects: Pydantic becomes a primary dependency for the tasks (adding to the discussion in https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/375);

I guess that's a price that should be fine to pay, especially if we go the pydantic.v1 route once reasonable?

We will address attributes via channel.label, rather than channel["label"] (not a big deal, but propagating this into the fractal_tasks_core.lib subpackage could be annoying, because suddenly any other package using our lib will have to depend on pydantic at runtime).

I'd even call this a potential win, channel.label seems like the better way to reference it.

We won't be able to fully exploit the flexibility of JSON Schemas (e.g. adding mutually exclusive class attributes, or other more complex definitions), or at least not easily. This does not actually depend on the Pydantic option discussed here, but it is related to our automatic creation of schemas.

Pity, that would be neat.

There may be a point in the future where we can move (or at least copy) some logical checks from the task Python function to the JSON Schemas of its arguments, but we'll have to decide where/how this operation is made (is it done by hand? is the relevant info stored somewhere?)

That would be great to make the tasks a bit simpler, i.e. having less input validation at the start.

Note, however, that this will not be used by fractal-server when populating WorkflowTask.args with the defaults in the JSON Schemas, since only the "first-level" defaults will be read and assigned to args (ref https://github.com/fractal-analytics-platform/fractal-server/issues/639). This looks a very intuitive behavior to me, but let's make sure we all agree.

Can we briefly discuss this tomorrow? Not sure I fully understand this and I'd be a big fan of the workflow that's downloaded actually containing all the args that were run (though if there's no way to change the defaults on the server-side anymore, the risks here do decrease a bit)

tcompa commented 1 year ago

I'm generally OK with moving in this direction, but this also forces us to take a more strict choice out of the two following options:

  1. Whoever will want to use the task as a Python function (outside Fractal) will have to create objects of the Channel type, e.g. as in

    from fractal_tasks_core.channels import Channel
    from fractal_tasks_core.tasks.create_ome_zarr import create_ome_zarr
    my_channels = [Channel(...something...)]
    create_ome_zarr(..., my_channels, ...)
  2. Since we are now saying that pydantic may become an actual dependency, we can switch from an optional coerce_and_validate=True/False flag (currently part of the fractal_tasks_core.tasks._utils.run_fractal_task function) to always coercing & validating task arguments.

Up to now we were keeping pydantic on the run_fractal_task side only, but if it enters the main function definition then I don't see compelling reasons for this any more.
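A sketch of option 2, using Pydantic v1's `validate_arguments` decorator (the function name and parameters below are illustrative stand-ins, not the real task signature):

```python
# Sketch: decorate the task function itself so arguments are always coerced
# and validated, instead of an opt-in coerce_and_validate flag.
from typing import List, Optional

try:
    from pydantic.v1 import BaseModel, validate_arguments  # v1 shim in v2
except ImportError:
    from pydantic import BaseModel, validate_arguments  # Pydantic v1

class Channel(BaseModel):
    wavelength_id: str
    label: Optional[str] = None

@validate_arguments
def create_ome_zarr_stub(num_levels: int, channels: List[Channel]) -> int:
    # Plain dicts passed by the runner are coerced into Channel instances,
    # and "5" is coerced into the int 5, before the body runs.
    return len(channels)

n = create_ome_zarr_stub("5", [{"wavelength_id": "A01_C01"}])
```

Invalid arguments (e.g. a non-numeric `num_levels`, or a channel dict missing `wavelength_id`) then raise a ValidationError before the task body executes.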

tcompa commented 1 year ago

Can we briefly discuss this tomorrow? Not sure I fully understand this and I'd be a big fan of the workflow that's downloaded actually containing all the args that were run (though if there's no way to change the defaults on the server-side anymore, the risks here do decrease a bit)

Sure, let's rediscuss it.

What I'm saying, briefly, is that if there is a default for a specific Channel attribute (e.g. start=0) this is irrelevant for fractal-server, since it is not a default value for a function parameter, but it's a default value for an attribute of the type hint of a function parameter. This makes sense to me: fractal-server will never have to introduce an additional channel, but the user will have to do it (e.g. in fractal-web).

tcompa commented 1 year ago

As of the call this morning, we'll proceed with this approach for Channel. Some notes:

  1. The definition will be part of https://github.com/fractal-analytics-platform/fractal-tasks-core/blob/main/fractal_tasks_core/lib_channels.py.
  2. The model will perform some validation (either directly or through a Pydantic validator), and in this way we'll be able to remove some of the preliminary checks from the task functions. For instance validate_allowed_channel_input won't have to verify that wavelength_id is defined, since it will be a required attribute of the class.
  3. Should the model name simply be Channel?
  4. Description of attributes of internal models (e.g. the description of Channel.label) won't propagate to the JSON Schema, for the moment. We can open another issue for that task.
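A sketch of note 2 (the field set and the extra check are hypothetical): `wavelength_id` becomes a required field, so the model itself rejects inputs that validate_allowed_channel_input used to catch, and a Pydantic validator can add further constraints.

```python
# Sketch: validation moves into the model. A missing wavelength_id fails at
# construction time; a validator adds a cross-field check as an example.
from typing import Optional

try:
    from pydantic.v1 import BaseModel, ValidationError, validator  # v1 shim
except ImportError:
    from pydantic import BaseModel, ValidationError, validator  # Pydantic v1

class Channel(BaseModel):
    wavelength_id: str  # required: omitting it raises ValidationError
    label: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None

    @validator("end")
    def end_after_start(cls, v, values):
        # Hypothetical extra check: if both bounds are given, start <= end.
        start = values.get("start")
        if v is not None and start is not None and v < start:
            raise ValueError("`end` must be >= `start`")
        return v
```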

@jluethi feel free to mention other cases where this approach may be relevant, apart from Channel.

jluethi commented 1 year ago

Should the model name simply be Channel?

Given that the channel contains metadata that we put in the omero metadata and should be quite general, I'm happy with going with Channel here.

Description of attributes of internal models (e.g. the description of Channel.label) won't propagate to the JSON Schema, for the moment. We can open another issue for that task.

Ok, those would be useful to have as well eventually, let's track that separately.

@jluethi feel free to mention other cases where this approach may be relevant, apart from Channel.

I will keep an eye on this when reviewing the core tasks' default values. One example that comes to mind: could we now constrain model_type in the cellpose task to models.MODEL_NAMES, i.e. a defined list of valid models? (Maybe even something where we can describe the valid options for a potential drop-down selection, but that's a new feature wish.) Using models.MODEL_NAMES was not possible with the pydantic approach we were using before; to be tested whether it would work now. I'll see if I find other examples.

tcompa commented 1 year ago

For the record, here is the schema for schema['properties']['omero']['properties']['channels']['items'] from ome-ngff 0.4:

{
  "type": "object",
  "properties": {
    "window": {
      "type": "object",
      "properties": {
        "end": { "type": "number" },
        "max": { "type": "number" },
        "min": { "type": "number" },
        "start": { "type": "number" }
      },
      "required": ["start", "min", "end", "max"]
    },
    "label": { "type": "string" },
    "family": { "type": "string" },
    "color": { "type": "string" },
    "active": { "type": "boolean" }
  },
  "required": ["window", "color"]
}

to which we are for sure adding some optional first-level properties like:

wavelength_id: str
coefficient: int = 1
inverted: bool = False

There is a (minor) decision to take, concerning the window attribute. The correct way to go is that we have a nested model, as in

class ChannelWindow(BaseModel):
    min: float
    max: float
    start: float
    end: float

class Channel(BaseModel):
    wavelength_id: str
    window: ChannelWindow
    label: Optional[str]
    ...

which brings us closer to the ome-ngff schema for omero channels.

This choice, however, will break our current examples, because we introduced some additional first-level attributes like start and end that we later assigned to window["start"] and window["end"]. That is, what we currently have as

{
...
"start": 0,
...
}

will need to become something like

{
"window": {"start": 0, ...}
}

I'm in favor of this change, to move our model closer to the specs, but it's not strictly needed - thus I can avoid introducing it if we are annoyed by the breaking change (@jluethi).
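A minimal sketch of the nested-window variant described above (making the window fields optional here, since our logic fills in min/max; types follow the ome-ngff "number" declaration):

```python
# Sketch: `window` becomes its own nested model, so the new argument shape
# {"window": {"start": 0, ...}} parses directly. The old flat "start" key is
# silently ignored by Pydantic v1's default config, illustrating the break.
from typing import Optional

try:
    from pydantic.v1 import BaseModel  # Pydantic v2 installed (v1 shim)
except ImportError:
    from pydantic import BaseModel  # Pydantic v1

class ChannelWindow(BaseModel):
    # min/max will typically be filled in by our own logic, hence optional
    min: Optional[float] = None
    max: Optional[float] = None
    start: Optional[float] = None
    end: Optional[float] = None

class Channel(BaseModel):
    wavelength_id: str
    window: Optional[ChannelWindow] = None
    label: Optional[str] = None

old_style = {"wavelength_id": "A01_C01", "start": 0}              # old flat shape
new_style = {"wavelength_id": "A01_C01", "window": {"start": 0}}  # new nested shape

channel = Channel(**new_style)
```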

tcompa commented 1 year ago

Some additional context:

What we are currently doing goes in the same direction as https://github.com/JaneliaSciComp/pydantic-ome-ngff, which in my view is very relevant for long-term integration with ome-zarr through formal validation against known schemas (or Pydantic models).

In the current (very preliminary!) status of pydantic-ome-ngff:

omero and bioformats2raw are not currently supported, but contributions adding support for those models would be welcome.

That said, I think it'll be worth mentioning this example on that repo, at some point, to show our interest and possibly provide additional motivation. To be clear, I don't think we should depend on that package any time soon, but if it is transformed into an "official" encoding of the specs (we are not there yet), then I'd be quite interested in using it.

tcompa commented 1 year ago

Some more additional context

With no need for comments, see this code block from https://github.com/czbiohub/iohub/blob/main/iohub/ngff_meta.py#L231


class WindowDict(TypedDict):
    """Dictionary of contrast limit settings"""

    start: float
    end: float
    min: float
    max: float

class ChannelMeta(MetaBase):
    """Channel display settings without clear documentation from the NGFF spec.
    https://docs.openmicroscopy.org/omero/5.6.1/developers/Web/WebGateway.html#imgdata
    """

    active: bool = False
    coefficient: float = 1.0
    color: ColorType = "FFFFFF"
    family: str = "linear"
    inverted: bool = False
    label: str = None
    window: WindowDict = None

jluethi commented 1 year ago

wavelength_id: str
coefficient: int = 1
inverted: bool = False

I'm aware of wavelength_id. We don't really do anything with coefficient (don't even know what this is) and inverted, right?

I'm in favor of this change, to move our model closer to the specs, but it's not strictly needed - thus I can avoid introducing it if we are annoyed by the breaking change (@jluethi).

Now's the time for such breaking changes :) I'm also in favor of moving close to their model.

We can't fully adopt their spec for our input though: we will always calculate the correct min & max, never ask the user (they are based on whether the image is uint8 or uint16). Thus, we will have a slightly different input spec than their definition for what we request as inputs. I'm also not foreseeing asking the user for coefficient (whatever that actually does).

Other than that, big fan of moving closer to them and, in the far future, even depending on a package like that!

tcompa commented 1 year ago

I'm aware of wavelength_id. We don't really do anything with coefficient (don't even know what this is) and inverted, right?

We have always been using these two default values (coefficient=1 and inverted=False), since the major channel refactor. They were also present in the first examples we discussed, and they are part of the transitional OME-NGFF omero metadata.

I can take them away, if there is any reason, or leave them as they are.

https://docs.openmicroscopy.org/omero/5.6.1/developers/Web/WebGateway.html#imgdata

tcompa commented 1 year ago

We can't fully adopt their spec for our input though: We will always calculate the correct min & max, never ask the user

Yes, indeed. We will have to set them as optional, and then assign them inside our own logic. But that's a reasonable trade-off, IMO.

tcompa commented 1 year ago

We can't fully adopt their spec for our input though: We will always calculate the correct min & max, never ask the user

Yes, indeed. We will have to set them as optional, and then assign them inside our own logic. But that's a reasonable trade-off, IMO.

This will also be the case for colormap, which we are assigning as part of our logic (and not requiring from the user).

jluethi commented 1 year ago

I can take them away, if there is any reason, or leave them as they are.

I'd leave them hard-coded until we figure out if we want to do something with them

Yes, indeed. We will have to set them as optional, and then assign them inside our own logic. But that's a reasonable trade-off, IMO.

Agreed, that's a good trade-off for these things! Same for colormap!

tcompa commented 1 year ago

Worth mentioning:

As of https://github.com/fractal-analytics-platform/fractal-tasks-core/pull/410/commits/153fb061a2e38b369f1bc3a3af038ce1f838a0a1, we have a test for define_omero_channels (the function that goes from a list of our-own custom Channel objects to a standard JSON object to be printed in the zarr metatata), which directly validates the output against OME-NGFF JSON Schema.

This makes it clear, once again, that introducing a deeper integration of the OME-NGFF schemas will be a very interesting idea for the future.
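The idea behind that test can be sketched with a minimal stdlib check against the `required` constraints of the omero-channel schema fragment quoted earlier in this thread (the actual test presumably validates against the full OME-NGFF JSON Schema; `check_omero_channel` is a hypothetical helper, not the real test code):

```python
# Minimal sketch: verify generated channel metadata against the `required`
# keys of the omero-channel schema fragment quoted above. A real test would
# use a JSON Schema validator with the full OME-NGFF schema instead.

CHANNEL_REQUIRED = ["window", "color"]
WINDOW_REQUIRED = ["start", "min", "end", "max"]

def check_omero_channel(channel: dict) -> list:
    """Return a list of human-readable violations (empty list = valid)."""
    errors = [f"missing key: {k}" for k in CHANNEL_REQUIRED if k not in channel]
    window = channel.get("window")
    if isinstance(window, dict):
        errors += [
            f"missing window key: {k}" for k in WINDOW_REQUIRED if k not in window
        ]
    return errors

good = {"color": "00FFFF", "window": {"start": 110, "end": 800, "min": 0, "max": 65535}}
bad = {"label": "DAPI", "window": {"start": 110}}
```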

jluethi commented 1 year ago

Very nice!

tcompa commented 1 year ago

Quick question: the channel.window attribute is a required one, in ome-ngff.

Should we fail when we try processing an existing zarr file that does not comply (i.e. that doesn't have the window attribute)? Or should we also make window an optional attribute?

(I guess the latter is true, or we will end up being a bit more strict in what we can process.)

jluethi commented 1 year ago

As far as input to a fractal tasks are concerned, the window should be something users can optionally specify. They may specify a start & end (=> affects display of the image). They should likely never change the min & max (based on uint8 & uint16 etc.)

I thought the omero metadata supports not having the start & end specified. If they are not specified, the napari viewer just handles that as far as I know.

Thus, we only optionally ask start & end from user. We always provide a window but so far, this window always included min & max, while start & end were optional
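The "we always calculate the correct min & max" part can be sketched as follows (function and mapping names are illustrative; only the uint8/uint16 ranges mentioned above are covered):

```python
# Sketch: derive window min/max from the image dtype instead of asking the
# user; keep whatever start/end the user optionally provided.

DTYPE_RANGES = {
    "uint8": (0, 255),
    "uint16": (0, 65535),
}

def fill_window(window: dict, dtype: str) -> dict:
    """Return a window dict with min/max set from dtype, preserving start/end."""
    lo, hi = DTYPE_RANGES[dtype]
    filled = dict(window)  # do not mutate the caller's dict
    filled["min"] = lo
    filled["max"] = hi
    # start/end stay optional: if absent, viewers pick their own defaults
    return filled

window = fill_window({"start": 110, "end": 800}, "uint16")
```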

jluethi commented 1 year ago

Also, the omero metadata is labeled as transitional, see https://ngff.openmicroscopy.org/latest/index.html#omero-md

Thus, I would not expect that all OME-Zarrs we ever process will have an omero metadata. I wouldn't be too strict here thus.

tcompa commented 1 year ago

As far as input to a fractal tasks are concerned, the window should be something users can optionally specify. They may specify a start & end (=> affects display of the image).

What about an existing zarr? Does it have to have windows specified?

EDIT: answer was in the second comment..

tcompa commented 1 year ago

Thus, we only optionally ask start & end from user. We always provide a window but so far, this window always included min & max, while start & end were optional

Agreed. This is the current behavior (in the PR) for when we create a zarr ourselves.

tcompa commented 1 year ago

Also, the omero metadata is labeled as transitional, see https://ngff.openmicroscopy.org/latest/index.html#omero-md Thus, I would not expect that all OME-Zarrs we ever process will have an omero metadata. I wouldn't be too strict here thus.

I'm now merging #410, but another caveat is due (and we can continue in a different issue if it requires a discussion) about how to handle existing OME-Zarrs.

Right now, in #410, I did use the new Channel model in multiple places:

  1. Obviously for the arguments of create_ome_zarr and create_ome_zarr_multiplex
  2. In many other tasks (yokogawa-to-zarr, illumination correction, napari workflows, cellpose) which need to read channels from an existing OME-Zarr.

For case 1, it's OK to be as strict as possible (that is, a user cannot include arbitrary metadata in a newly created OME-Zarr), but this could be an issue for case 2. What should we do if we receive an OME-Zarr created outside Fractal, that includes some non-standard metadata?

With the current version of #410, all these metadata would be discarded upon reading from disk. This is not a severe issue at the moment, because we only read the channels metadata (in case 2) and never write them back to disk. It would become an issue if we have a task that also writes the channel metadata back to disk, because in that case we would be losing some of the original non-standard metadata.
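The discard-versus-preserve behavior comes down to Pydantic's `extra` config. A sketch with hypothetical minimal models (the real Channel has more fields): the default config silently drops unknown keys on read, while `extra = "allow"` would keep them for a later write-back.

```python
# Sketch: with Pydantic v1's default config, unknown channel metadata is
# silently dropped; `extra = "allow"` preserves it on the model instance.
from typing import Optional

try:
    from pydantic.v1 import BaseModel  # Pydantic v2 installed (v1 shim)
except ImportError:
    from pydantic import BaseModel  # Pydantic v1

class StrictChannel(BaseModel):
    # default behavior (extra fields ignored): non-standard keys are lost
    wavelength_id: Optional[str] = None
    label: Optional[str] = None

class LenientChannel(BaseModel):
    wavelength_id: Optional[str] = None
    label: Optional[str] = None

    class Config:
        extra = "allow"  # keep unknown keys so they survive a write-back

external = {"label": "DAPI", "vendor_specific_tag": 42}  # non-Fractal metadata
```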

jluethi commented 1 year ago

Good points.

There are 2 areas where this may become relevant:

  1. In the context of taking arbitrary external OME-Zarr files as input => how do we parse their metadata correctly? Do our assumptions work in this broader context? https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/351
  2. In the context of using the MD converter task => does that still work? I think it should. See https://github.com/jluethi/faim-hcs/tree/fractal-roi-tables

tcompa commented 1 year ago

For the record, what was described in https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/386#issuecomment-1582477981 is now tested as part of fractal-demos, with create-ome-zarr arguments that look like

{
    "allowed_channels": [
        {
            "color": "00FFFF",                 // renamed from `colormap` to `color`, as per specs
            "wavelength_id": "A01_C01",
            "label": "DAPI",
            "window":{                         // this nesting of `window` is new
                "start": 110,
                "end": 800
            }
        },
        {
            "color": "FF00FF",
            "wavelength_id": "A01_C02",
            "label": "nanog",
            "window": {
                "start": 110,
                "end": 290
            }
        },
        {
            "color": "FFFF00",
            "wavelength_id": "A02_C03",
            "label": "Lamin B1",
            "window": {
                "start": 110,
                "end": 1600
            }
        }
        ]
    ,
    "coarsening_xy": 2,
    "num_levels": 5,
    "image_extension": "png"
}