fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License
11 stars 5 forks source link

Cellpose normalization #738

Closed fstur closed 2 weeks ago

fstur commented 1 month ago

Checklist before merging

The cellpose_segmentation task currently accepts a custom normalization option. The problem is, that if two channels are used for the cellpose segmentation, both channels are normalized with the same CellposeCustomNormalizer, but one might want to normalize them with different parameters. I added an optional normalize2 argument to cellpose_segmentation, and updated segement_ROI accordingly. No tests are implemented yet.

jluethi commented 3 weeks ago

Thanks a lot for the PR @fstur ! Very useful functionality to have indeed.

As discussed today, @lorenzocerrone and I refactored the Cellpose input models to handle the normalization aspects better: Normalization can now be specified when selecting the channel. This needed a bit of reworking to make it work with the Pydantic models afterwards. I also used this opportunity to move a bit of code out of the task logic into the new CellposeChannel1InputModel.

Things that remain:


Unrelated (cc @tcompa ): I'm suggesting we disabled flake8 E203 checks, because this block has a flake8 vs. black conflict about whether there should be spaces around the colon:

x[channels[0] - 1 : channels[0]] = normalized_img(
                x[channels[0] - 1 : channels[0]],
                lower_p=normalize.lower_percentile,
                upper_p=normalize.upper_percentile,
                lower_bound=normalize.lower_bound,
                upper_bound=normalize.upper_bound,
            )

It's also what made the checks above fail before.

I also added an exception for flake8 W503, which conflicts with black in how to format the cellpose tests. (unrelated: do we want to consider moving to ruff linting?)

tcompa commented 3 weeks ago

Unrelated (cc @tcompa ): I'm suggesting we disabled flake8 E203 [...]

Agreed. This handling of : also annoys me, on top of often requiring # noqa comments.

(unrelated: do we want to consider moving to ruff linting?)

Sure, nothing against it. It's not like pre-commit takes very long (time poetry run pre-commit run --all-files takes ~1.7 seconds on my machine, which includes reordering imports, black, flake8 and bandit), but of course a speedup wouldn't hurt. Once we have it in this repo, let's also make sure we adopt it in fractal-server and fractal-tasks-core - to keep them consistent.

jluethi commented 3 weeks ago

Hey @fstur I finished my work on this PR. It now has the channel & normalization input combined, plus the whole normalization logic moved to its own, tests helper function (see _normalize_cellpose_channels). I also modified the advanced model inputs into a CellposeModelParams object that can be passed to the task, so the logic of the task input changes a bit both on the Python side as well as the Fractal web side.

Can you check whether this version now works for you? I'd be happy with merging this into fractal tasks core & then providing the normalization approach for both channels.


@tcompa Can you review this PR?

One of the slightly weird things I'm doing is making channel2 optional in the task signature to avoid it being expanded in Fractal web. Because the whole block is already quite big. Fractal web always initializes an object for it though, because the Normalizer has default values => I'll open a separate issue to discuss this further.


@lorenzocerrone If you also want to have a look, open for further feedback :) But not a requirement that you review the PR.

tcompa commented 3 weeks ago

One of the slightly weird things I'm doing is making channel2 optional in the task signature to avoid it being expanded in Fractal web

This comment is not relevant any more as of https://github.com/fractal-analytics-platform/fractal-tasks-core/pull/738/commits/b63547f5c0295d812c10d5359053d561da6020c5, right?

jluethi commented 3 weeks ago

This comment is not relevant any more as of https://github.com/fractal-analytics-platform/fractal-tasks-core/commit/b63547f5c0295d812c10d5359053d561da6020c5, right?

Right! Now I made the normalize part of channel have a default. Would be great if you could review if the use of Field with default_factory mode is a good approach for this

tcompa commented 3 weeks ago

Re: channel2 type hint and JSON Schema

I could not find a better way for making channel2 fully optional (as in "it is set to null in fractal-web UI unless at lease one of its attribute was set explicitly by the user"). Thus I'd proceed as in the PR: the is_set() method is quite clear in its meaning, and it doesn't introduce a lot of complexity.

Higher-level: I agree about

Fractal web always initializes an object for it though, because the Normalizer has default values

and this is what blocks us towards a clean version where channel2 is fully optional. Note that this issue is at least partly a consequence of fractal-web logic: when working with a CLI client we wouldn't have issues in setting a None default for an Optional function argument.


I'll open a separate issue to discuss this further.

I think it's worth continuing this discussion (in a different issue, indeed), because it sounds like a possibly frequent pattern. Note that the outcome in principle could also look like "refactor the CellposeCustomNormalizer model so that it has no default value", since this would likely solve the whole issue.

[continues below, concerning default_factory]

tcompa commented 3 weeks ago

Re: default_factory

Using arg: Field(default_factory=Model) rather than arg: Model = Model() has a difference, but I think it is not relevant for this case. Note that the latter is the one we are currently using in main for https://github.com/fractal-analytics-platform/fractal-tasks-core/blob/00cbfdfc062d6c615cef073b29d4628224903fa0/fractal_tasks_core/tasks/cellpose_segmentation.py#L214

The main difference is that the produced JSON Schemas are different. In the example below, I reproduce a minimal example with two function arguments (one always using default_factory, and the other using the same Model() option):

Python

from typing import Optional, Literal
from pydantic import BaseModel, Field

class Normalizer(BaseModel):
    """
    Description of Normalizer

    Attributes:
        type: Description of type
        time: Description of time
    """
    type: Literal["option_a", "option_b"] = "option_a"

class MyModelDefaultFactory(BaseModel):
    """
    Description of ChannelModel

    Attributes:
        normalizer: Description of normalizer.
        time: Description of time.
        attr: Description of attr1.
    """
    normalizer: Normalizer = Field(default_factory=Normalizer)
    time: float = Field(default_factory=time.perf_counter)
    attr: Optional[str] = None

class MyModelObjectInit(BaseModel):
    """
    Description of ChannelModel

    Attributes:
        normalizer: Description of normalizer.
        time: Description of time.
        attr: Description of attr1.
    """
    normalizer: Normalizer = Normalizer()
    time: float = time.perf_counter()
    attr: Optional[str] = None

@validate_arguments
def function(
        arg_default_factory: MyModelDefaultFactory = Field(default_factory=MyModelDefaultFactory),
        arg2: MyModelObjectInit = MyModelObjectInit(),
        ):
    """
    Description of function

    Args:
        arg_default_factory: Description of arg_default_factory
        arg2: Description of arg2
    """
    pass

which leads to

{
  "title": "Function",
  "type": "object",
  "properties": {
    "arg_default_factory": {
      "$ref": "#/definitions/MyModelDefaultFactory",
      "title": "Arg_Default_Factory",
      "description": "Description of arg_default_factory"
    },
    "arg2": {
      "title": "Arg2",
      "default": {
        "normalizer": {
          "type": "option_a"
        },
        "time": 24457.131491715,
        "attr": null
      },
      "allOf": [
        {
          "$ref": "#/definitions/MyModelObjectInit"
        }
      ],
      "description": "Description of arg2"
    }
  },
  "additionalProperties": false,
  "definitions": {
    "Normalizer": {
      "title": "Normalizer",
      "description": "Description of Normalizer",
      "type": "object",
      "properties": {
        "type": {
          "title": "Type",
          "default": "option_a",
          "enum": [
            "option_a",
            "option_b"
          ],
          "type": "string"
        }
      }
    },
    "MyModelDefaultFactory": {
      "title": "MyModelDefaultFactory",
      "description": "Description of ChannelModel",
      "type": "object",
      "properties": {
        "normalizer": {
          "$ref": "#/definitions/Normalizer",
          "title": "Normalizer"
        },
        "time": {
          "title": "Time",
          "type": "number"
        },
        "attr": {
          "title": "Attr",
          "type": "string"
        }
      }
    },
    "MyModelObjectInit": {
      "title": "MyModelObjectInit",
      "description": "Description of ChannelModel",
      "type": "object",
      "properties": {
        "normalizer": {
          "title": "Normalizer",
          "default": {
            "type": "option_a"
          },
          "allOf": [
            {
              "$ref": "#/definitions/Normalizer"
            }
          ]
        },
        "time": {
          "title": "Time",
          "default": 24457.131491715,
          "type": "number"
        },
        "attr": {
          "title": "Attr",
          "type": "string"
        }
      }
    }
  }
}

The main difference here is the presence of

 "default": {
        "normalizer": {
          "type": "option_a"
        },
        "time": 24457.131491715,
        "attr": null
      },

for arg2, while it is missing when we use default_factory. Part of the explanation is that default_factory should not be filled upon schema generation (see e.g. https://github.com/pydantic/pydantic/issues/1520), and the time attribute that I include above helps to show this behavior (it is present when it's called explicitly, but not when using a default_factory=time.perf_counter).

Does this matter for fractal-web?

In the fractal-web UI, this difference appears as irrelevant. If I don't modify any form field, current data are identical for the two cases (apart from the time attribute, but that is expected): image

Does this matter for fractal-server?

For top-level arguments (that is, this does not concern nested attribute of complex arguments), the presence/absence of a default in the JSON Schema does matter in fractal-server. If a default value is present, then it is used as a basis during any insert-worfklowtask or patch-workflowtask operation. Considering the example above:

  1. If we always go through fractal-web, there will always be a client-provided value (and therefore the default is not relevant).
  2. If we use a different client, then the usage based on default_factory will have no default in the arguments JSON file, and the default will be generated at runtime via the default_factory.

I cannot find any issue with either option, right now, but I wouldn't be surprised if we eventually identify some edge case of one or the other.

[continues with a TL;DR]

tcompa commented 3 weeks ago

TL;DR re: default_factory

  1. The advantage of using default_factory is that we are not using a mutable function argument (which is a bad practice, but irrelevant for the standard command-line usage of fractal tasks)
  2. The advantage of using arg: MyModel = MyModel() is that the default field in the JSON Schema is set, making it more transparent.
  3. The two cases appear in the same in fractal-web, for my dummy example and also for the current PR.
  4. I cannot think of an actual problematic edge case related to using one or the other.

If I were to choose, I would replace

channel2: CellposeChannel2InputModel = Field(default_factory=CellposeChannel2InputModel)

with

channel2: CellposeChannel2InputModel = CellposeChannel2InputModel()

which creates a more transparent JSON Schema that includes

-            "$ref": "#/definitions/CellposeChannel2InputModel",
             "title": "Channel2",
+            "default": {
+              "wavelength_id": null,
+              "label": null,
+              "normalize": {
+                "type": "default",
+                "lower_percentile": null,
+                "upper_percentile": null,
+                "lower_bound": null,
+                "upper_bound": null
+              }
+            },
+            "allOf": [
+              {
+                "$ref": "#/definitions/CellposeChannel2InputModel"
+              }
+            ],

but this is just personal preference and I have no specific reasons for it.

tcompa commented 3 weeks ago

This closes my feedback on the arguments/schema-related part of the PR.

I can still review the rest of the PR, if it's useful, but that will be a bit later.

jluethi commented 2 weeks ago

Thanks for the review & the reasoning on trade-offs @tcompa . We'll stay with this version for the time being and will evaluate later whether we'll have a strong opinion one way or the other for Fractal tasks.

jluethi commented 2 weeks ago

Thanks @fstur for starting this work, I'm now merging this PR so we have the updated version of the Cellpose task with easier input handling & normalize options for both channels available in main :)