fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Introduce Pydantic models for NGFF metadata #528

Closed · tcompa closed this 11 months ago

tcompa commented 11 months ago

close #351 close #501 close #532 close #533


The main goal is to extract some OME-NGFF-related metadata directly from an OME-Zarr, rather than from the metadata input parameter (ref #351). This is a clear goal that can be achieved in multiple ways. The simplest one is to have several helper functions, each constructing a given parameter (e.g. num_levels) from the attributes of an OME-Zarr group.
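
For illustration, such a helper (hypothetical code, not part of this PR) might look like:

import zarr

def get_num_levels(zarrurl: str) -> int:
    # Read the number of pyramid levels directly from the OME-Zarr group
    # attributes (NGFF 0.4 layout: one "datasets" entry per level).
    attrs = zarr.open_group(zarrurl, mode="r").attrs.asdict()
    return len(attrs["multiscales"][0]["datasets"])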

With the current PR, I'm proposing a different approach that brings in broader changes. I'm introducing Pydantic models that encode (part of) the OME-NGFF specs, and then implementing additional logic as part of these models (via properties or methods), e.g. to extract num_levels. Instead of writing the Pydantic models myself, I generated them automatically from the JSON Schema of NGFF 0.4 (via https://github.com/koxudaxi/datamodel-code-generator) and then tweaked them a bit, resulting e.g. in an NgffImage model; see also the note below on effort duplication.

If we pursue this route, then most of the functions in lib_zattrs_utils.py will become methods of the appropriate NGFF object (e.g. NgffImage or NgffWell). As an example already included in the preliminary commits of this PR, extract_zyx_pixel_sizes can be rewritten in a simpler way as a method of NgffImage. Example:

# Current status
from fractal_tasks_core.lib_zattrs_utils import extract_zyx_pixel_sizes

def yokogawa_to_ome_zarr(..):
    ...
    parameters = get_parameters_from_metadata(
        keys=[
            "original_paths",
            "num_levels",
            "coarsening_xy",
            "image_extension",
            "image_glob_patterns",
        ]
    )
    num_levels = parameters["num_levels"]
    coarsening_xy = parameters["coarsening_xy"]
    pxl_size = extract_zyx_pixel_sizes(f"{zarrurl}/.zattrs")
    ...

# This PR
import zarr

from fractal_tasks_core.lib_image import NgffImage  # to be renamed lib_ngff, see below

def yokogawa_to_ome_zarr(..):
    ...
    ngff_image = NgffImage(**zarr.open_group(zarrurl).attrs.asdict())
    num_levels = ngff_image.num_levels
    coarsening_xy = ngff_image.coarsening_xy
    pxl_size = ngff_image.get_pixel_sizes_zyx(level=0)
    ...
    parameters = get_parameters_from_metadata(
        keys=[
            "original_paths",
            "image_extension",
            "image_glob_patterns",
        ]
    )
    ...

Scope of this change

To be clear: this new lib_image.py module (to be renamed to lib_ngff.py as soon as we also include Well or something else) may become an important one, since it will also contribute to defining our compliance/support w.r.t. OME-NGFF. Some aspects:

  1. We now have multiple places in the code base where we check the same OME-Zarr property (e.g. that we only accept a single multiscale per image). This information would be more centralized if a set of validators were in place for each Well/Image/Plate model (see the validator sketch after this list).
  2. We now have a couple of checks that the NGFF version is 0.4, and we do not support anything else. In principle, the lib_ngff.py module would also be one way to simplify support for multiple versions in the future (e.g. by using NgffImageV04 or NgffImageV05 based on the version stored in the zarr attributes; see the dispatch sketch after this list).
  3. Having a stricter structure of models won't make any difference w.r.t. #150, but in my opinion it could make some internal functions simpler (see for instance the current way of extracting pixel sizes in this PR, which can "natively" look up axes and their names as object attributes).
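
As a minimal sketch of points 1 and 2 (hypothetical code, in the Pydantic-v1 style already used in this PR; NgffImageV04 and the version-lookup logic are assumptions, not the PR's actual code):

from pydantic import BaseModel, validator

class NgffImageV04(BaseModel):
    multiscales: list[dict]  # simplified; the real model uses a Multiscale type

    @validator("multiscales")
    def single_multiscale(cls, v):
        # Point 1: centralize the "exactly one multiscale per image" constraint
        if len(v) != 1:
            raise ValueError(f"Expected a single multiscale, got {len(v)}")
        return v

def load_ngff_image(attrs: dict) -> NgffImageV04:
    # Point 2: dispatch on the NGFF version stored in the zarr attributes;
    # an NgffImageV05 branch could be added here once 0.5 is supported.
    version = attrs["multiscales"][0].get("version", "0.4")
    if version != "0.4":
        raise NotImplementedError(f"NGFF version {version} is not supported")
    return NgffImageV04(**attrs)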

This PR may also affect:

A note on effort duplication

The same procedure (transforming at least a subset of the NGFF specs into Python models) has already been done elsewhere:

If/when one of these efforts grows and/or becomes the officially sanctioned Pydantic version of OME-NGFF (or something we want to rely on, for whatever reason), then we should consider using their models instead of our "custom" ones. The transition would take place by defining a FractalNgffImage class that inherits from the "official" one and adds the methods/properties needed by fractal-tasks-core tasks (e.g. it would expose a num_levels property).

Neglecting differences in attribute names, the change could be as smooth as:

# This PR

from typing import Optional

from pydantic import BaseModel, Field

# Multiscale, Axe, Dataset and Omero are the other autogenerated models
# defined in the same module.

class NgffImage(BaseModel):  # <--------- fractal-specific class
    multiscales: list[Multiscale] = Field(
        ...,
        description="The multiscale datasets for this image",
        min_items=1,
        unique_items=True,
    )
    omero: Optional[Omero] = None

    @property
    def multiscale(self) -> Multiscale:
        if len(self.multiscales) > 1:
            raise NotImplementedError(
                "Only images with one multiscale are supported "
                f"(given: {len(self.multiscales)})"
            )
        return self.multiscales[0]

    @property
    def axes(self) -> list[Axe]:
        return self.multiscale.axes

    @property
    def datasets(self) -> list[Dataset]:
        return self.multiscale.datasets

    @property
    def num_levels(self) -> int:
        return len(self.datasets)

# A possible future version

from xyz import NgffImage  # "xyz" is a placeholder for the official package

class FractalNgffImage(NgffImage):   # <--------- use official model as parent class

    @property
    def multiscale(self) -> Multiscale:
        if len(self.multiscales) > 1:
            raise NotImplementedError(
                "Only images with one multiscale are supported "
                f"(given: {len(self.multiscales)})"
            )
        return self.multiscales[0]

    @property
    def axes(self) -> list[Axe]:
        return self.multiscale.axes

    @property
    def datasets(self) -> list[Dataset]:
        return self.multiscale.datasets

    @property
    def num_levels(self) -> int:
        return len(self.datasets)

github-actions[bot] commented 11 months ago

Coverage report

The coverage rate went from 89.89% to 90.2% ⬆️ The branch rate is 83%.

98.19% of new lines are covered.

Diff Coverage details

- fractal_tasks_core/lib_ngff.py: 98.01% of new lines are covered (94.08% of the complete file). Missing lines: 414, 415, 419.
- fractal_tasks_core/lib_zattrs_utils.py: 100% of new lines are covered (96.15% of the complete file).
- fractal_tasks_core/tasks/apply_registration_to_ROI_tables.py: 100% of new lines are covered (90.17% of the complete file).
- fractal_tasks_core/tasks/apply_registration_to_image.py: 90.9% of new lines are covered (79.22% of the complete file). Missing lines: 153.
- fractal_tasks_core/tasks/calculate_registration_image_based.py: 100% of new lines are covered (91.22% of the complete file).
- fractal_tasks_core/tasks/cellpose_segmentation.py: 100% of new lines are covered (85.6% of the complete file).
- fractal_tasks_core/tasks/copy_ome_zarr.py: 100% of new lines are covered (89.42% of the complete file).
- fractal_tasks_core/tasks/create_ome_zarr.py: 100% of new lines are covered (82.9% of the complete file).
- fractal_tasks_core/tasks/illumination_correction.py: 100% of new lines are covered (82.01% of the complete file).
- fractal_tasks_core/tasks/maximum_intensity_projection.py: 100% of new lines are covered (87.01% of the complete file).
- fractal_tasks_core/tasks/napari_workflows_wrapper.py: 100% of new lines are covered (90.3% of the complete file).
- fractal_tasks_core/tasks/yokogawa_to_ome_zarr.py: 100% of new lines are covered (91.37% of the complete file).
jluethi commented 11 months ago

Great. Let's adapt the wording to make clear that we're just loading metadata, not image data.

tcompa commented 11 months ago

> Coverage report
>
> The coverage rate went from 89.8% to 89.93% ⬆️ The branch rate is 83%.
>
> 96.89% of new lines are covered.

For some unclear reason this comment is not up-to-date, and the coverage of the new lib_ngff.py module is at 98% (only three lines are missing, and they are the exception handling in load_NgffWellMeta).

tcompa commented 11 months ago

> For some unclear reason this comment is not up-to-date, and the coverage of the new lib_ngff.py module is at 98% (only three lines are missing, and they are the exception handling in load_NgffWellMeta).

The coverage comment is not working as expected, but I checked by hand that there are no visible regressions.

tcompa commented 11 months ago

This PR is essentially ready on my side. Its content can be briefly summarized as:

  1. Introduce (a first version of) Pydantic models for NGFF images and wells, based on some autogenerated ones and then cleaned up quite a bit. These models also encode several constraints that we already had in place (e.g. the fact that we only support images with a single multiscale), thus reducing the number of checks in the tasks.
  2. Extract the num_levels and coarsening_xy parameters from NGFF objects, rather than from the metadata task input.
  3. Replace several helper functions (get_axes_names, extract_zyx_pixel_sizes and get_acquisition_paths), which are now methods/properties of the Pydantic models.
  4. Load Zarr attributes from groups, rather than from .zattrs files (see the sketch after this list).
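
A minimal sketch of point 4 (the path is illustrative):

import json
import zarr

zarrurl = "/some/path/plate.zarr/B/03/0"  # illustrative path

# Before: parse the .zattrs file directly
with open(f"{zarrurl}/.zattrs") as f:
    attrs = json.load(f)

# After: read the same attributes through the zarr group API
attrs = zarr.open_group(zarrurl, mode="r").attrs.asdict()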

Relevant issues closed by this PR: #351, #501, #532, #533.

It is quite a large PR, but at least a high-level review would be very useful - cc @jluethi.

tcompa commented 11 months ago

> I added a few minor comments on the NGFF pydantic models and some questions on whether some lines should now be moved to a different part of the task

These should all be covered now.

> I didn't review the test changes.

That's OK. I'm mostly unit-testing all the new models, and I updated the existing tests that were relying on the old helper functions.

tcompa commented 11 months ago

All seems good, and we have new issues in place for the parts that we postponed (e.g. #540 or #150). Merging.