Expand scope of `write_elem` wrapper

tcompa commented 1 year ago

On the write_elem wrapper: can we take this opportunity to make this a broader wrapper that handles more of the table writing? write_elem seems like a low level wrapper function. Would be great if tasks can just call a write_table or something like that with some parameters. Currently, many task replicate similar handling of the table metadata, updating the list of tables, checking for existing tables etc.

This would be a good opportunity to generalize this and have a useful wrapper that sets the correct metadata for those tables, but that someone building a new task then doesn’t have to think about.

Originally posted by @jluethi in https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/485#issuecomment-1697036561

tcompa commented 1 year ago

Here are the typical ways we create tables. Let's see if/what we can abstract out of them.

Create-ome-zarr task

            # Create tables zarr group for ROI tables
            group_tables = group_image.create_group("tables/")  # noqa: F841
            well_id = row + column

            # Prepare AnnData tables for FOV/well ROIs
            FOV_ROIs_table = prepare_FOV_ROI_table(site_metadata.loc[well_id])
            well_ROIs_table = prepare_well_ROI_table(
                site_metadata.loc[well_id]
            )

            # Write AnnData tables in the tables zarr group
            write_elem(group_tables, "FOV_ROI_table", FOV_ROIs_table)
            write_elem(group_tables, "well_ROI_table", well_ROIs_table)
            group_tables.attrs["tables"] = ["FOV_ROI_table", "well_ROI_table"]

Copy-ome-zarr task

                    # Copy the tables in ROI_table_names
                    for ROI_table_name in ROI_table_names:

                        logger.info(
                            f"I will now read {ROI_table_name} from "
                            f"{zarrurl_old=}, convert it to 2D, and "
                            "write it back to the new zarr file."
                        )
                        ROI_table = ad.read_zarr(
                            f"{zarrurl_old}/{well_path}/{image_path}/"
                            f"tables/{ROI_table_name}"
                        )
                        # Convert 3D FOVs to 2D
                        if project_to_2D:
                            new_ROI_table = convert_ROIs_from_3D_to_2D(
                                ROI_table, pxl_size_z
                            )
                        # Write new table
                        write_elem(
                            new_tables_group, ROI_table_name, new_ROI_table
                        )

Cellpose task

    if output_ROI_table:
        # Concatenate all ROI dataframes
        df_well = pd.concat(bbox_dataframe_list, axis=0, ignore_index=True)
        df_well.index = df_well.index.astype(str)
        # Extract labels and drop them from df_well
        labels = pd.DataFrame(df_well["label"].astype(str))
        df_well.drop(labels=["label"], axis=1, inplace=True)
        # Convert all to float (warning: some would be int, in principle)
        bbox_dtype = np.float32
        df_well = df_well.astype(bbox_dtype)
        # Convert to anndata
        bbox_table = ad.AnnData(df_well, dtype=bbox_dtype)
        bbox_table.obs = labels
        # Write to zarr group
        group_tables = zarr.group(f"{in_path}/{component}/tables/")
        write_elem(group_tables, output_ROI_table, bbox_table)
        logger.info(
            "Bounding box ROI table written to "
            f"{in_path}/{component}/tables/{output_ROI_table}"
        )

Napari-workflows task

    # Output handling: "dataframe" type (for each output, concatenate ROI
    # dataframes, clean up, and store in a AnnData table on-disk)
    for (name, out_params) in dataframe_outputs:
        table_name = out_params.table_name
        # Concatenate all FOV dataframes
        list_dfs = output_dataframe_lists[name]
        if len(list_dfs) == 0:
            measurement_table = ad.AnnData()
        else:
            df_well = pd.concat(list_dfs, axis=0, ignore_index=True)
            # Extract labels and drop them from df_well
            labels = pd.DataFrame(df_well["label"].astype(str))
            df_well.drop(labels=["label"], axis=1, inplace=True)
            # Convert all to float (warning: some would be int, in principle)
            measurement_dtype = np.float32
            df_well = df_well.astype(measurement_dtype)
            df_well.index = df_well.index.map(str)
            # Convert to anndata
            measurement_table = ad.AnnData(df_well, dtype=measurement_dtype)
            measurement_table.obs = labels
        # Write to zarr group
        group_tables = zarr.group(f"{in_path}/{component}/tables/")
        write_elem(group_tables, table_name, measurement_table)
        # Update OME-NGFF metadata
        current_tables = group_tables.attrs.asdict().get("tables") or []
        if table_name in current_tables:
            # FIXME: move this check to an earlier stage of the task
            raise ValueError(
                f"{in_path}/{component}/tables/ already includes "
                f"{table_name=} in {current_tables=}"
            )
        new_tables = current_tables + [table_name]
        group_tables.attrs["tables"] = new_tables
        # FIXME: also include table metadata, see issue #333

tcompa commented 1 year ago

Something that clearly can be moved into write_table is the creation of the tables group (if needed), and the check on whether the required subgroup already exists (to be handled via overwrite). EDIT: also the tables metadata are always handled in the same way.

This means that the first version of this function could look like

def write_table(image_group, table_name, table, overwrite=False):
    # Create tables group, if needed
    ...
    # Check whether subgroup already exists, and proceed accordingly
    ...
   # If it's all OK, proceed and write the table
   ...
   # Update the `tables` metadata of the image group
   ...

Let's see if we can include more general features in this function.

tcompa commented 1 year ago

The first prototype is available through https://github.com/fractal-analytics-platform/fractal-tasks-core/commit/f4f127e59de78a8aa0367e9628af306b817dbb19, comments on the scope are welcome.

jluethi commented 1 year ago

I like the approach to write_table a lot, especially that handling all the metadata attributes now go into a wrapper, not the task function. Will make it much easier to be consistent on those. I don't fully understand yet how all of them are set and whether we handle e.g. well_ROI_table correctly (see comments). But great that we'll have one place to handle that.

tcompa commented 1 year ago

I improved the prototype function (including your suggestion) with 9e5b5e8 and 45b4d50fcb887d85d1eba21582a3d59742e0b88f, see current status of PR #499. Let me know if something is still unclear or can be improved.

More changes may come from actually integrating this function in the tasks.

jluethi commented 1 year ago

Looks good to me! We can then work more on this wrapper when we formalize ROIs more (e.g. what metadata should our default ROI tables have?). It's already a great first step towards it :)

fractal-analytics-platform / fractal-tasks-core