fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License
12 stars 5 forks source link

Mismatch between table specs and implementation, re: `instance_key` column #617

Closed tcompa closed 7 months ago

tcompa commented 7 months ago

What we wrote in the table specs is that for tables of type masking_roi_table and feature_table the following condition is needed

table must include the column which is defined in its instance_key attribute (e.g. the label column

This is a very natural way of storing the information that links to another table, but it does not correspond to the current fractal-tasks-core implementation, where the list of labels is stored within the obs attribute of the AnnData object - see for instance these snippet where we write/read the table:

# Write table (and labels)
            # Extract labels and drop them from df_well
            labels = pd.DataFrame(df_well["label"].astype(str))
            df_well.drop(labels=["label"], axis=1, inplace=True)
            # Convert all to float (warning: some would be int, in principle)
            measurement_dtype = np.float32
            df_well = df_well.astype(measurement_dtype)
            df_well.index = df_well.index.map(str)
            # Convert to anndata
            measurement_table = ad.AnnData(df_well, dtype=measurement_dtype)
            measurement_table.obs = labels

# Read labels
    column_name = attrs["instance_key"]
    # Check that ROI_table.obs has the right column and extract label_value
    if column_name not in ROI_table.obs.columns:
        raise ValueError(
            'In _preprocess_input, "{column_name}" '
            f" missing in {ROI_table.obs.columns=}"
        )
    label_value = int(ROI_table.obs[column_name][ROI_positional_index])

I'm not 100% sure of why we made this choice in the past, but my vague memory is that it was related to the fact that we wanted labels to be string and maybe it was hard to place them in the AnnData X attribute.


What should we do? (cc @jluethi)

Two options:

  1. Keep current implementation and update the specs, saying the labels (or whatever is defined as instance_key) go in the obs attribute. This is the one that takes least effort.
  2. Update the implementation and store labels as an additional column. This may or may not be possible, if we want to guarantee that labels are string - TBD. The most annoying part would obviously be that we must be backwards compatible, and also accept labels which are stored in the obs attribute.
jluethi commented 7 months ago

Ah, we missed that one!

In the original table spec proposal, it said:

"instance_key", which is the key in obs that denotes which instance in "region" the row corresponds to

Thus, we were using obs because the proposal was using obs. Also, given that labels should be integers (which can be stored as strings fine), I guess saving them as floats can be suboptimal (though could be handled with casting to int with correct rounding).

In any case, having labels in obs seems to be a standard way to use AnnData. The reasoning being that this is not a measurement of the object, but something that describes which object is measured => should go in obs.

Thus, let's go with 1:

  1. Keep current implementation and update the specs, saying the labels (or whatever is defined as instance_key) go in the obs attribute. This is the one that takes least effort.