**Open** srcansiz opened this issue 2 years ago

_In GitLab by @ErwanDemairy on Sep 16, 2022: marked this issue as related to #340, #341, #321, #339, #320, #319, #318_
In GitLab by @ErwanDemairy on Sep 16, 2022, 16:42
(was SP17 - Item 01)
- As a clinician, I want to be able to easily (GUI or CLI, no coding) customize a dataset when adding (sharing) it on the node, so as to present a more homogeneous interface to the researcher by hiding certain specificities of my node's environment/file system/file structure/etc.
- As a clinician, I want to be able to re-use dataset customizations previously defined.
- As a developer, I want the implementation to be generic for all datasets and all types of add-time customizations, and backward compatible with existing dataset implementations.
Implementation requirement:

- `DataLoader`
- `DataLoadingPlan` (contributes to genericity)
Tasks:

**Add node side DataLoadingPlan**

- [ ] introduce `DataLoadingPlan`, `DataPipeline`, DLP table + adapt training-time code (issue #318)
- [ ] GUI extension for DLP (issue #320)
- [ ] CLI extension for DLP (issue #320)

**Create MedicalFolderDataset customizations for GUI and CLI**

- [ ] mapping data modalities (issues #319, #320)
- [ ] mapping subject indices (issue #321)
- [ ] providing good default values (issue #321)
- [ ] clean out MedicalFolderDataset from the `dataset_parameters` mechanism (completely replaced by DLP)
- [ ] nice to have: update dataset implementations to use `DataLoadingPlan` if useful instead of current ad-hoc solutions (eg: CSV dataset's `dtypes`?)
- [ ] nice to have: GUI permits export of a csv template (only the headers/column names) from an already loaded MedicalFolderDataset
- [ ] page for DLP view and delete (replaces #341)
- [ ] mapping subject indices and providing good default values in `MedicalFolderDataset` (replaces #321)

**Future extension - fully specify and implement researcher side data loading plan**
## Add node side DataLoadingPlan (DLP)
The goal of this task is to implement the `DataLoadingPlan` (DLP) mechanism in the node. `DataLoadingPlan` is currently a pure node side notion (no modification of the researcher side and no control from the researcher side at this point).

Note: the same dataset can already be shared multiple times simultaneously (eg using different DLPs), as long as different dataset tags are used.
### Create classes `DataLoadingPlan` and `DataPipeline`

This paragraph describes the projected implementation for the DLP mechanism. The final implementation may deviate from it, but please discuss any proposed deviations with the team first.
Introduce a `DataLoadingPlan` class. This class describes the customizations applied to a dataset for a specific "add" of the dataset. It includes:

- a unique ID identifying this DLP
- an arbitrary optional name describing the DLP
- the type of data this DLP can be applied to. Values are the same as `dataset['data_type']`, so the type is currently a `str`, but it would be a cleaner implementation to introduce an enum type.
- a list of `DataPipeline`s
- functions for saving/loading the DLP to/from the database

```python
def add_pipeline(self, dp: DataPipeline):
    self._pipelines.append(dp)
def save(self):
    ...
    for p in self._pipelines: p.save()
def load(self): ...  # need to define all needed functions for the class
```
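As a hedged sketch only, the attributes listed above could look like the following; the attribute names (`dlp_id`, `name`, `target_dataset_type`, `_pipelines`) are placeholders, not a decided interface:

```python
import uuid

class DataLoadingPlan:
    def __init__(self, name=None, target_dataset_type='medical-folder'):
        self.dlp_id = str(uuid.uuid4())    # unique ID identifying this DLP
        self.name = name                   # arbitrary optional name describing the DLP
        self.target_dataset_type = target_dataset_type  # same values as dataset['data_type']
        self._pipelines = []               # the list of DataPipeline objects

    def data_type(self) -> str:
        # accessor used by datasets to check that the DLP applies to their data type
        return self.target_dataset_type
```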
Introduce a `DataPipeline` class. A data pipeline describes an intermediary layer between the researcher and the file structure on the node. It serves as a way for the node to pre-define (at add time) a layer of indirection between what the researcher is able to manipulate and what actually exists on the node. It is usually (always?) used through a daughter class (eg: `ModalitiesDP`).

A data pipeline always contains at least save/load functions (called by the `DataLoadingPlan` save/load). A daughter data pipeline class may add more data structures and/or functions related to this specific pipeline.
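A possible shape for the base class and the `ModalitiesDP` daughter class, sketched under the assumption that each pipeline persists itself as a plain dict; method and attribute names other than `ModalitiesDP` are not fixed by this issue:

```python
class DataPipeline:
    """Add-time layer of indirection between what the researcher manipulates
    and what actually exists on the node. Used through daughter classes."""

    def save(self) -> dict:
        # called by DataLoadingPlan.save(); the class name is stored so that
        # loading knows which daughter class to instantiate
        return {'pipeline_class': self.__class__.__name__}

    def load(self, entry: dict) -> 'DataPipeline':
        # called by DataLoadingPlan.load(); restore pipeline state from a database entry
        return self


class ModalitiesDP(DataPipeline):
    """Maps on-disk modality folder names to the modality names announced to the researcher."""

    def __init__(self, modalities_map=None):
        self.modalities_map = modalities_map or {}  # eg {'T1': ['T1', 'T1w', 't1_mprage']}

    def save(self) -> dict:
        return {**super().save(), 'modalities_map': self.modalities_map}

    def load(self, entry: dict) -> 'DataPipeline':
        self.modalities_map = entry.get('modalities_map', {})
        return self
```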
Introduce database support for DLP: use the same tinyDB as for datasets and models, and create a dedicated `data_loading_plans` table for DLPs.

:::warning
:warning: When saving a data pipeline to the database, the class name of the pipeline must also be saved (when loading the pipeline, we must know which class to instantiate).
:::
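To illustrate the warning, one way (an assumption, not a decided design) to turn the stored class name back into the right pipeline object, reusing the classes sketched above:

```python
# assumed: a registry mapping stored class names back to DataPipeline daughter classes
_PIPELINE_CLASSES = {'ModalitiesDP': ModalitiesDP}

def instantiate_pipeline(entry: dict) -> DataPipeline:
    """Re-create the correct DataPipeline daughter class from a database entry."""
    cls = _PIPELINE_CLASSES[entry['pipeline_class']]
    return cls().load(entry)
```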
Example of a database entry of a DLP (with the `ModalitiesDP()` containing modality association data):
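A hypothetical entry in the `data_loading_plans` table could look like the following (all field names are assumptions, not a final schema):

```python
# hypothetical tinyDB document in the data_loading_plans table
dlp_entry = {
    'dlp_id': 'dlp_0001',
    'name': 'brain MRI folders at my hospital',
    'target_dataset_type': 'medical-folder',   # same values as dataset['data_type']
    'pipelines': [
        {
            'pipeline_class': 'ModalitiesDP',  # class to instantiate when loading
            'modalities_map': {
                'T1': ['T1', 'T1w', 't1_mprage'],
                'T2': ['T2', 'T2w'],
                'LABEL': ['label', 'seg'],
            },
        },
    ],
}
```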
:::info
This item explains the principle of interaction between `DataLoadingPlan` and dataset classes. No code/action is included in this item; implementation is covered by later tasks in this milestone.
:::

No modification of the existing datasets code is needed, as long as they don't use DLPs.
No modification is needed on the researcher side (the training plan's code for instantiating the dataset), whether or not the dataset uses a DLP. For example, for the medical image segmentation notebook the instantiation of `MedicalFolderDataset` and its usage are unchanged.

When a dataset wants to use a DLP, it needs to have a `set_dlp()` function. The function is used for attaching a data loading plan to the dataset. Example:
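A hedged sketch of what this could look like; everything except the `set_dlp()` name and the `dlp.data_type()` check described below is an assumption:

```python
class MedicalFolderDataset:
    ...

    def set_dlp(self, dlp):
        """Attach a data loading plan to this dataset."""
        if dlp.data_type() != 'medical-folder':
            # see the first point of the list below: a type mismatch is rejected
            raise ValueError('DLP data type does not match this dataset')
        self._dlp = dlp

# node-side usage (assumed): dataset.set_dlp(dlp_loaded_from_database)
```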
A dataset that wants to use a DLP must also be modified so that it interprets the content of the DLP:

- Ignore (with a warning message?) or fail when `dlp.data_type()` does not match the dataset's type.
- The dataset must know what to do with each pipeline it supports (eg: `MedicalFolderDataset` must know what to do with a `ModalitiesDP`). Ignore (with a warning message?) or fail when a pipeline does not match with the dataset.
- Eg: for `MedicalFolderDataset` and `ModalitiesDP`, the `MedicalFolderDataset.__getitem__()` must map the modalities, filter out samples that don't have all modalities after mapping, etc.
A dataset that wants to use a DLP (eg: `MedicalFolderDataset`) must also have the dataset code behind the GUI/CLI "Add" (eg: `MedicalFolderController` for the medical folder dataset) modified to create a DLP. Example:
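A hedged example of the kind of code meant here, reusing the classes sketched earlier; the function name, its arguments, and where the `dlp_id` ends up stored are assumptions:

```python
# hypothetical helper inside the node-side "Add dataset" code (eg MedicalFolderController)
def create_dlp_from_add_form(name, modalities_map):
    dlp = DataLoadingPlan(name=name, target_dataset_type='medical-folder')
    dlp.add_pipeline(ModalitiesDP(modalities_map))
    dlp.save()          # persist the DLP and its pipelines to the data_loading_plans table
    return dlp.dlp_id   # stored with the dataset entry, so it can be retrieved at training time
```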
Introduce support for DLP at training/testing time. Updating the node's `Round` with the following seems enough (to be confirmed):
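A hedged sketch of the kind of addition meant for `Round`; the `dlp_id` field in the dataset entry and the loading call are assumptions:

```python
def attach_dlp_if_any(dataset_entry: dict, training_data) -> None:
    """Hypothetical helper called from Round when building the node's training data."""
    dlp_id = dataset_entry.get('dlp_id')      # saved with the dataset entry at "Add" time
    if dlp_id is not None:
        dlp = DataLoadingPlan().load(dlp_id)  # re-read the DLP and its pipelines from the database
        training_data.set_dlp(dlp)            # the dataset then interprets the DLP content
```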
## GUI extension for DLP
Projected implementation: the node GUI is extended with a new menu (left side bar) for handling DLPs:

- a `Delete` button permits removing a DLP

The same functionality is added to the CLI.
## Create MedicalFolderDataset customizations

The goal is to use the `DataLoadingPlan` (DLP) mechanism to customize a `MedicalFolderDataset` when sharing it through the GUI or the CLI. DLP replaces the current ad-hoc `dataset_parameters` mechanism previously introduced.

### First part: mapping data modalities
The goal is to map data modalities with folders having different naming patterns to the same naming convention.
This task focuses on the subfolders of the MedicalFolderDataset's subject folders (i.e. the folders containing the imaging data for the different modalities). So far we have considered the folder name to be the modality, but this is not flexible enough.

The assumption is that multiple naming patterns for the subfolders may correspond to the same modality. However, we assume that there will not be arbitrarily many naming patterns for a single modality (max 10).

We need to provide the node a way to link the different "modality folder" names to the names used to identify data modalities.
Providing a modality mapping is mandatory (do not permit to share a dataset without specifying it).
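For illustration only, such a mapping could look like the following (a made-up example, not a prescribed format):

```python
# each modality name announced to the researcher maps to the folder name
# patterns actually found on this node (at most a handful per modality)
modalities_map = {
    'T1': ['T1', 'T1w', 't1_mprage'],
    'T2': ['T2', 'T2w'],
    'LABEL': ['label', 'seg', 'ground_truth'],
}
```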
#### GUI implementation workflow
Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":
- `(x)` is an optional check box. If clicked, a `[data loading plans]` drop-down menu appears which lists the saved data loading plans for the same `dataset['data_type']`. If there are no such saved loading plans, don't show the menu ("no saved loading plans for this data type"). When the clinician selects one data loading plan, apply this data loading plan's parameters to the current GUI page/workflow.
- The modality folder names detected in the dataset are listed (step a.). Then (step b.) the clinician creates the set of modality names he wants to use. Modality names come from a combination of:
  - pre-allowed modality names (from `MedicalFolder(Base,Controller)`) shown in a drop-down menu (eg: [T1] [T2] [T3] [labels] [genomics] etc.)
  - names freely entered by the clinician (`add modality`). Only uppercase letters and digits are allowed to limit modality name mismatch between nodes.
- If all detected modalities are already pre-allowed modalities, we can ease the process by printing "auto-detected modality setup" and proposing an optional customization of modalities (as above).
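The naming rule in the point above could be checked with something as simple as the following (a sketch, not the final validation code):

```python
import re

def is_valid_modality_name(name: str) -> bool:
    # only uppercase letters and digits, to limit modality name mismatches between nodes
    return re.fullmatch(r'[A-Z0-9]+', name) is not None
```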
### Second part: mapping subject indices
The goal is to match a csv file row (subject in demographics file) with a subject folder name in the MedicalFolderDataset.
The current approach with `FOLDER_NAME` in the csv file column's fields does not correspond to what the hospitals can implement. We need to ask the clinicians what the criterion is for matching the identifier to the folder name.

The proposed format support is:
#### GUI implementation workflow
Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":
- `[column title tags]` (+ if more than one column matches, have the clinician select one of the columns). If no column supports it, print an error message ("no matching column found"). In all cases, show the top lines of the csv file (as already done).
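A hedged sketch of how the backend could find candidate columns whose values match the subject folder names (the use of pandas and the function name are assumptions):

```python
import pandas as pd

def candidate_subject_columns(csv_path: str, subject_folders: set) -> list:
    """Columns whose values all correspond to existing subject folder names.
    Empty list: "no matching column found"; several entries: let the clinician choose one."""
    df = pd.read_csv(csv_path)
    return [col for col in df.columns
            if set(df[col].astype(str)).issubset(subject_folders)]
```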
### Third part: Providing good default values

The goal is to provide default values when possible, to simplify the clinician's "Add Dataset" process for `MedicalFolderDataset` (eg: the demographics file `participants.csv`).

#### GUI implementation workflow

Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":

- `<<Select Data File>>` permits choosing a custom csv (same as current implementation).
permits choosing a custom csv (same as current implementation).Projected future extension: researcher side data loading plan
The context is that the expected format details for a dataset are specific to an experiment. The researcher is the one who knows best how the training plan works, and thus which data format is expected. Enabling the researcher to send in-application requests for dataset formatting therefore enhances data setup (vs pure node-side dataset customization) by avoiding out-of-band and duplicate clinician-researcher communication about dataset customization.
In addition to the node-side `DataLoadingPlan`: as a researcher I want to request that nodes sharing a dataset for an experiment follow some rules when sharing the dataset, so that the shared data are more homogeneous among nodes and setting up an experiment is easier/less error-prone.

This means that these parameters will be defined only once by the researcher, but each node keeps control (it can accept or refuse to follow the researcher's request).
Some similarity exists with the model approval process.
Draft workflow:

```python
# eg: set the modalities we want to be announced by the nodes. Matching with
# dataset folder names will be added on the node side
requirements = { 'T1': [], 'T2': [], 'label': [] }

# choose specific tags for datasets shared using these requirements
tags = [ '#specific', '#for', '#this', '#rdlp' ]

# apply to a dataset type
data_type = 'medical-folder'

# optionally restrict request to some nodes (default: broadcast)
nodes = [ node1_id, node2_id ]

# request nodes
rdlp = ResearcherDataLoadingPlan(requirements, data_type, tags, nodes)
rdlp.submit_loading_plan("my message")

# reminder: all this needs to be done before creating the experiment, where we
# want datasets to be shared with the specified tags
#   exp = Experiment(..., tags)

# on the researcher side, check how nodes handled the request, eg:
# - is the RDLP approved/pending/rejected ?
# - query dataset status to see if datasets are shared with tags
rdlp.check_status()

# now ready to start the experiment
# reminder: use the same tags as for the RDLP
exp = Experiment(..., tags)
```