fedbiomed / fedbiomed

A collaborative learning framework for empowering biomedical research
https://fedbiomed.org

Clinician customization of dataset at add time #360

Open srcansiz opened 2 years ago

srcansiz commented 2 years ago

In GitLab by @ErwanDemairy on Sep 16, 2022, 16:42

(was SP17 - Item 01)

As a clinician, I want to be able to easily customize a dataset (via GUI or CLI, with no coding) when adding (sharing) it on the node, so as to present a more homogeneous interface to the researcher by hiding certain specificities of my node's environment, file system, file structure, etc.

As a clinician, I want to be able to re-use dataset customizations previously defined.

As a developer, I want the implementation to be generic across all datasets and all types of add-time customizations, and backward compatible with existing dataset implementations.

Implementation requirement:

Tasks:

Future extension - fully specify and implement researcher side data loading plan

Add node side DataLoadingPlan (DLP)

The goal of this task is to implement the DataLoadingPlan (DLP) mechanism in the node. The DataLoadingPlan is currently a purely node-side notion (no modification of, or control by, the researcher side at this point).

Note: the same dataset can already be shared multiple times simultaneously (e.g. using different DLPs), as long as different dataset tags are used.

Create classes DataLoadingPlan and DataPipeline

This paragraph describes the projected implementation of the DLP mechanism. The final implementation may deviate from it, but please discuss any proposed deviations with the team first.

[name=Francesco] I think we should also address the question: "Why don't we simply make a small change to the MedicalFolderDataset class?" One answer could be that we are trying to introduce a more general framework, but the reality is that we are only using it for MedicalFolderDataset for the moment. In my opinion, it is worth discussing with the rest of the team whether they are comfortable with this relatively more complex implementation.

Example of database entry of a DLP (with the ModalitiesDP() containing modality association data):

    {
        'dlp_id': my_unique_id,
        'dlp_name': my_optional_name,
        'dlp_data_type': my_str_or_enum_data_type,
        'pipelines': [
            {
                'pipeline_class': 'ModalitiesDP',
                'data': {
                    'T1': [ 'T1siemens', 'T1philips' ],
                    'T2': [ 'T2hitachi', 'T2other' ],
                    'label': [ 'label' ]
                }
            },
            ...
        ]
    }
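The projected classes could be sketched roughly as follows. This is a minimal illustration, not the final design: only the names `DataLoadingPlan`, `DataPipeline` and `ModalitiesDP` come from this issue; the constructors, attributes and `serialize()` method are assumptions chosen to produce the database entry shown above.

```python
# Hypothetical sketch of the projected classes (final implementation may differ).
# A DataPipeline holds one add-time customization; a DataLoadingPlan aggregates
# pipelines and serializes to the database entry format shown above.
import uuid

class DataPipeline:
    """Base class for a single add-time customization."""
    def __init__(self):
        self.data = {}

    def serialize(self) -> dict:
        return {'pipeline_class': type(self).__name__, 'data': self.data}

class ModalitiesDP(DataPipeline):
    """Maps canonical modality names to accepted folder-name variants."""
    def __init__(self, modality_map: dict):
        super().__init__()
        self.data = modality_map

class DataLoadingPlan:
    def __init__(self, name: str = '', data_type: str = ''):
        self.dlp_id = str(uuid.uuid4())  # placeholder id scheme
        self.name = name
        self.data_type = data_type
        self.pipelines = []

    def serialize(self) -> dict:
        return {
            'dlp_id': self.dlp_id,
            'dlp_name': self.name,
            'dlp_data_type': self.data_type,
            'pipelines': [p.serialize() for p in self.pipelines],
        }

# example: a DLP carrying the modality mapping from the entry above
dlp = DataLoadingPlan(name='siemens-philips-map', data_type='medical-folder')
dlp.pipelines.append(ModalitiesDP({'T1': ['T1siemens', 'T1philips'],
                                   'T2': ['T2hitachi', 'T2other'],
                                   'label': ['label']}))
```

Keeping serialization inside each `DataPipeline` subclass is what makes the mechanism generic: new customization types only need to provide their own `data` payload.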
GUI extension for DLP

Projected implementation: node GUI is extended with a new menu (left side bar) for handling DLPs:

Same functionality is added to the CLI.

Create MedicalFolderDataset customizations

The goal is to use the DataLoadingPlan (DLP) mechanism to:

The DLP replaces the current ad-hoc `dataset_parameters` mechanism introduced previously.

First part: mapping data modalities

The goal is to map data modalities with folders having different naming patterns to the same naming convention.

So this task focuses on the subfolders of the MedicalFolderDataset's subject folders (i.e. the folders containing the imaging data for the different modalities). Until now, we have considered the folder name to be the modality, but this is not flexible enough.

The assumption is that multiple naming patterns for the subfolders may correspond to the same modality. However, we assume that there will not be arbitrarily many naming patterns for a single modality (max 10).

We need to provide the node with a way to link the different "modality folder" names to the names used to identify data modalities.

Providing a modality mapping is mandatory (do not permit sharing a dataset without specifying it).
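A minimal sketch of how such a mapping could be applied when scanning a subject folder. The helper name and signature are illustrative only, not actual Fed-BioMed API:

```python
# Minimal sketch (assumed helper, not actual Fed-BioMed API): apply a modality
# mapping, as stored in a ModalitiesDP entry, to normalize the subfolder names
# detected under a subject folder.
def map_modality(folder_name, modality_map):
    """Return the canonical modality for a subfolder name, or None if unmapped."""
    for modality, variants in modality_map.items():
        if folder_name in variants:
            return modality
    return None

modality_map = {
    'T1': ['T1siemens', 'T1philips'],
    'T2': ['T2hitachi', 'T2other'],
    'label': ['label'],
}

map_modality('T1philips', modality_map)  # -> 'T1'
map_modality('FLAIR', modality_map)      # -> None (unmapped: reject or ignore)
```

The bounded number of naming patterns per modality (max 10) means a plain dictionary scan like this is sufficient; no pattern-matching machinery is needed.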

GUI implementation workflow

Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":

[name=Francesco] There is a bit of confusion here between the DataLoadingPlan as a generic concept and the specific DataPipeline that allows mapping the detected modalities to the retained ones. Should we also plan a more "generic" interface, which would allow us to define new DataPipelines in the future and allow the clinician to specify details through the GUI?

Second part: mapping subject indices

The goal is to match a CSV file row (a subject in the demographics file) with a subject folder name in the MedicalFolderDataset.

The current approach, using a FOLDER_NAME column in the CSV file, does not correspond to what the hospitals can implement. We need to ask the clinicians what the criterion is for matching the identifier to the folder name.
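As an illustration of the matching step (the actual criterion is still to be decided with the clinicians), the FOLDER_NAME approach could be generalized by making the key column configurable. `match_subjects` is a hypothetical helper, not existing code:

```python
# Hypothetical matching helper (the real criterion is an open question for the
# clinicians): generalize the FOLDER_NAME approach by letting the clinician
# choose which CSV column holds the subject identifier.
import csv

def match_subjects(demographics_csv, subject_folders, key_column):
    """Return {folder_name: row} for CSV rows whose key matches a subject folder."""
    matches = {}
    with open(demographics_csv, newline='') as f:
        for row in csv.DictReader(f):
            key = row[key_column].strip()
            if key in subject_folders:
                matches[key] = row
    return matches
```

More elaborate criteria (prefix matching, identifier transformations) could be plugged in at the `key` computation once the clinicians' requirements are known.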

The proposed format support is:

GUI implementation workflow

Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":

Third part: Providing good default values

The goal is to provide default values when possible, to simplify the clinician's "Add Dataset" process for MedicalFolderDataset.

GUI implementation workflow

Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":

[ ] Use only subject folders for MedicalFolder dataset
  "Found default reference/demographics csv"
  (x) Customize reference/demographics csv
    <<Select Data File>>
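One possible "good default" heuristic, sketched here as an assumption (this issue does not fix the detection rule): propose the reference/demographics CSV automatically only when the dataset root contains exactly one CSV file, and otherwise fall back to the "Customize" path above.

```python
# Assumed default-detection heuristic (not specified in this issue): propose a
# demographics CSV automatically only when the dataset root contains exactly
# one CSV file; otherwise the clinician must choose one explicitly.
from pathlib import Path

def find_default_demographics(dataset_root):
    csv_files = sorted(Path(dataset_root).glob('*.csv'))
    return csv_files[0] if len(csv_files) == 1 else None
```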

Projected future extension: researcher side data loading plan

The context is that the expected format details for a dataset are specific to an experiment. The researcher knows best how the training plan works, and thus which data format is expected. Enabling the researcher to send in-application requests for dataset formatting therefore improves data setup (vs pure node-side dataset customization) by avoiding out-of-band and duplicate clinician-researcher communication about dataset customization.

In addition to the node-side DataLoadingPlan: as a researcher, I want to ask the nodes sharing a dataset for an experiment to follow some rules when sharing it, so that the shared data are more homogeneous among nodes and setting up an experiment is easier and less error prone.

This means that these parameters will be defined only once, by the researcher. But each node keeps control (it may accept or refuse to follow the researcher's request).

Some similarity exists with the model approval process.

Draft workflow:

    # eg: set the modalities we want to be announced by the nodes; matching
    # with dataset folder names will be added on the node side
    requirements = { 'T1': [], 'T2': [], 'label': [] }

    # choose specific tags for datasets shared using these requirements
    tags = [ '#specific', '#for', '#this', '#rdlp' ]

    # apply to a dataset type
    data_type = 'medical-folder'

    # optionally restrict request to some nodes (default: broadcast)
    nodes = [ node1_id, node2_id ]

    # request nodes
    rdlp = ResearcherDataLoadingPlan(requirements, data_type, tags, nodes)
    rdlp.submit_loading_plan("my message")

    # reminder: all this needs to be done before creating the experiment,
    # where we want datasets to be shared with the specified tags
    exp = Experiment(..., tags)

- on the node side, the requested plan is added to the database table of `DataLoadingPlans` (DLP)
    - add extra field to the `DataLoadingPlan`: `status`, which is `Approved` by default for locally registered/created plans, and `Pending` by default for plans requested by the researcher
    - plans are managed same as models: approve, reject, delete.
    - discussion: other extra fields to `DataLoadingPlan` ? (idea is to make the workflow less error prone and easier for the clinician by avoiding unwanted modifications by clinician when adding dataset. Node-side always keeps the last word, as it can refuse to Approve the DLP or later Reject it)
        - `mandatory` (or `default`): if this DLP is approved and `default == True` then it is loaded by default (resp: mandatory) when adding a dataset of the same data type
        - `modifiable`: if this DLP is approved and `modifiable == False`, then the parameters set in the DLP cannot be overwritten when adding a dataset.
- on the node side, nodes willing to share the data with DLP do it:
    - use `tags` from the DLP to set dataset tags (implemented as a `DataPipeline` ?) 
    - requires GUI/CLI adaptation/extension (eg: DLP with detected modalities but no retained modalities yet, `modifiable` field, etc.)
- on the researcher side, experiment can now be created using dataset customized with DLP
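For illustration, a DLP database entry extended with the extra fields discussed above might look like this. The field names come from the bullets; the id/name values are examples only:

```python
# Illustrative extended DLP database entry with the proposed extra fields
# (status / mandatory / modifiable). Id and name values are placeholders.
dlp_entry = {
    'dlp_id': 'dlp_1234',              # example id
    'dlp_name': 'experiment-42-plan',  # example name
    'dlp_data_type': 'medical-folder',
    'status': 'Pending',     # 'Approved' for locally created plans,
                             # 'Pending' for researcher-requested plans
    'mandatory': False,      # if True, loaded by default when adding a
                             # dataset of the same data type
    'modifiable': True,      # if False, DLP parameters cannot be
                             # overwritten when adding a dataset
    'pipelines': [],
}
```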

    # on the researcher side, check how the nodes handled the request, eg:
    # - is the RDLP approved/pending/rejected ?
    # - query dataset status to see if datasets are shared with the tags
    rdlp.check_status()

    # now ready to start the experiment
    # reminder: use the same tags as for the RDLP
    exp = Experiment(..., tags)

srcansiz commented 2 years ago

In GitLab by @ErwanDemairy on Sep 16, 2022, 16:43-16:44

marked this issue as related to #340, #341, #321, #339, #320, #319, #318