**Open** srcansiz opened this issue 2 years ago

_In GitLab by @ErwanDemairy on Sep 16, 2022: marked this issue as related to #340, #341, #321, #339, #320, #319, #318_
In GitLab by @ErwanDemairy on Sep 16, 2022, 16:42
(was SP17 - Item 01)
- As a clinician, I want to be able to easily (GUI or CLI, no coding) customize a dataset when adding (sharing) it on the node, so as to present a more homogeneous interface to the researcher by hiding certain specificities of my node's environment/file system/file structure/etc.
- As a clinician, I want to be able to re-use dataset customizations previously defined.
- As a developer, I want the implementation to be generic for all datasets and all types of add-time customizations, and backward compatible with existing dataset implementations.
Implementation requirement:

- `DataLoader`
- `DataLoadingPlan` (contributes to genericity)
Tasks:

**Add node side DataLoadingPlan**

- [ ] introduce `DataLoadingPlan`, `DataPipeline`, DLP table + adapt training-time code (issue #318)
- [ ] GUI extension for DLP (issue #320)
- [ ] CLI extension for DLP (issue #320)

**Create MedicalFolderDataset customizations for GUI and CLI**

- [ ] mapping data modalities (issues #319, #320)
- [ ] mapping subject indices (issue #321)
- [ ] providing good default values (issue #321)
- [ ] clean out MedicalFolderDataset from the `dataset_parameters` mechanism (completely replaced by DLP)
- [ ] nice to have: update dataset implementations to use `DataLoadingPlan` if useful instead of current ad-hoc solutions (eg: CSV dataset's `dtypes`?)
- [ ] nice to have: GUI permits export of a csv template (only the headers/column names) from an already loaded MedicalFolderDataset
- [ ] page for DLP view and delete (replaces #341)
- [ ] mapping subject indices and providing good default values in `MedicalFolderDataset` (replaces #321)

**Future extension - fully specify and implement researcher side data loading plan**
## Add node side DataLoadingPlan (DLP)
The goal of this task is to implement the `DataLoadingPlan` (DLP) mechanism in the node. `DataLoadingPlan` is currently a pure node side notion (no modification of the researcher side and no control from the researcher side at this point).

Note: the same dataset can already be shared multiple times simultaneously (eg using different DLPs), as long as different dataset tags are used.
### Create classes `DataLoadingPlan` and `DataPipeline`

This paragraph describes the projected implementation for the DLP mechanism. The final implementation may deviate from it, but please discuss any proposed deviations with the team first.
Introduce a `DataLoadingPlan` class. This class describes the customizations applied to a dataset for a specific "add" of the dataset. It includes:

- a unique ID identifying this DLP
- an arbitrary optional name describing the DLP
- the type of data this DLP can be applied to. Values are the same as `dataset['data_type']`, so the type is currently a `str`, but it would be a cleaner implementation to introduce an enum type.
- a list of `DataPipeline`s
- functions for saving/loading the DLP to/from the database

```python
def add_pipeline(self, dp: DataPipeline):
    self._pipelines.append(dp)
def save(self):
    ...
    for p in self._pipelines: p.save()
def load(self): ...  # need to define all needed functions for the class
```
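As a hedged sketch only, the attributes listed above could look like the following; the attribute names (`dlp_id`, `name`, `target_dataset_type`, `_pipelines`) are placeholders, not a decided interface:

```python
import uuid

class DataLoadingPlan:
    def __init__(self, name=None, target_dataset_type='medical-folder'):
        self.dlp_id = str(uuid.uuid4())    # unique ID identifying this DLP
        self.name = name                   # arbitrary optional name describing the DLP
        self.target_dataset_type = target_dataset_type  # same values as dataset['data_type']
        self._pipelines = []               # the list of DataPipeline objects

    def data_type(self) -> str:
        # accessor used by datasets to check that the DLP applies to their data type
        return self.target_dataset_type
```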
Introduce a `DataPipeline` class. A data pipeline describes an intermediary layer between the researcher and the file structure on the node. It serves as a way for the node to pre-define (at add time) a layer of indirection between what the researcher is able to manipulate and what actually exists on the node. It is usually (always?) used through a daughter class (eg: `ModalitiesDP`).

A data pipeline always contains at least save/load functions (called by the `DataLoadingPlan` save/load). A daughter data pipeline class may add more data structures and/or functions related to this specific pipeline.
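A possible shape for the base class and the `ModalitiesDP` daughter class, sketched under the assumption that each pipeline persists itself as a plain dict; method and attribute names other than `ModalitiesDP` are not fixed by this issue:

```python
class DataPipeline:
    """Add-time layer of indirection between what the researcher manipulates
    and what actually exists on the node. Used through daughter classes."""

    def save(self) -> dict:
        # called by DataLoadingPlan.save(); the class name is stored so that
        # loading knows which daughter class to instantiate
        return {'pipeline_class': self.__class__.__name__}

    def load(self, entry: dict) -> 'DataPipeline':
        # called by DataLoadingPlan.load(); restore pipeline state from a database entry
        return self


class ModalitiesDP(DataPipeline):
    """Maps on-disk modality folder names to the modality names announced to the researcher."""

    def __init__(self, modalities_map=None):
        self.modalities_map = modalities_map or {}  # eg {'T1': ['T1', 'T1w', 't1_mprage']}

    def save(self) -> dict:
        return {**super().save(), 'modalities_map': self.modalities_map}

    def load(self, entry: dict) -> 'DataPipeline':
        self.modalities_map = entry.get('modalities_map', {})
        return self
```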
Introduce database support for DLP: use the same tinyDB as for datasets and models, and create a dedicated `data_loading_plans` table for DLPs.

:::warning
:warning: When saving a data pipeline to the database, the class name of the pipeline must also be saved (when loading the pipeline, we must know which class to instantiate).
:::
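To illustrate the warning, one way (an assumption, not a decided design) to turn the stored class name back into the right pipeline object, reusing the classes sketched above:

```python
# assumed: a registry mapping stored class names back to DataPipeline daughter classes
_PIPELINE_CLASSES = {'ModalitiesDP': ModalitiesDP}

def instantiate_pipeline(entry: dict) -> DataPipeline:
    """Re-create the correct DataPipeline daughter class from a database entry."""
    cls = _PIPELINE_CLASSES[entry['pipeline_class']]
    return cls().load(entry)
```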
Example of a database entry of a DLP (with the `ModalitiesDP()` containing modality association data):
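A hypothetical entry in the `data_loading_plans` table could look like the following (all field names are assumptions, not a final schema):

```python
# hypothetical tinyDB document in the data_loading_plans table
dlp_entry = {
    'dlp_id': 'dlp_0001',
    'name': 'brain MRI folders at my hospital',
    'target_dataset_type': 'medical-folder',   # same values as dataset['data_type']
    'pipelines': [
        {
            'pipeline_class': 'ModalitiesDP',  # class to instantiate when loading
            'modalities_map': {
                'T1': ['T1', 'T1w', 't1_mprage'],
                'T2': ['T2', 'T2w'],
                'LABEL': ['label', 'seg'],
            },
        },
    ],
}
```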
:::info
This item explains the principle of interaction between `DataLoadingPlan` and dataset classes. No code/action is included in this item; implementation is covered by later tasks in this milestone.
:::

No modification of the existing datasets code is needed, as long as they don't use DLPs.
No modification is needed on the researcher side (the training plan's code for instantiating the dataset), whether or not the dataset uses a DLP. For example, for the medical image segmentation notebook the instantiation of `MedicalFolderDataset` and its usage are unchanged.

When a dataset wants to use a DLP, it needs to have a `set_dlp()` function. The function is used for attaching a data loading plan to the dataset. Example:
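A hedged sketch of what this could look like; everything except the `set_dlp()` name and the `dlp.data_type()` check described below is an assumption:

```python
class MedicalFolderDataset:
    ...

    def set_dlp(self, dlp):
        """Attach a data loading plan to this dataset."""
        if dlp.data_type() != 'medical-folder':
            # see the first point of the list below: a type mismatch is rejected
            raise ValueError('DLP data type does not match this dataset')
        self._dlp = dlp

# node-side usage (assumed): dataset.set_dlp(dlp_loaded_from_database)
```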
A dataset that wants to use a DLP must also be modified so that it interprets the content of the DLP:

- Ignore (with a warning message?) or fail when `dlp.data_type()` does not match the dataset's type.
- The dataset must know what to do with each pipeline it supports (eg: `MedicalFolderDataset` must know what to do with a `ModalitiesDP`). Ignore (with a warning message?) or fail when a pipeline does not match with the dataset.
- Eg: for `MedicalFolderDataset` and `ModalitiesDP`, the `MedicalFolderDataset.__getitem__()` must map the modalities, filter out samples that don't have all modalities after mapping, etc.
A dataset that wants to use a DLP (eg: `MedicalFolderDataset`) must also have the dataset code behind the GUI/CLI "Add" (eg: `MedicalFolderController` for the medical folder dataset) modified to create a DLP. Example:
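A hedged example of the kind of code meant here, reusing the classes sketched earlier; the function name, its arguments, and where the `dlp_id` ends up stored are assumptions:

```python
# hypothetical helper inside the node-side "Add dataset" code (eg MedicalFolderController)
def create_dlp_from_add_form(name, modalities_map):
    dlp = DataLoadingPlan(name=name, target_dataset_type='medical-folder')
    dlp.add_pipeline(ModalitiesDP(modalities_map))
    dlp.save()          # persist the DLP and its pipelines to the data_loading_plans table
    return dlp.dlp_id   # stored with the dataset entry, so it can be retrieved at training time
```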
Introduce support for DLP at training/testing time. Updating the node's `Round` with the following seems enough (to be confirmed):
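A hedged sketch of the kind of addition meant for `Round`; the `dlp_id` field in the dataset entry and the loading call are assumptions:

```python
def attach_dlp_if_any(dataset_entry: dict, training_data) -> None:
    """Hypothetical helper called from Round when building the node's training data."""
    dlp_id = dataset_entry.get('dlp_id')      # saved with the dataset entry at "Add" time
    if dlp_id is not None:
        dlp = DataLoadingPlan().load(dlp_id)  # re-read the DLP and its pipelines from the database
        training_data.set_dlp(dlp)            # the dataset then interprets the DLP content
```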
## GUI extension for DLP
Projected implementation: the node GUI is extended with a new menu (left side bar) for handling DLPs:

- a `Delete` button permits removing a DLP

The same functionality is added to the CLI.
## Create MedicalFolderDataset customizations

The goal is to use the `DataLoadingPlan` (DLP) mechanism to customize a `MedicalFolderDataset` when sharing it through the GUI or the CLI. DLP replaces the current ad-hoc `dataset_parameters` mechanism previously introduced.

### First part: mapping data modalities
The goal is to map data modalities with folders having different naming patterns to the same naming convention.
This task focuses on the subfolders of the MedicalFolderDataset's subject folders (i.e. the folders containing the imaging data for the different modalities). So far we have considered the folder name to be the modality, but this is not flexible enough.

The assumption is that multiple naming patterns for the subfolders may correspond to the same modality. However, we assume that there will not be arbitrarily many naming patterns for a single modality (max 10).

We need to provide the node a way to link the different "modality folder" names to the names used to identify data modalities.
Providing a modality mapping is mandatory (do not permit to share a dataset without specifying it).
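For illustration only, such a mapping could look like the following (a made-up example, not a prescribed format):

```python
# each modality name announced to the researcher maps to the folder name
# patterns actually found on this node (at most a handful per modality)
modalities_map = {
    'T1': ['T1', 'T1w', 't1_mprage'],
    'T2': ['T2', 'T2w'],
    'LABEL': ['label', 'seg', 'ground_truth'],
}
```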
#### GUI implementation workflow
Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":
- `(x)` is an optional check box. If clicked, a `[data loading plans]` drop-down menu appears which lists the saved data loading plans for the same `dataset['data_type']`. If there are no such saved loading plans, don't show the menu ("no saved loading plans for this data type"). When the clinician selects one data loading plan, apply this data loading plan's parameters to the current GUI page/workflow.
- The modality folder names detected in the dataset are listed (step a.). Then (step b.) the clinician creates the set of modality names he wants to use. Modality names come from a combination of:
  - pre-allowed modality names (from `MedicalFolder(Base,Controller)`) shown in a drop-down menu (eg: [T1] [T2] [T3] [labels] [genomics] etc.)
  - names freely entered by the clinician (`add modality`). Only uppercase letters and digits are allowed to limit modality name mismatch between nodes.
- If all detected modalities are already pre-allowed modalities, we can ease the process by printing "auto-detected modality setup" and proposing an optional customization of modalities (as above).
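The naming rule in the point above could be checked with something as simple as the following (a sketch, not the final validation code):

```python
import re

def is_valid_modality_name(name: str) -> bool:
    # only uppercase letters and digits, to limit modality name mismatches between nodes
    return re.fullmatch(r'[A-Z0-9]+', name) is not None
```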
### Second part: mapping subject indices
The goal is to match a csv file row (subject in demographics file) with a subject folder name in the MedicalFolderDataset.
The current approach with `FOLDER_NAME` in the csv file column's fields does not correspond to what the hospitals can implement. We need to ask the clinicians what the criterion is for matching the identifier to the folder name.

The proposed format support is:
#### GUI implementation workflow
Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":
- `[column title tags]` (+ if more than one column matches, have the clinician select one of the columns). If no column supports it, print an error message ("no matching column found"). In all cases, show the top lines of the csv file (as already done).
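A hedged sketch of how the backend could find candidate columns whose values match the subject folder names (the use of pandas and the function name are assumptions):

```python
import pandas as pd

def candidate_subject_columns(csv_path: str, subject_folders: set) -> list:
    """Columns whose values all correspond to existing subject folder names.
    Empty list: "no matching column found"; several entries: let the clinician choose one."""
    df = pd.read_csv(csv_path)
    return [col for col in df.columns
            if set(df[col].astype(str)).issubset(subject_folders)]
```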
### Third part: Providing good default values

The goal is to provide default values when possible, to simplify the clinician's "Add Dataset" process for `MedicalFolderDataset` (eg: the demographics file `participants.csv`).

#### GUI implementation workflow

Projected implementation: GUI workflow modifications for "Add new dataset > Medical Folder Dataset":

- `<<Select Data File>>` permits choosing a custom csv (same as current implementation).
permits choosing a custom csv (same as current implementation).Projected future extension: researcher side data loading plan
The context is that the expected format details for a dataset are specific to an experiment. The researcher is the one who knows best how the training plan works, and thus which data format is expected. Enabling the researcher to send in-application requests for dataset formatting therefore enhances data setup (vs pure node-side dataset customization) by avoiding out-of-band and duplicate clinician-researcher communication about dataset customization.
In addition to the node-side `DataLoadingPlan`: as a researcher I want to request that nodes sharing a dataset for an experiment follow some rules when sharing the dataset, so that the shared data are more homogeneous among nodes and setting up an experiment is easier/less error-prone.

This means that these parameters will be defined only once by the researcher, but each node keeps control (it can accept or refuse to follow the researcher's request).
Some similarity exists with the model approval process.
Draft workflow:

```python
# eg: set the modalities we want to be announced by the nodes. Matching with
# dataset folder names will be added on the node side
requirements = { 'T1': [], 'T2': [], 'label': [] }

# choose specific tags for datasets shared using these requirements
tags = [ '#specific', '#for', '#this', '#rdlp' ]

# apply to a dataset type
data_type = 'medical-folder'

# optionally restrict request to some nodes (default: broadcast)
nodes = [ node1_id, node2_id ]

# request nodes
rdlp = ResearcherDataLoadingPlan(requirements, data_type, tags, nodes)
rdlp.submit_loading_plan("my message")

# reminder: all this needs to be done before creating the experiment, where we
# want datasets to be shared with the specified tags
#   exp = Experiment(..., tags)

# on the researcher side, check how nodes handled the request, eg:
# - is the RDLP approved/pending/rejected ?
# - query dataset status to see if datasets are shared with tags
rdlp.check_status()

# now ready to start the experiment
# reminder: use the same tags as for the RDLP
exp = Experiment(..., tags)
```