MIC-DKFZ / nnUNet

Apache License 2.0

How to create custom dataset with different modalities #1986

Closed sabinevater closed 6 months ago

sabinevater commented 8 months ago

Dear team,

thank you very much for developing nnUNet!

We are currently assembling a dataset of MRI images similar to Task01 (Brain Tumor) of the Medical Segmentation Decathlon. We noticed that in your documentation for this dataset each input channel of an MRI image has its own separate .nii.gz file (see https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/dataset_format.md#dataset-folder-structure). However, when we downloaded the dataset, only one file containing all 4 modalities was present per case (i.e. files were named "BRATS_001.nii.gz", "BRATS_002.nii.gz", etc.).

Do you by chance have any information on how this was "restructured"? And a related question: if we have images with varying numbers of input channels/modalities (e.g. some have the channels FLAIR and T1w and some only the channel MRI), is this something nnUNet can handle during training?

Kind regards and a great week!

Shrajan commented 8 months ago

Dear @sabinevater,

I hope you are doing well!

The script to convert MSD datasets to the nnUNet format can be found here.

MSD datasets with 3D images, such as Task02_Heart, Task03_Liver, etc., are copied and renamed (adding '_0000' to the base name of the image files) using line 19. An example with Task02_Heart ("la_003_0000.nii.gz", "la_004_0000.nii.gz", etc.) is mentioned here.
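To illustrate the copy-and-rename step for single-channel 3D datasets, here is a minimal sketch. It is not the conversion script itself; the function name `copy_with_channel_suffix` and the folder arguments are hypothetical and only show the naming pattern.

```python
import shutil
from pathlib import Path


def copy_with_channel_suffix(src_dir, dst_dir, channel=0):
    """Copy every .nii.gz file and append a 4-digit channel suffix to its base name.

    Hypothetical helper for illustration only, not part of nnUNet.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob("*.nii.gz")):
        if src.name.startswith("."):
            continue  # skip hidden files
        case_id = src.name[: -len(".nii.gz")]
        dst = dst_dir / f"{case_id}_{channel:04d}.nii.gz"
        shutil.copy(src, dst)
        copied.append(dst.name)
    return copied
```

For example, a source folder containing "la_003.nii.gz" would yield "la_003_0000.nii.gz" in the destination folder.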

However, since the Task01_BrainTumour dataset contains images that are 4D tensors (with shapes H × W × D × Cin), each channel/modality (Cin) is extracted and saved as a 3D image. This is done in lines 24-38. Here, the 3D images will be stored with the same spacings as the original 4D image in HWD dimensions. For example, if "BRATS_001.nii.gz" has a spacing of 0.5 × 0.5 × 0.5 × 1.0 (along HWDCin), then "BRATS_001_0000.nii.gz", "BRATS_001_0001.nii.gz", "BRATS_001_0002.nii.gz", and "BRATS_001_0003.nii.gz" will all have the same spacing of 0.5 × 0.5 × 0.5 (along HWD).
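For a custom dataset, the same kind of split could be sketched as follows. This is only an illustration under assumptions, not the code from the conversion script: it assumes the third-party packages nibabel and numpy are installed and that the channel axis is the last one; the function names `channel_filenames` and `split_4d_nifti` are hypothetical.

```python
from pathlib import Path


def channel_filenames(case_id, num_channels):
    """Return one file name per channel, e.g. BRATS_001_0000.nii.gz."""
    return [f"{case_id}_{c:04d}.nii.gz" for c in range(num_channels)]


def split_4d_nifti(src_path, dst_dir):
    """Split a 4D NIfTI (H x W x D x C) into C separate 3D files.

    Hypothetical helper; assumes nibabel and numpy are available.
    """
    import nibabel as nib
    import numpy as np

    src_path, dst_dir = Path(src_path), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    img = nib.load(str(src_path))
    data = np.asanyarray(img.dataobj)
    if data.ndim != 4:
        raise ValueError(f"expected a 4D image, got {data.ndim}D")

    case_id = src_path.name[: -len(".nii.gz")]
    names = channel_filenames(case_id, data.shape[-1])
    for c, name in enumerate(names):
        # Reusing the original affine keeps the H, W, D spacing unchanged.
        channel_img = nib.Nifti1Image(data[..., c], img.affine)
        nib.save(channel_img, str(dst_dir / name))
    return names
```

Calling `split_4d_nifti("BRATS_001.nii.gz", "imagesTr")` on a 4-channel image would write "BRATS_001_0000.nii.gz" through "BRATS_001_0003.nii.gz".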

Regarding your question about handling varying numbers of input channels/modalities within the same dataset, it's important to note that nnUNet does not permit this. In fact, if you run the command plan_and_preprocess with the --verify_dataset_integrity flag, it will raise an error. Moreover, training with images having varying input channels might be counter-productive.

I hope this information is helpful for you.

sabinevater commented 8 months ago

Dear @Shrajan,

thank you very much for your feedback and all the information, I appreciate it.

Just for clarification: if e.g. the file "BRATS_001.nii.gz" is 'cut' into "BRATS_001_0000.nii.gz", "BRATS_001_0001.nii.gz", etc., each of these files contains only one modality, as I understand it. If the next cases only had "BRATS_002_0000.nii.gz" and "BRATS_003_0000.nii.gz", with "BRATS_002_0001.nii.gz" and "BRATS_003_0001.nii.gz" missing, I thought the training might still work?

Kind regards

Shrajan commented 8 months ago

Dear @sabinevater

I am glad to hear it.

I am sorry, but the training will not work with your setting. All training samples must have the same number of modalities/channels as listed in the dataset.json file. Otherwise, nnUNet will not proceed.

You could try it yourself: (1) download any dataset from the MSD website that has multiple channels, maybe a smaller one like Task05_Prostate, (2) convert it to the nnUNet format, (3) remove some of the channel files from the training folder, and (4) run the command plan_and_preprocess with the --verify_dataset_integrity flag. The execution will stop with errors.
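If you want to spot incomplete cases before running nnUNet, a rough pre-check could look like the sketch below. It is not nnUNet's own integrity check; the function name `find_inconsistent_cases` is hypothetical, and it assumes files follow the "CASE_XXXX.nii.gz" naming pattern with channels numbered from 0000.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches e.g. "BRATS_001_0000.nii.gz" -> case "BRATS_001", channel "0000".
CHANNEL_FILE = re.compile(r"^(?P<case>.+)_(?P<channel>\d{4})\.nii\.gz$")


def find_inconsistent_cases(images_dir, expected_channels):
    """Return {case_id: sorted channels found} for cases that do not have
    exactly the channels 0 .. expected_channels - 1.

    Hypothetical helper for illustration only, not part of nnUNet.
    """
    found = defaultdict(set)
    for path in Path(images_dir).iterdir():
        if path.name.startswith("."):
            continue  # skip hidden files
        match = CHANNEL_FILE.match(path.name)
        if match:
            found[match["case"]].add(int(match["channel"]))

    expected = set(range(expected_channels))
    return {
        case: sorted(channels)
        for case, channels in sorted(found.items())
        if channels != expected
    }
```

An empty result means every case has the full set of channel files; any case listed in the result is missing a channel or has an unexpected one.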