Cannot run final stage of cascade

MIC-DKFZ / nnUNet

Apache License 2.0

Cannot run final stage of cascade #9

Closed JunMa11 closed 5 years ago

JunMa11 commented 5 years ago

Dear Fabian,

Thanks for the great repo.

I want to use the 3D U-Net Cascade. First, I ran the low-resolution stage for folds 0 to 4:

for FOLD in 0 1 2 3 4; do python run/run_training.py 3d_lowres nnUNetTrainer TaskXX_MY_DATASET $FOLD --ndet; done

and nnUNet generated predictions of the validation set in each folder.

Then I ran

python run/run_training.py 3d_cascade_fullres nnUNetTrainerCascadeFullRes TaskXX_MY_DATASET 0 --ndet

but I got the following error:

###############################################
Traceback (most recent call last):
  File "run/run_training.py", line 90, in <module>
    batch_dice=batch_dice, stage=stage, unpack_data=unpack, deterministic=deterministic)
  File "/home/jma/Code/nnUNet/nnunet/training/network_training/nnUNetTrainerCascadeFullRes.py", line 31, in __init__
    "Cannot run final stage of cascade. Run corresponding 3d_lowres first and predict the "
RuntimeError: Cannot run final stage of cascade. Run corresponding 3d_lowres first and predict the segmentations for the next stage

I'm confused about the error, because the segmentations have been generated automatically during the 3d_lowres step (in validation folder).

Could you give me some insight into this error?

Looking forward to your reply. Best, Jun

JunMa11 commented 5 years ago

The following is the full log:

Please cite the following paper when using nnUNet:

Isensee, Fabian, et al. "nnU-Net: Breaking the Spell on Successful Medical Image Segmentation." arXiv preprint arXiv:1904.08128 (2019).

If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet
###############################################
I am running the following nnUNet: 3d_cascade_fullres
My trainer class is:  <class 'nnunet.training.network_training.nnUNetTrainerCascadeFullRes.nnUNetTrainerCascadeFullRes'>
For that I will be using the following configuration:
num_classes:  2
modalities:  {0: 'MR'}
use_mask_for_norm OrderedDict([(0, False)])
keep_only_largest_region OrderedDict([((1, 2), False), ((2,), False), ((1,), False)])
min_region_size_per_class OrderedDict([(1, 0.3988037109375), (2, 78.30931661574891)])
min_size_per_class OrderedDict([(1, 404322.9441427806), (2, 160.876152621963)])
normalization_schemes OrderedDict([(0, 'nonCT')])
stages...

stage:  0
{'batch_size': 2, 'num_pool_per_axis': [3, 5, 5], 'patch_size': array([ 48, 192, 192]), 'median_patient_size_in_voxels': array([ 81, 297, 324]), 'current_spacing': array([2.5       , 1.11122066, 1.11122066]), 'original_spacing': array([2.5       , 0.70310003, 0.70310003]), 'do_dummy_2D_data_aug': True, 'pool_op_kernel_sizes': [[1, 2, 2], [1, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[1, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}

stage:  1
{'batch_size': 2, 'num_pool_per_axis': [3, 5, 5], 'patch_size': array([ 32, 224, 224]), 'median_patient_size_in_voxels': array([ 81, 469, 512]), 'current_spacing': array([2.5       , 0.70310003, 0.70310003]), 'original_spacing': array([2.5       , 0.70310003, 0.70310003]), 'do_dummy_2D_data_aug': True, 'pool_op_kernel_sizes': [[1, 2, 2], [1, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[1, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}

I am using stage 1 from these plans
I am using batch dice + CE loss

I am using data from this folder:  /home/jma/scratch/data/pre_data/TaskXX_MY_DATASET/nnUNet
###############################################
Traceback (most recent call last):
  File "run/run_training.py", line 90, in <module>
    batch_dice=batch_dice, stage=stage, unpack_data=unpack, deterministic=deterministic)
  File "/home/jma/Code/nnUNet/nnunet/training/network_training/nnUNetTrainerCascadeFullRes.py", line 31, in __init__
    "Cannot run final stage of cascade. Run corresponding 3d_lowres first and predict the "
RuntimeError: Cannot run final stage of cascade. Run corresponding 3d_lowres first and predict the segmentations for the next stage

FabianIsensee commented 5 years ago

Hi Jun,

        if network == '3d_lowres':
            trainer.load_best_checkpoint(False)
            print("predicting segmentations for the next stage of the cascade")
            predict_next_stage(trainer, join(dataset_directory, trainer.plans['data_identifier'] + "_stage%d" % 1))

This is an excerpt from run_training.py; it appears at the very bottom of the script. As you can see, predict_next_stage is called if your network is '3d_lowres', so the segmentations should have been created. However, the predictions of the validation set are not what is used for the next stage of the cascade. There should be another folder, "segs_from_prev_stage" (or similar), in the folder where the "fold_X" subfolders are. I don't know why it is missing. Could you please run

python run/run_training.py 3d_lowres nnUNetTrainer TaskXX_MY_DATASET 0 --ndet -val

and tell me what the output is? Also please check if the missing folder will be created. Best, Fabian
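To see which folders sit next to the "fold_X" subfolders, a small helper like the following can be used. This is only a sketch: the function name list_non_fold_dirs and the example path are hypothetical, and the exact name of the folder holding the next-stage segmentations is not assumed, so the helper simply lists every subfolder that is not a fold folder.

```python
from pathlib import Path


def list_non_fold_dirs(output_dir):
    """Return the names of all subfolders of output_dir that are not fold_X folders."""
    root = Path(output_dir)
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and not p.name.startswith("fold_")
    )


if __name__ == "__main__":
    import sys
    # Pass the trainer output folder (the one containing fold_0, fold_1, ...)
    # as the first argument; the path is specific to your setup.
    if len(sys.argv) > 1:
        print(list_non_fold_dirs(sys.argv[1]))
```

If the list is empty after the 3d_lowres run has finished, the predictions for the next stage were not written.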

JunMa11 commented 5 years ago

Hi Fabian, Thanks for your quick reply.

I ran 3 different folds, but all of them fail with the following error:

train_704
separate z: True lowres axis [0]
separate z
train_704 (2, 84, 315, 315)
debug: mirroring True mirror_axes (0, 1, 2)
train_828
separate z: False lowres axis None
train_828 (2, 83, 324, 324)
debug: mirroring True mirror_axes (0, 1, 2)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/jma/anaconda3/envs/torch10/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/jma/anaconda3/envs/torch10/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/PytorchCode/nnUNet0501/nnunet/inference/segmentation_export.py", line 100, in save_segmentation_nifti_from_softmax
    bbox[c][1] = np.min((bbox[c][0] + seg_old_spacing.shape[c], shape_original_before_cropping[c]))
TypeError: 'int' object is not subscriptable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run/run_training.py", line 111, in <module>
separate z: False lowres axis None
    trainer.validate(save_softmax=args.npz, validation_folder_name=val_folder)
  File "/jaylabs/amartel_data2/liver_MRI/GadSurgical/LiverTumorSeg/PytorchCode/nnUNet0501/nnunet/training/network_training/nnUNetTrainer.py", line 497, in validate
    _ = [i.get() for i in results]
  File "/PytorchCode/nnUNet0501/nnunet/training/network_training/nnUNetTrainer.py", line 497, in <listcomp>
    _ = [i.get() for i in results]
  File "/home/jma/anaconda3/envs/torch10/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
TypeError: 'int' object is not subscriptable

FabianIsensee commented 5 years ago

Seems like the .pkl that accompanies your preprocessed data is corrupted. I am not in the office right now. It would help me a lot if you could send me the file (f.isensee at dkfz.de). It should be located in the folder where your preprocessed data is (train_828.pkl or something similar). You can also try to run the preprocessing again. Best, Fabian

JunMa11 commented 5 years ago

Hi @FabianIsensee ,

Thanks for your help. I sent you the email with train_828.pkl.

Meanwhile, I re-ran plan_and_preprocess_task.py and then ran

python run/run_training.py 3d_lowres nnUNetTrainer TaskXX_MY_DATASET 0 -val --ndet

but the same error occurred again. train_649.pkl can be downloaded here.

train_629
separate z: False lowres axis None
train_629 (2, 88, 360, 360)
debug: mirroring True mirror_axes (0, 1, 2)
train_649
separate z: True lowres axis [0]
separate z
train_649 (2, 103, 396, 396)
debug: mirroring True mirror_axes (0, 1, 2)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/jma/anaconda3/envs/torch10/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/jma/anaconda3/envs/torch10/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/nnunet/inference/segmentation_export.py", line 100, in save_segmentation_nifti_from_softmax
    bbox[c][1] = np.min((bbox[c][0] + seg_old_spacing.shape[c], shape_original_before_cropping[c]))
TypeError: 'int' object is not subscriptable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run/run_training.py", line 111, in <module>
    trainer.validate(save_softmax=args.npz, validation_folder_name=val_folder)
  File "/nnunet/training/network_training/nnUNetTrainer.py", line 497, in validate
    _ = [i.get() for i in results]
  File "/nnunet/training/network_training/nnUNetTrainer.py", line 497, in <listcomp>
    _ = [i.get() for i in results]
  File "/home/jma/anaconda3/envs/torch10/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
TypeError: 'int' object is not subscriptable

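I also checked the original nii.gz files in /imagesTr and /labelsTr with the following code:

import os
import nibabel as nb

path = ''  # folder to check, e.g. the imagesTr or labelsTr folder
names = sorted(os.listdir(path))

for name in names:
    # Loading the full array also catches files that cannot be read
    data = nb.load(os.path.join(path, name)).get_fdata()
    print(name, data.shape)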

All the files are ok.
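A stricter variant of this check flags every file whose shape is not 3D instead of only printing the shapes. The sketch below is an illustration, not part of nnUNet: the names find_non_3d and collect_shapes are hypothetical, and nibabel is imported lazily so that the shape filter can be used on its own.

```python
import os


def find_non_3d(shapes):
    """Return the entries of {name: shape} whose shape is not 3-dimensional."""
    return {name: shape for name, shape in shapes.items() if len(shape) != 3}


def collect_shapes(folder):
    """Read the shape of every .nii.gz file in folder and return {name: shape}."""
    import nibabel as nib  # imported here so find_non_3d works without nibabel
    shapes = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith(".nii.gz"):
            shapes[name] = tuple(nib.load(os.path.join(folder, name)).shape)
    return shapes
```

Calling find_non_3d(collect_shapes(folder)) on a data folder returns an empty dict when every volume is 3D.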

FabianIsensee commented 5 years ago

Hi there, I apologize, I may not have chosen the right word. By "corrupted" I meant that the pkl file seems wrong, and I can now confirm that based on the files you sent me:

from batchgenerators.utilities.file_and_folder_operations import *
a = load_pickle('train_643.pkl')
print(a['crop_bbox'])

(1, 59, 512, 512)

The output is supposed to look different. I deleted my files from the Liver task and reran the cropping and this is what it looks like:

from batchgenerators.utilities.file_and_folder_operations import *
a = load_pickle('liver_22.pkl')
print(a['crop_bbox'])

[[0, 247], [0, 512], [0, 512]]

My previous comment about rerunning the preprocessing was incorrect. You should try to rerun the data cropping: delete the folder that belongs to your task in nnUNet_raw_cropped and rerun the preprocessing. You can then check with the code snippet above whether it worked. If it still does not work, have a look at your Task03_Liver data. Do they have the same error? If not, what is different in your data? Is there maybe a 2D dataset in there? Or is train_643 4D (which it should not be!)? You should check for 3D/4D in the nnUNet_raw_splitted folder, as this is where the cropping takes the data from. Best, Fabian
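To run the crop_bbox check over all cases at once, something along these lines may help. It is a sketch under assumptions: it takes the expected layout to be three [lower, upper] pairs, as in the liver_22.pkl output above, it assumes each .pkl file holds a dict, and the helper names is_valid_crop_bbox and find_bad_pickles are hypothetical. It uses the standard pickle module directly instead of load_pickle.

```python
import pickle
from pathlib import Path


def is_valid_crop_bbox(bbox):
    """True if bbox looks like [[lo, hi], [lo, hi], [lo, hi]] with lo < hi."""
    try:
        return len(bbox) == 3 and all(
            len(axis) == 2 and int(axis[0]) < int(axis[1]) for axis in bbox
        )
    except (TypeError, ValueError):
        # Raised when bbox is None or its entries are plain ints, not pairs
        return False


def find_bad_pickles(folder):
    """Return {file name: crop_bbox} for every .pkl in folder with a suspicious crop_bbox."""
    bad = {}
    for pkl in sorted(Path(folder).glob("*.pkl")):
        with open(pkl, "rb") as f:
            props = pickle.load(f)  # assumed to be a dict of case properties
        bbox = props.get("crop_bbox")
        if not is_valid_crop_bbox(bbox):
            bad[pkl.name] = bbox
    return bad
```

A value such as (1, 59, 512, 512) is reported as bad, while [[0, 247], [0, 512], [0, 512]] passes.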

FabianIsensee commented 5 years ago

Just as a side note: on Task03_Liver the cascade did not do so well. Not quite sure why. What I show here is the average foreground Dice (so the mean Dice of liver and tumor) from the 5-fold cross-validation. You may not have to run the cascade to get the best results.

2d                  0.7345
3d_cascade_fullres  0.74
3d_fullres          0.7686
3d_lowres           0.7314

JunMa11 commented 5 years ago

Hi @FabianIsensee ,

Thanks for your help. I reran the preprocessing, and it works now. A new folder named pred_next_stage is generated. I have no idea why the previous 'train_643.pkl' had a wrong 'crop_bbox'. Anyway, it works well now.

Regarding the lower performance of the cascade on the LiTS task: in your Decathlon challenge paper,

nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation

What is the motivation of using the outputs of lowres-UNet as additional input channels for the second U-Net?

Intuitively, we can generate ROI image based on the 1st U-Net segmentation (ignoring the region outside the segmentation bounding box), then the 2nd U-Net directly segments the ROI image.

For LiTS, we can even only segment the tumor in liver mask. In this way, it can not only reduce computation burden, but also exclude interruptions outside the liver. I guess nnUNet is designed for general segmentation tasks, so it doesn't do this for LiTS.

Best, Jun

JunMa11 commented 5 years ago

A quick question on the learning rate setting in a transfer learning scenario. My task is also liver tumor segmentation (MR), but I only have a small dataset. So I want to take the model trained on the LiTS dataset and fine-tune it on my small dataset.

Do I need to reduce the initial_lr in nnUNetTrainer.py (eg. reduce to 3e-5)? Or nnU-Net will automatically adjust lr to accord with current training.

FabianIsensee commented 5 years ago

What is the motivation of using the outputs of lowres-UNet as additional input channels for the second U-Net?

The motivation is that the patch size for 3d_fullres may be too small to capture sufficient contextual information for the UNet to properly segment the target structure. By using 3d_lowres we guarantee that enough contextual information is captured, at the cost of reduced spatial resolution. The second stage of the cascade is intended to refine these segmentations.
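How a low-resolution segmentation can be fed to the second stage as additional input channels is sketched below. This is an illustration under assumptions, not nnUNet's actual code: it assumes the low-resolution prediction has already been resampled to the full-resolution grid and that each foreground class becomes one binary channel. The function name add_lowres_seg_channels is hypothetical.

```python
import numpy as np


def add_lowres_seg_channels(image, lowres_seg, num_classes):
    """Append one binary channel per foreground class to the image.

    image:      array of shape (C, Z, Y, X)
    lowres_seg: integer label map of shape (Z, Y, X), already resampled
                to the image grid (assumption)
    returns:    array of shape (C + num_classes, Z, Y, X)
    """
    if image.shape[1:] != lowres_seg.shape:
        raise ValueError("segmentation must match the spatial shape of the image")
    # Labels 1..num_classes are foreground; background (0) gets no channel
    onehot = np.stack(
        [lowres_seg == c for c in range(1, num_classes + 1)]
    ).astype(image.dtype)
    return np.concatenate([image, onehot], axis=0)
```

With one MR modality and num_classes = 2, as in the log above, the second network would receive 3 input channels.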

Intuitively, we can generate ROI image based on the 1st U-Net segmentation (ignoring the region outside the segmentation bounding box), then the 2nd U-Net directly segments the ROI image.

This is a very sensible thing to do and in fact something we could/should have done. Indeed, I am thinking about implementing a mix of the two. Doing solely what you suggested may not be ideal if the target structures are distributed all across the images (and not just within a specific target organ, for example).
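The ROI idea can be sketched as a crop to the bounding box of the first-stage segmentation. This is a minimal illustration, not something nnUNet implements according to this thread; the function name crop_to_foreground and the margin parameter are hypothetical.

```python
import numpy as np


def crop_to_foreground(image, seg, margin=0):
    """Crop image (Z, Y, X) to the bounding box of seg > 0, padded by margin voxels.

    Returns the cropped image and the bounding box as [[lo, hi], ...] per axis.
    """
    coords = np.argwhere(seg > 0)
    if coords.size == 0:
        # No foreground predicted: keep the whole image
        return image, [[0, s] for s in image.shape]
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, seg.shape)
    bbox = [[int(a), int(b)] for a, b in zip(lo, hi)]
    slices = tuple(slice(a, b) for a, b in bbox)
    return image[slices], bbox
```

As noted above, a single bounding box helps little when the targets are spread across the whole image, because the box then covers almost everything.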

For LiTS, we can even only segment the tumor in liver mask. In this way, it can not only reduce computation burden, but also exclude interruptions outside the liver.

If there is a hierarchy to the labels, then this is definitely worth doing; see my BraTS2018 paper. But as you said, nnU-Net is intended to be general purpose and we don't know about label hierarchies.

Do I need to reduce the initial_lr in nnUNetTrainer.py (eg. reduce to 3e-5)? Or nnU-Net will automatically adjust lr to accord with current training.

I have no experience with fine-tuning; you need to figure that out yourself, sorry. nnU-Net will, however, decrease the learning rate automatically if it does not detect an improvement within recent epochs. So the training may be shorter. But really, fine-tuning involves a lot more than just the learning rate, I think. Some people like warm starts, some decrease the learning rate. There is also a variety of learning rate schedules for that. Honestly, I don't know.
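The "decrease when there is no improvement" behaviour is the general reduce-on-plateau idea, which can be sketched as follows. This is a generic illustration and not nnU-Net's scheduler: the class name PlateauLR and the factor and patience defaults are hypothetical values chosen for the example.

```python
class PlateauLR:
    """Multiply the learning rate by factor after more than patience epochs without improvement."""

    def __init__(self, lr, factor=0.2, patience=30):
        self.lr = lr
        self.factor = factor      # hypothetical value
        self.patience = patience  # hypothetical value
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Starting such a schedule from a smaller value, for example the 3e-5 mentioned in the question, only changes the starting point; whether that helps fine-tuning has to be determined empirically.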

Hope this helps, Best, Fabian

JunMa11 commented 5 years ago

Hi Fabian,

Got it. Thank you very much for your answer.

Best, Jun

sbajpai2 commented 4 years ago

Hi Jun,

You can look into Models Genesis, developed by our lab. We provide pre-trained weights for the nnUNet framework (transfer learning): https://github.com/MrGiovanni/ModelsGenesis/tree/master/competition

Best, Shivam