MECLabTUDA / Lifelong-nnUNet


Unable to run LwF on custom segmentation tasks #7

Open adwaykanhere opened 11 months ago

adwaykanhere commented 11 months ago

Hi @amrane99 ,

Thank you for this great library!

I'm interested in testing the LwF strategy with your library. I have two datasets (711, 713) that are already in the right format, planned, and preprocessed. When I follow your documentation to run the LwF scheme, training runs for the full 250 epochs on the first dataset, but I get the following error before training on the second dataset begins:

FileNotFoundError: [Errno 2] No such file or directory: '/home/akanhere/nnunet/nnUNet_preprocessed/Task713_Peds_Fine/nnUNetPlansv2.1_plans_3D.pkl'

I've looked at the contents of my preprocessed folder, and the nnUNetPlansv2.1_plans_3D.pkl file from the error is not there at all. Instead, I have the following subdirectories (I'm using a custom trainer function):

gt_segmentations/
nnUNetData_plans_v2.1_2D_stage0/
nnUNetData_pretrained_nnUNetTrainerV3_100epochs__nnUNetPlansv2.1_stage0/
nnUNetData_pretrained_nnUNetTrainerV3_100epochs__nnUNetPlansv2.1_stage1/

Could you please let me know how I can run this setup correctly?

amrane99 commented 11 months ago

Hey @adwaykanhere,

thanks for using our Framework. Looking at the error, the plans file cannot be found; this file is generated during the plan and preprocessing step of the data. Since you successfully trained on 711, it seems that you either forgot to preprocess dataset 713 the same way as 711, or the preprocessing failed due to some discrepancy in your dataset, since you said it is a custom dataset. So ensure that your data is properly preprocessed using nnU-Net's plan and preprocess command, as mentioned in their training example here: https://github.com/MIC-DKFZ/nnUNet/blob/nnunetv1/documentation/training_example_Hippocampus.md (double-check that step 4 works without errors on 711 and 713). To verify that the problem is not the Lifelong nnU-Net Framework, try to train a plain nnU-Net on 713; you will probably run into the same issue, since the plans needed to build the nnU-Net have not been generated.
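For reference, a minimal sketch of that step, assuming the standard nnU-Net v1 CLI and the default environment variables (nnUNet_raw_data_base, nnUNet_preprocessed):

nnUNet_plan_and_preprocess -t 711 713 --verify_dataset_integrity

Once this finishes without errors, nnUNet_preprocessed/Task713_Peds_Fine/ should contain the nnUNetPlansv2.1_plans_3D.pkl file that your error message complains about.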

I hope this will help to get LwF running :)

Best, Amin

adwaykanhere commented 11 months ago

Hey @amrane99, thanks for your quick response. Yes, running the plan and preprocess command fixed it!

adwaykanhere commented 10 months ago

Hi @amrane99, opening this issue again as you suggested.

The command I'm running for the LwF trainer is

nnUNet_train_lwf 3d_fullres -t 711 713 -f 0 -lwf_temperature 0.75 -num_epoch 250 -d 3 -save_interval 25 -s seg_outputs --store_csv

I have verified that the paths are set up correctly by referring to the README, but when I try to run inference, the plans.pkl file cannot be found. The error message is as follows:

FileNotFoundError: [Errno 2] No such file or directory: '/home/akanhere/nnunet/RESULTS/nnUNet_ext/3d_fullres/Task711_Ped_Fresh_Task713_Peds_Fine/Task713_Peds/nnUNetTrainerLWF__nnUNetPlansv2.1/Generic_UNet/SEQ/plans.pkl'

Thanks so much for your support!

amrane99 commented 10 months ago

Hi, if the code does not find the plans.pkl file of a dataset, then you did not plan and preprocess the datasets correctly using the nnU-Net command.

adwaykanhere commented 10 months ago

Hi @amrane99, I double-checked that both datasets contain their plans files in the nnUNet_preprocessed folder after running the nnUNet_plan_and_preprocess command. I think this issue is about the plans file not being found after training with the LwF trainer, not about the preprocessing step.
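For what it's worth, this is roughly the check I ran, assuming the task folder names from the error messages above and the standard nnUNet_preprocessed environment variable:

ls $nnUNet_preprocessed/Task711_Ped_Fresh/nnUNetPlansv2.1_plans_3D.pkl
ls $nnUNet_preprocessed/Task713_Peds_Fine/nnUNetPlansv2.1_plans_3D.pkl

Both files are there; what is missing is the plans.pkl under the trainer's results folder (the .../Generic_UNet/SEQ/ path in the error above).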

hdnminh commented 10 months ago

Hi @adwaykanhere, I ran into a problem while running pip install -r requirements.txt (screenshot of the error attached).

Have you run it yet, and did you hit any errors?

Thank you in advance!

adwaykanhere commented 10 months ago

@hdnminh try pip install scikit-learn instead of sklearn, as the sklearn name on PyPI is only a deprecated placeholder, not the actual package.
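A possible way to apply that fix, assuming the failure is caused by a sklearn entry in the repository's requirements.txt (adjust to whatever your file actually contains):

# Replace the deprecated "sklearn" requirement with "scikit-learn", then retry the install.
sed -i 's/^sklearn.*$/scikit-learn/' requirements.txt
pip install -r requirements.txt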

adwaykanhere commented 10 months ago

Hi @amrane99, please help!
UPDATE: I double-checked that both tasks are properly planned and preprocessed, and ran the command to train the LwF trainer again as follows:

nnUNet_train_lwf 3d_fullres -t 711 713 -f 0 -lwf_temperature 0.75 -num_epoch 250 -d 3 -save_interval 25 -s seg_outputs --store_csv

but this time, when saving the model at epoch 25, I get the following error:


2023-11-21 18:30:59.526378:
epoch:  24
Epoch 25/250:  70%|████████████████████████████████████████████                   | 175/250 [00:51<00:40,  1.86it/s, loss=0.16811286]case does not contain any foreground classes Pediatric-CT-SEG-402
Epoch 25/250: 100%|██████████████████████████████████████████████████████████████| 250/250 [01:15<00:00,  3.31it/s, loss=0.119534254]
2023-11-21 18:32:15.073188: train loss : 0.3600
2023-11-21 18:32:22.646182: validation loss: 0.3064
2023-11-21 18:32:22.647644: Average global foreground Dice: [0.3833, 0.3305, 0.4023, 0.5125, 0.801, 0.0634, 0.0, 0.2405, 0.1261, 0.0, 0.0, 0.0, 0.0]
2023-11-21 18:32:22.647881: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2023-11-21 18:32:23.097766: lr: 0.009095
2023-11-21 18:32:23.097904: saving scheduled checkpoint file...
2023-11-21 18:32:23.147022: saving checkpoint...
2023-11-21 18:32:23.862422: done, saving took 0.76 seconds
2023-11-21 18:32:23.869126: done
2023-11-21 18:32:23.896848: saving checkpoint...
2023-11-21 18:32:24.556477: done, saving took 0.69 seconds
Traceback (most recent call last):
  File "/home/akanhere/.conda/envs/llnnunet/bin/nnUNet_train_lwf", line 33, in <module>
    sys.exit(load_entry_point('nnunet-ext', 'console_scripts', 'nnUNet_train_lwf')())
  File "/home/akanhere/llnnunet_inference/Lifelong-nnUNet/nnunet_ext/run/run_training.py", line 942, in main_lwf
    run_training(extension='lwf')
  File "/home/akanhere/llnnunet_inference/Lifelong-nnUNet/nnunet_ext/run/run_training.py", line 851, in run_training
    trainer.run_training(task=t, output_folder=output_folder_name)
  File "/home/akanhere/llnnunet_inference/Lifelong-nnUNet/nnunet_ext/training/network_training/lwf/nnUNetTrainerLWF.py", line 179, in run_training
    ret = super().run_training(task, output_folder)
  File "/home/akanhere/llnnunet_inference/Lifelong-nnUNet/nnunet_ext/training/network_training/multihead/nnUNetTrainerMultiHead.py", line 561, in run_training
    ret = super().run_training()
  File "/home/akanhere/.conda/envs/llnnunet/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainerV2.py", line 440, in run_training
    ret = super().run_training()
  File "/home/akanhere/.conda/envs/llnnunet/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainer.py", line 317, in run_training
    super(nnUNetTrainer, self).run_training()
  File "/home/akanhere/.conda/envs/llnnunet/lib/python3.9/site-packages/nnunet/training/network_training/network_trainer.py", line 484, in run_training
    continue_training = self.on_epoch_end()
  File "/home/akanhere/llnnunet_inference/Lifelong-nnUNet/nnunet_ext/training/network_training/lwf/nnUNetTrainerLWF.py", line 385, in on_epoch_end
    res = super().on_epoch_end()
  File "/home/akanhere/llnnunet_inference/Lifelong-nnUNet/nnunet_ext/training/network_training/multihead/nnUNetTrainerMultiHead.py", line 661, in on_epoch_end
    self._perform_validation()
  File "/home/akanhere/llnnunet_inference/Lifelong-nnUNet/nnunet_ext/training/network_training/multihead/nnUNetTrainerMultiHead.py", line 717, in _perform_validation
    _ = get_default_configuration(self.network_name, task, running_task, trained_on_folds['prev_trainer'][idx],\
  File "/home/akanhere/llnnunet_inference/Lifelong-nnUNet/nnunet_ext/run/default_configuration.py", line 44, in get_default_configuration
    assert network in ['2d', '3d_lowres', '3d_fullres'], \
AssertionError: The network for the nnU-Net CL extension can only be one of the following: '2d', '3d_lowres', '3d_fullres', not 'False'

This error doesn't make sense to me, since I explicitly specified the 3d_fullres configuration on the command line.
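In case it helps narrow this down: the traceback shows the assertion fires inside _perform_validation (called from on_epoch_end), not while parsing the command line, and the first argument passed to get_default_configuration there is self.network_name. So whatever reaches that call at validation time is 'False' rather than the 3d_fullres I specified. Below is a standalone reconstruction of the failing check, paraphrased from the traceback above (not the library's actual code):

# Paraphrased from the assertion in nnunet_ext/run/default_configuration.py as quoted in the traceback.
def check_network(network):
    assert network in ['2d', '3d_lowres', '3d_fullres'], \
        "The network for the nnU-Net CL extension can only be one of the following: " \
        "'2d', '3d_lowres', '3d_fullres', not '%s'" % network

check_network('3d_fullres')  # passes: the configuration given on the command line
check_network('False')       # raises the AssertionError shown in the log above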