Beckschen / 3D-TransUNet

This is the official repository for the paper "3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers"

KeyError on Single GPU Setup and Clarification on Rank Environment Variable #31

Open mariem-m11 opened 1 month ago

mariem-m11 commented 1 month ago

I am attempting to train on a single-GPU setup. I modified the config file to use nnUNetTrainerV2 instead of nnUNetTrainerV2_DDP, and changed train.sh like so:

nnunet_use_progress_bar=1 CUDA_VISIBLE_DEVICES=0 torchrun ./train.py --task="Task180_BraTSMet" --fold=${fold} --config=$CONFIG --network="3d_fullres" --resume='' --local-rank=0 --optim_name="adam" --valbest --val_final --npz
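On the rank environment variable mentioned in the title: torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to the process it launches, so a single-process launch sees rank 0 and world size 1 regardless of the --local-rank argument. A minimal sketch of how a launched script can read them (illustration only, not taken from the repo's train.py):

```python
# Illustration only (not the repo's code): environment variables set by torchrun.
# For a single-process launch these come out as RANK=0, LOCAL_RANK=0, WORLD_SIZE=1.
import os

rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```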

The script throws a KeyError when it tries to read patch_size from plan_data['plans']['plans_per_stage'][resolution_index].

Error Message:


Using configuration: configs/Brats/decoder_only.yaml
Running on fold: 0

['/kaggle/working/nnUNet/nnunet/nnUNet_raw_data_base/nnUNet_raw_data/training/network_training'] nnUNetTrainerV2 nnunet.training.network_training

I am running the following nnUNet: 3d_fullres
My trainer class is:  <class 'nn_transunet.trainer.nnUNetTrainerV2.nnUNetTrainerV2'>
For that I will be using the following configuration:
I am using stage 0 from these plans
I am using sample dice + CE loss

I am using data from this folder: /kaggle/working/nnUNet/nnunet/preprocessed/Task180_BraTSMet/nnUNetData_plans_v2.1

Traceback (most recent call last):
  File "/kaggle/working/3D-TransUNet/./train.py", line 321, in <module>
    main()
  File "/kaggle/working/3D-TransUNet/./train.py", line 202, in main
    patch_size = plan_data['plans']['plans_per_stage'][resolution_index]['patch_size']
KeyError: 1
[2024-08-13 16:59:32,745] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1841) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib.python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib.python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(this._config, this._entrypoint, list(args))
  File "/opt/conda/lib.python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-13_16:59:32
  host      : ec35b85052cc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1841)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
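For anyone hitting the same KeyError: 1: in nnU-Net v1 plans, plans_per_stage is a dict with one entry per resolution stage, and a dataset planned only for 3d_fullres typically contains just stage 0, so indexing stage 1 fails. A quick check of what the plans actually contain (a sketch; the plans filename below is the usual nnU-Net v1 default and the folder is the one printed in the log above, so adjust both to your setup):

```python
# Diagnostic sketch, not from the repo: list the stage indices present in the
# nnU-Net plans before anything indexes plans_per_stage.
import pickle

# Assumed filename (standard nnU-Net v1 convention); the folder is taken from the log above.
plans_file = "/kaggle/working/nnUNet/nnunet/preprocessed/Task180_BraTSMet/nnUNetPlansv2.1_plans_3D.pkl"
with open(plans_file, "rb") as f:
    plans = pickle.load(f)

stages = plans["plans_per_stage"]                  # dict keyed by stage index
print("available stage keys:", sorted(stages))     # a 3d_fullres-only plan often has only [0]
for idx, stage in stages.items():
    print(idx, "patch_size:", stage["patch_size"])
```

If only key 0 shows up, any resolution_index of 1 reproduces exactly this error; for a single-stage 3d_fullres plan, pointing the index at stage 0 (or at max(stages)) is a plausible workaround.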
princerice commented 2 weeks ago

I have the same problem as you. Have you solved it? Here is my output:

Using configuration: /root/autodl-tmp/3dtransunet/configs/Brats/encoder_plus_decoder.yaml
Running on fold: 0

Please cite the following paper when using nnUNet:

Isensee, F., Jaeger, P.F., Kohl, S.A.A. et al. "nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation." Nat Methods (2020). https://doi.org/10.1038/s41592-020-01008-z

If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet

['/root/autodl-tmp/3dtransunet/training/network_training'] nnUNetTrainerV2_DDP nnunet.training.network_training
###############################################
I am running the following nnUNet: 3d_fullres
My trainer class is: <class 'nn_transunet.trainer.nnUNetTrainerV2_DDP.nnUNetTrainerV2_DDP'>
For that I will be using the following configuration:
I am using stage 0 from these plans
I am using sample dice + CE loss

I am using data from this folder: /root/autodl-tmp/3dtransunet/data/nnUNet_preprocessed/Task082_BraTS2020/nnUNetData_plans_v2.1
###############################################
Traceback (most recent call last):
  File "train.py", line 321, in <module>
    main()
  File "train.py", line 202, in main
    patch_size = plan_data['plans']['plans_per_stage'][resolution_index]['patch_size']
KeyError: 1


mariem-m11 commented 2 days ago

After modifying nnUNetTrainerV2, I'm running into a new issue: training stops after the first epoch because of problems with how the loss functions are handled. I've made several changes, but I'm still stuck. @princerice, have you found a solution to this problem?