MIC-DKFZ / batchgenerators

A framework for data augmentation for 2D and 3D image classification and segmentation
Apache License 2.0
1.09k stars 221 forks

RuntimeError #111

Open Chao86 opened 1 year ago

Chao86 commented 1 year ago

Hi @FabianIsensee, I'm using the example "multithreaded_with_batches.ipynb" to generate my own batch data; however, the RuntimeError shown in the attached screenshot appeared. Can you offer me a hint to solve this?

FabianIsensee commented 1 year ago

Hi, there must be another error message somewhere in your output. Can you look for it?

vcasellesb commented 1 year ago

Hi!

Apologies Fabian and Chao for invading this issue, but I am having a similar issue to yours, and maybe I can clarify the error Chao was getting.

For context, I was trying to run nnUNet with modified code. I changed max_num_epochs from 1000 to 400 and the lr threshold from 1e-6 to 5e-3. Furthermore, I was getting stuck at validation, so I implemented the change mentioned in https://github.com/MIC-DKFZ/nnUNet/issues/902. As you can see, I commented out the original code and replaced it with the following (line 662 of nnUNetTrainer):

                # changed by vicent 09/03/22 to speed up validation according to github issue #902
                # results.append(export_pool.starmap_async(save_segmentation_nifti_from_softmax,
                #                                          ((softmax_pred, join(output_folder, fname + ".nii.gz"),
                #                                            properties, interpolation_order, self.regions_class_order,
                #                                            None, None,
                #                                            softmax_fname, None, force_separate_z,
                #                                            interpolation_order_z),
                #                                           )
                #                                          )
                #                )

                save_segmentation_nifti_from_softmax(softmax_pred, join(output_folder, fname + ".nii.gz"),
                                                     properties, interpolation_order, self.regions_class_order,
                                                     None, None,
                                                     softmax_fname, None, force_separate_z,
                                                     interpolation_order_z)

I don't believe this is the problem, though, since my error happens at the very beginning of training, and this change, I believe, mainly affects validation.

Anyway, this is the error message I got. As you can see, there is no useful message to understand what is going on, only that the exception happened in thread 4.

loading dataset
loading all case properties
unpacking dataset
done
2023-03-09 21:32:24.056818: lr: 0.01
using pin_memory on device 0
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("Abort event was set. So someone died and we should end this madness. \nIMPORTANT: "
RuntimeError: Abort event was set. So someone died and we should end this madness. 
IMPORTANT: This is not the actual error message! Look further up to see what caused the error. Please also check whether your RAM was full
Traceback (most recent call last):
  File "/home/vcaselles/anaconda3/envs/dents/bin/nnUNet_train", line 8, in <module>
    sys.exit(main())
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/run/run_training.py", line 180, in main
    trainer.run_training()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainerV2_epoch400_lr_thr_0005.py", line 441, in run_training
    ret = super().run_training()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/nnUNetTrainer_modvalidation.py", line 317, in run_training
    super(nnUNetTrainer_modvalidation, self).run_training()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/nnunet/training/network_training/network_trainer.py", line 418, in run_training
    _ = self.tr_gen.next()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 182, in next
    return self.__next__()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 206, in __next__
    item = self.__get_next_item()
  File "/home/vcaselles/anaconda3/envs/dents/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 190, in __get_next_item
    raise RuntimeError("MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of "
RuntimeError: MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of your workers crashed. This is not the actual error message! Look further up your stdout to see what caused the error. Please also check whether your RAM was full
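For anyone puzzled by how unhelpful this traceback is: MultiThreadedAugmenter coordinates its workers through a shared abort event, so the consumer thread only ever learns that "someone died", while the real traceback is printed by the crashing worker earlier in stdout (and if the worker is killed by the OS OOM killer, there may be no traceback at all). A rough, simplified sketch of that pattern, with illustrative names rather than the actual batchgenerators code:

```python
# Simplified sketch of the abort-event pattern behind the messages above:
# the worker that crashes prints its own traceback and sets a shared Event;
# the consumer only sees the Event and raises a generic error.
import threading
import traceback

abort_event = threading.Event()

def worker():
    try:
        raise MemoryError("simulated: RAM full while augmenting a batch")
    except Exception:
        traceback.print_exc()   # the *real* error lands in stdout/stderr here
        abort_event.set()       # consumer only learns "something went wrong"

t = threading.Thread(target=worker)
t.start()
t.join()

def get_next_item():
    if abort_event.is_set():
        raise RuntimeError("abort_event was set, something went wrong. "
                           "This is not the actual error message!")

try:
    get_next_item()
except RuntimeError as e:
    caught = str(e)
```

This is why the message insists on looking "further up": the generic RuntimeError is raised far from where the actual failure occurred.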

Thank you very much for your attention, and apologies for the long and dense message, I hope I was clear enough.

Best regards,

Vicent Caselles

PS: To get run_training.py to work, I also had to change main() to accept my modified trainer in the workflow. I don't think that is the issue, though.

FabianIsensee commented 1 year ago

Is that all the text output you got? Can you please share everything? Usually there is an error message hidden somewhere.

FabianIsensee commented 1 year ago

Why not just use nnUNet_train?

vcasellesb commented 1 year ago

Hi Fabian, thank you very much for your response. Regarding your questions:

1. Yes, unfortunately that was all the error output I got.
2. I created a new custom nnUnet_trainer class with my custom max_num_epochs and lr threshold, both defined in its init function. Did I make a mistake doing that?

I honestly think the error was caused by a lack of RAM: I was using a server with a great GPU but terrible RAM (~2 GB or so), so odds are that was the problem.
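If anyone wants to verify the RAM hypothesis on the server itself, here is a quick Linux-only check (a sketch using `os.sysconf`; other platforms would need something like psutil):

```python
# Linux-only sketch: report available vs. total physical RAM via sysconf.
# If available memory is only a few GB, the data augmentation workers are
# a likely culprit for OOM crashes during training.
import os

page_size = os.sysconf("SC_PAGE_SIZE")
avail_bytes = page_size * os.sysconf("SC_AVPHYS_PAGES")
total_bytes = page_size * os.sysconf("SC_PHYS_PAGES")

avail_gb = avail_bytes / 1024**3
total_gb = total_bytes / 1024**3
print(f"RAM: {avail_gb:.1f} GB available of {total_gb:.1f} GB total")
```

If memory really is that tight, reducing the number of data augmentation workers should help; if I remember correctly, nnU-Net v1 reads the `nnUNet_n_proc_DA` environment variable for this (e.g. `export nnUNet_n_proc_DA=4`), though please double-check against the nnU-Net docs for your version.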

Thanks again for your time!!

Vicent Caselles

FabianIsensee commented 1 year ago

Yeah sounds like it. Are you certain about 2GB? That's year 2000 level of RAM

vcasellesb commented 1 year ago

Yes, it was the cheapest AWS server with CUDA... It was 4 GB tops.

Vicent