leetesua closed this issue 4 years ago
Hi, I have done an update recently that requires pytorch 1.6.0. It could be that by upgrading you also have upgraded pytorch and that the automatically installed pytorch version is incompatible with your driver because it was built with a newer version of CUDA. Please upgrade your graphics driver or downgrade nnU-Net. When posting error messages, please be sure to post the entire message, not just the end. The actual error is most often way up. Ideally you send the entire stdout from start to error ;-) Best, Fabian
Hi, you can ignore the is_alive errors. That is just the data loader dying. As I said in my previous post, please pip uninstall torch and then reinstall an older version of PyTorch that is supported by your driver. Best, Fabian
On Fri, Sep 11, 2020, 04:48 xbsj_ldx0908 notifications@github.com wrote:
First I ran pip install --upgrade nnunet and reran training, and got this:
(py37) lidexuan@SF-BS-13:/data3/lidexuan/nnUNet/nnunet$ OMP_NUM_THREADS=1 python run/run_training.py 3d_fullres nnUNetTrainer ribfrac 4 --ndet
Please cite the following paper when using nnUNet:
Fabian Isensee, Paul F. Jäger, Simon A. A. Kohl, Jens Petersen, Klaus H. Maier-Hein "Automated Design of Deep Learning Methods for Biomedical Image Segmentation" arXiv preprint arXiv:1904.08128 (2020).
If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet
nnUNet_raw_data_base is not defined and nnU-Net can only be used on data for which preprocessed files are already present on your system. nnU-Net cannot be used for experiment planning and preprocessing like this. If this is not intended, please read nnunet/paths.md for information on how to set this up properly.
nnUNet_preprocessed is not defined and nnU-Net can not be used for preprocessing or training. If this is not intended, please read nnunet/pathy.md for information on how to set this up.
RESULTS_FOLDER is not defined and nnU-Net cannot be used for training or inference. If this is not intended behavior, please read nnunet/paths.md for information on how to set this up
---------? 3d_fullres ribfrac nnUNetTrainer nnUNetPlansv2.1
Traceback (most recent call last):
  File "run/run_training.py", line 83, in <module>
    trainer_class = get_default_configuration(network, task, network_trainer, plans_identifier)
  File "/home/lidexuan/.local/lib/python3.7/site-packages/nnunet/run/default_configuration.py", line 40, in get_default_configuration
    dataset_directory = join(preprocessing_output_dir, task)
  File "/root/anaconda3/envs/py37/lib/python3.7/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
Then I downgraded nnunet by running pip install -e .,
then reran training and got this:
(py37) lidexuan@SF-BS-13:/data3/lidexuan/nnUNet/nnunet$ OMP_NUM_THREADS=1 python run/run_training.py 3d_fullres nnUNetTrainer ribfrac 4 --ndet
Please cite the following paper when using nnUNet:
Isensee, Fabian, et al. "nnU-Net: Breaking the Spell on Successful Medical Image Segmentation." arXiv preprint arXiv:1904.08128 (2019).
If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet
###############################################
I am running the following nnUNet: 3d_fullres
My trainer class is: <class 'nnunet.training.network_training.nnUNetTrainer.nnUNetTrainer'>
For that I will be using the following configuration:
num_classes: 2
modalities: {0: 'CT'}
use_mask_for_norm OrderedDict([(0, False)])
keep_only_largest_region OrderedDict([((2,), False), ((1,), True), ((2, 1), False)])
min_region_size_per_class OrderedDict([(1, 30.55300220489502), (2, 39.52484177819552)])
min_size_per_class OrderedDict([(1, 30.55300220489502), (2, 39.52484177819552)])
normalization_schemes OrderedDict([(0, 'CT')])
stages...
stage: 0 {'batch_size': 2, 'num_pool_per_axis': [4, 5, 5], 'patch_size': array([ 96, 160, 128]), 'median_patient_size_in_voxels': array([148, 231, 231]), 'current_spacing': array([2.77089402, 1.65604239, 1.65604239]), 'original_spacing': array([1.25 , 0.74707043, 0.74707043]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[1, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}
stage: 1 {'batch_size': 2, 'num_pool_per_axis': [4, 5, 5], 'patch_size': array([ 96, 160, 128]), 'median_patient_size_in_voxels': array([329, 512, 512]), 'current_spacing': array([1.25 , 0.74707043, 0.74707043]), 'original_spacing': array([1.25 , 0.74707043, 0.74707043]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[1, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}
I am using stage 1 from these plans
I am using batch dice + CE loss
I am using data from this folder: /data3/lidexuan/nnUNet/nnuet/preprocessed_data/ribfrac/nnUNet
###############################################
2020-09-11 10:41:47.192188: unpacking dataset
2020-09-11 10:41:47.335607: done
Traceback (most recent call last):
  File "run/run_training.py", line 99, in <module>
    trainer.initialize(not validation_only)
  File "/data3/lidexuan/nnUNet/nnunet/training/network_training/nnUNetTrainer.py", line 203, in initialize
    self.initialize_network_optimizer_and_scheduler()
  File "/data3/lidexuan/nnUNet/nnunet/training/network_training/nnUNetTrainer.py", line 240, in initialize_network_optimizer_and_scheduler
    self.network.cuda()
  File "/home/lidexuan/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 458, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/lidexuan/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/home/lidexuan/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/home/lidexuan/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "/home/lidexuan/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/home/lidexuan/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 458, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/lidexuan/.local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 186, in _lazy_init
    _check_driver()
  File "/home/lidexuan/.local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 77, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError: The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
Exception ignored in: <function MultiThreadedAugmenter.__del__ at 0x7fcdff638830>
Traceback (most recent call last):
  File "/home/lidexuan/.local/lib/python3.7/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 287, in __del__
  File "/home/lidexuan/.local/lib/python3.7/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 262, in _finish
AttributeError: 'NoneType' object has no attribute 'is_alive'
Exception ignored in: <function MultiThreadedAugmenter.__del__ at 0x7fcdff638830>
Traceback (most recent call last):
  File "/home/lidexuan/.local/lib/python3.7/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 287, in __del__
  File "/home/lidexuan/.local/lib/python3.7/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 262, in _finish
AttributeError: 'NoneType' object has no attribute 'is_alive'
Any idea about this? I did downgrade my version of torch but still got this. My torch version is 1.2.0, CUDA = 10.0. Thank you in advance!
/pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [3868,0,0], thread: [101,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [3868,0,0], thread: [102,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [3868,0,0], thread: [103,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [3868,0,0], thread: [104,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [3868,0,0], thread: [105,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=26 error=59 : device-side assert triggered
Traceback (most recent call last):
File "run/run_training.py", line 107, in
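The thread does not say what resolved the scatter assertion. One common cause of an out-of-range index in a scatter call during segmentation training is a label map with values outside the expected class range, so checking the labels is a cheap first step. A minimal sketch under that assumption, in plain Python; the helper name `unexpected_labels` and the toy values are hypothetical and not part of nnU-Net:

```python
def unexpected_labels(label_values, num_classes):
    """Return the sorted distinct label values outside 0 .. num_classes - 1."""
    return sorted({int(v) for v in label_values if not 0 <= int(v) < num_classes})


# Toy flattened label map (made-up values); with 3 classes the valid labels are 0, 1, 2.
toy_labels = [0, 0, 1, 2, 2, 5, -1]
print(unexpected_labels(toy_labels, num_classes=3))  # [-1, 5]
```

An empty list means every label is within range, in which case the cause lies elsewhere.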
Sometimes I got this instead (with a different version of torch):
Traceback (most recent call last):
File "run/run_training.py", line 107, in
Your comments are marked as resolved. Can I close this issue?
Yeah, problem solved. Thank you!
Hi, what metric did you use when training nnU-Net on the RibFrac dataset?
I want to train on my own data, the RibFrac dataset, and got: AttributeError: 'NoneType' object has no attribute 'is_alive'
First I ran pip install --upgrade nnunet, then reran plan_and_preprocess_task.py. Then I ran OMP_NUM_THREADS=1 python run/run_training.py 3d_fullres ...... and got this error: TypeError: expected str, bytes or os.PathLike object, not NoneType
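This TypeError is what os.path.join raises when its first argument is None, which fits the nnU-Net startup messages quoted in this thread saying that the path variables are not defined. A minimal sketch for checking them before launching training; the variable names are taken from those messages, while the helper name `missing_nnunet_paths` is made up for illustration:

```python
import os

# Environment variable names as printed in the nnU-Net startup messages in this thread.
REQUIRED_VARS = ("nnUNet_raw_data_base", "nnUNet_preprocessed", "RESULTS_FOLDER")


def missing_nnunet_paths(environ=None):
    """Return the names of the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_nnunet_paths()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All nnU-Net path variables are set.")
```

Passing a plain dict instead of the real environment makes the check easy to try without touching the shell.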
It looks like the machine is running the code in python3.7/site-packages/nnunet instead of the copy I git cloned.
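One way to confirm which copy of a package Python would import is to ask the import system for the module's location. A small sketch using only the standard library; `module_location` is a hypothetical helper name:

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # "nnunet" is the package discussed here; this prints None if it is not installed.
    print(module_location("nnunet"))
```

If the printed path points into site-packages rather than the cloned repository, the pip-installed copy is the one being run.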
Then I copied the code into site-packages and reran python run/run_training.py 3d_fullres ...... and got this error: AttributeError: 'NoneType' object has no attribute 'is_alive'
I also got another error: AssertionError: The NVIDIA driver on your system is too old (found version 10000).
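The number in "found version 10000" follows CUDA's integer version encoding, 1000 * major + 10 * minor, so 10000 means the driver supports up to CUDA 10.0. A minimal sketch of decoding that value and comparing it with the CUDA version a PyTorch build was compiled against; both helper names are hypothetical, and the 10.2 below is only an example value:

```python
def decode_cuda_version(value):
    """Split CUDA's integer version (1000 * major + 10 * minor) into (major, minor)."""
    major, rest = divmod(value, 1000)
    return major, rest // 10


def driver_supports(driver_value, build_major, build_minor):
    """True if the driver's CUDA version is at least the version the build needs."""
    return decode_cuda_version(driver_value) >= (build_major, build_minor)


print(decode_cuda_version(10000))        # (10, 0)
print(driver_supports(10000, 10, 2))     # False: example build needs CUDA 10.2
print(driver_supports(10000, 10, 0))     # True
```

When the comparison is False, the options are the two already given in the error message: update the driver, or install a PyTorch build compiled for a CUDA version the driver supports.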
Strangely, I was training on the LiTS dataset a few days ago and it worked fine. How come it doesn't work today?
Scratching my head now...