Open wsonia opened 2 years ago
Hello @wentj897, may I know which device you are using? From my experience this can be caused by an out-of-memory error. You can reduce the memory requirement by lowering:

- batch_size, num_samples, train_patch_size for the training step
- sw_batch_size, val_patch_size for the validation step

You are having the problem at the validation step, so I think you should first reduce val_patch_size to a smaller size such as 64x64x64 that fits on your device. Additionally, the training step uses even more memory, so I think you should reduce the training parameters as well.
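To make that concrete, here is a rough sketch of the training-step flags. The flag names are taken from the train.py command quoted later in this thread; the values are only examples to tune against your GPU memory:

```bash
# Sketch: training-step memory knobs (illustrative values only; keep the rest of
# your train.py arguments, e.g. --dataset and --root_dir, unchanged)
python train.py \
  --batch_size 1 \
  --num_samples 1 \
  --train_patch_size 64 64 64
```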
Thank you very much for your open-source code. I also encountered this problem in the training stage. How do you solve it? My GPU is a 3090 with CUDA 11.3. I've tried reducing batch_size, num_samples, and train_patch_size, but it did not work.
path/to/luna/
├── imgs
└── segs
Should the files extracted from subset0-subset9 be stored in the imgs folder, and the extracted seg-lungs-LUNA16 files in the segs folder? Do they need any other preprocessing?
Hello @kingjames1155, sorry for my late reply. For your first question: if you have the same problem as @wentj897, you are actually getting the error at the validation step, not the training step, so try reducing val_patch_size and sw_batch_size first to see if that solves the problem. For your second question: yes, you just need to extract the LUNA dataset as it is. I also point out some erroneous files in the README that you need to remove.
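In command-line terms, the validation-step change would be something like the following (a sketch: --val_patch_size appears in the train.py commands in this thread, while --sw_batch_size is assumed to be exposed as a flag because the parameter is referred to by that name above):

```bash
# Sketch: validation-step memory knobs (illustrative values only; keep the rest
# of your train.py arguments unchanged)
python train.py \
  --val_patch_size 64 64 64 \
  --sw_batch_size 1
```

As far as I understand, val_patch_size sets the size of the sliding-window patch fed through the network during validation and sw_batch_size sets how many of those windows are batched per forward pass, so both directly scale validation memory.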
I deployed the same environment and used the public cardiac data to run the code, but got this problem while training:

Validation sanity check: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 137, in <module>
    trainer.fit(net, datamodule=data_module)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
    self._call_and_handle_interrupt(
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
    return self._run_train()
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in _run_sanity_check
    self._evaluation_loop.run()
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 109, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 444, in validation_step
    return self.model(*args, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 92, in forward
    output = self.module.validation_step(*inputs, **kwargs)
  File "/3D-UCaps-main/module/ucaps.py", line 265, in validation_step
    val_outputs = sliding_window_inference(
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/monai/inferers/utils.py", line 130, in sliding_window_inference
    seg_prob = predictor(window_data, *args, **kwargs).to(device)  # batched patch segmentation
  File "/3D-UCaps-main/module/ucaps.py", line 171, in forward
    x = self.feature_extractor(x)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 572, in forward
    return F.conv3d(input, self.weight, self.bias, self.stride,
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
./train_ucaps_cardiac.sh: line 25: 171684 Aborted (core dumped) python train.py --log_dir ./3D-UCaps-main/logs_heart --gpus 1 --accelerator ddp --check_val_every_n_epoch 5 --max_epochs 100 --dataset task02_heart --model_name ucaps --root_dir ./3D-UCaps-main/Task02_Heart --fold 0 --cache_rate 1.0 --train_patch_size 128 128 128 --num_workers 64 --batch_size 1 --share_weight 0 --num_samples 1 --in_channels 1 --out_channels 2 --val_patch_size
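If the earlier advice in this thread applies here as well, a first thing to try would be re-running the same command with the validation window shrunk. This is only a sketch: 64 64 64 is the example size suggested above (the actual --val_patch_size value in the command is cut off), and --sw_batch_size is assumed to be a valid train.py flag because the thread refers to the parameter by that name:

```bash
# Sketch: same run as above, but with a smaller validation patch so the
# sliding-window inference in the sanity check needs less GPU memory
python train.py --log_dir ./3D-UCaps-main/logs_heart --gpus 1 --accelerator ddp \
  --check_val_every_n_epoch 5 --max_epochs 100 --dataset task02_heart \
  --model_name ucaps --root_dir ./3D-UCaps-main/Task02_Heart --fold 0 \
  --cache_rate 1.0 --train_patch_size 128 128 128 --num_workers 64 \
  --batch_size 1 --share_weight 0 --num_samples 1 --in_channels 1 \
  --out_channels 2 --val_patch_size 64 64 64 --sw_batch_size 1
```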