Use Projector in a Neural Network

spencer03172023 commented 4 months ago

Dear Kyle,

Thanks for your amazing work in the CT data processing domain.

I tried to use your Projector module as part of my neural network for training, as shown below. With it, I get an illegal memory access error (see "ERROR PART" below). If I replace this projector with one from a different library, it works. Would you mind helping to identify the root cause?

import torch
import torch.nn as nn
from leaptorch import Projector

class projector(nn.Module):
    def __init__(self):
        super(projector, self).__init__()
        proj = Projector(forward_project=True, use_static=True, use_gpu=True, gpu_device=torch.device("cuda:0"), batch_size=1)
        numCols = 736
        numAngles = 512
        pixelSize = 1.2858
        numRows = 1
        # Fan-beam geometry: 512 angles over 360 degrees, one detector row of 736 columns
        proj.leapct.set_fanbeam(numAngles, numRows, numCols, pixelSize, pixelSize, 0.5*(numRows-1), 0.5*(numCols-1), proj.leapct.setAngleArray(numAngles, 360.0), 595, 1085.6)
        proj.leapct.set_volume(numCols, numCols, numRows, voxelWidth=0.6641, voxelHeight=pixelSize)
        proj.leapct.set_flatDetector()
        proj.allocate_batch_data()
        proj.leapct.allocate_volume()
        self.pj = proj

    def forward(self, image, options):
        # options is accepted but not used here
        fp = self.pj(image)
        # Drop all singleton dimensions, then add two leading dimensions of size 1
        fp = fp.squeeze()
        fp = fp.unsqueeze(0).unsqueeze(0)
        return fp

ERROR PART:

Loaded model weights from the checkpoint at /home/midea/ai/LEAP-1.16/lightning_logs/version_123/epoch=14-step=9000.ckpt
Testing DataLoader 0: 0%| | 0/200 [00:00<?, ?it/s]
cudaMemcpy3D Error: invalid argument
kernel failed!
error name: cudaErrorIllegalAddress
error msg: an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 785, in _test_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in _run_stage
    return self._evaluation_loop.run()
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 376, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 403, in test_step
    return self.model.test_step(*args, **kwargs)
  File "/home/midea/ai/LEAP-1.16/unrolling/Optimization_main.py", line 120, in test_step
    out = self(x, p)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/midea/ai/LEAP-1.16/unrolling/Optimization_main.py", line 70, in forward
    out = self.model(x, p)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/midea/ai/LEAP-1.16/unrolling/Optimization_recon.py", line 360, in forward
    x = module(x, proj)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/midea/ai/LEAP-1.16/unrolling/Optimization_recon.py", line 329, in forward
    tmp1 = self.block1(input_data, proj)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/midea/ai/LEAP-1.16/unrolling/Optimization_recon.py", line 314, in forward
    intervening_res = self.projector_t(temp1, self.options)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/midea/ai/LEAP-1.16/unrolling/Optimization_recon.py", line 53, in forward
    bp = self.bj(proj)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/leaptorch.py", line 489, in forward
    return BackProjectorFunctionGPU.apply(input, self.proj_data, self.vol_data, self.param_id_t)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/leaptorch.py", line 89, in forward
    lct.backproject_gpu(g, f, param_id.item()) # compute input (f) from proj (g)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/midea/ai/LEAP-1.16/unrolling/Optimization_main.py", line 197, in <module>
    trainer.test(network, test_loader, ckpt_path='/home/midea/ai/LEAP-1.16/lightning_logs/version_123/epoch=14-step=9000.ckpt')
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 742, in test
    return call._call_and_handle_interrupt(
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 67, in _call_and_handle_interrupt
    trainer._teardown()
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1003, in _teardown
    self.strategy.teardown()
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 498, in teardown
    self.lightning_module.cpu()
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 79, in cpu
    return super().cpu()
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 798, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/home/midea/anaconda3/envs/leap/lib/python3.9/site-packages/torch/nn/modules/module.py", line 798, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Testing DataLoader 0: 0%| | 0/200 [00:00<?, ?it/s]

kylechampley commented 4 months ago

Could you verify that your data (projections and volume) are all on GPU 0 and are contiguous float32 arrays?
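For anyone following along, the three conditions asked about here (device, dtype, contiguity) can be checked in one place before a tensor is passed to the projector. This is only a minimal sketch: `check_leap_tensor` is a hypothetical helper, not part of LEAP, and it relies solely on standard PyTorch tensor attributes.

```python
import torch


def check_leap_tensor(t, name, device=torch.device("cuda:0")):
    """Return a list of human-readable problems found with tensor t.

    Hypothetical helper, not part of LEAP: it checks the device,
    the dtype, and the memory contiguity of the tensor.
    """
    problems = []
    if t.device != device:
        problems.append(f"{name} is on {t.device}, expected {device}")
    if t.dtype != torch.float32:
        problems.append(f"{name} has dtype {t.dtype}, expected torch.float32")
    if not t.is_contiguous():
        problems.append(f"{name} is not contiguous; call .contiguous() first")
    return problems
```

Calling it on every input of `forward` and printing the returned list makes it easy to see which tensor, if any, violates a condition. An empty list means all three checks passed, but it says nothing about whether the axis order is the one LEAP expects.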

spencer03172023 commented 4 months ago

> Could you verify that your data (projections and volume) are all on GPU 0 and are contiguous float32 arrays?

Thanks for your reply, Kyle. I checked the following in debug mode: data.dtype is torch.float32, data.is_contiguous() is True, and all data is on GPU 0.

kylechampley commented 4 months ago

Thanks for checking. The next thing to check is the order of the data. Some CT software packages store their projection data in "sinogram order", while LEAP stores it in "projection order". In LEAP, the order of the projection data is (numAngles, numRows, numCols). Is this consistent with your code?
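To illustrate the difference, here is a small sketch of the reordering. It assumes that "sinogram order" means (numRows, numAngles, numCols); that layout is an assumption for this example and should be confirmed against whatever produced the data. The function name `sinogram_to_projection_order` is hypothetical, and NumPy is used only to keep the example self-contained.

```python
import numpy as np


def sinogram_to_projection_order(sino):
    """Reorder (numRows, numAngles, numCols) to (numAngles, numRows, numCols).

    Assumes the input is in sinogram order as described in the lead-in;
    returns a C-contiguous float32 array.
    """
    if sino.ndim != 3:
        raise ValueError(f"expected a 3D array, got shape {sino.shape}")
    # Swapping the first two axes only yields a strided view, so force a
    # C-contiguous float32 buffer before handing it to GPU code.
    return np.ascontiguousarray(np.swapaxes(sino, 0, 1), dtype=np.float32)


# Geometry from the original post: 512 angles, 1 detector row, 736 columns.
sino = np.zeros((1, 512, 736), dtype=np.float32)
proj = sinogram_to_projection_order(sino)
print(proj.shape)  # (512, 1, 736)
```

For a PyTorch tensor, the equivalent would be `sino.permute(1, 0, 2).contiguous()`. The explicit copy matters because a permuted view keeps the old memory layout, and code that reads the raw buffer would still see the data in the original order.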

hws203 commented 4 months ago

@spencer03172023 How about checking the memory alignment? cudaMemcpy3D requires that the src and dst memory be aligned. The src or dst memory must therefore be allocated using cudaMallocPitch or cudaMalloc3D rather than cudaMalloc. Hope it helps.

spencer03172023 commented 4 months ago

> Thanks for checking. Next thing to check is the order of the data. Some CT software packages store their projection data in "sinogram order" while LEAP stores it in "projection order". In LEAP the order of the projections data is (numAngles, numRows, numCols). Is this consistent with your code?

Thanks, Kyle. I checked the sinogram data layout and changed it to match LEAP's requirement. It works now.

spencer03172023 commented 4 months ago

> @spencer03172023 How about checking the memory-alignment. cudaMemcpy3D requires that the src and dst memory be aligned. The src or dst memory must therefore be allocated using cudaMallocPitch or cudaMalloc3D rather than cudaMalloc. Hope it helps you.

Thanks. The problem was caused by a mismatch in the data array order.

LLNL / LEAP

Use Projector in a Neural Network #84