SuperElastix / elastix

Official elastix repository
http://elastix.dev
Apache License 2.0
484 stars 118 forks source link

OpenCL memory error CL_OUT_OF_RESOURCES #70

Open urlicht opened 6 years ago

urlicht commented 6 years ago

Hello, I'm having a difficulty running registration with OpenCL, getting this error message CL_OUT_OF_RESOURCES.

The moving and fixed images are about 4MB in size. Strangely, the memory usage maxes at about 2GB (the GPU on the machine has 8GB of memory, GTX 1080) and the GPU utilization maxes at 5%. Even if I decrease the number of resolutions to make the pyramid smaller, I still get the same error message:

Description: CL_OUT_OF_RESOURCES WARNING: Unable to configure the GPU. The OpenCLFixedGenericImagePyramid is switching back to CPU mode. Error: in function: opencl_context_notify Details: OpenCL error during context creation or runtime: CL_OUT_OF_RESOURCES error executing CL_COMMAND_WRITE_BUFFER on GeForce GTX 1080 (Device 0).

Any insights on the issue?

mstaring commented 6 years ago

I have no idea. I think this requires debugging.

chunlc commented 6 years ago

Hi , I also had difficulty running registration with OpenCl.The errors described by "urlicht" are very similar to those I encountered.

I tried to run the “elastix.exe” program ,the moving and fixed images are about 10MB in size. the memory usage maxes at about 1.2GB (the GPU on the machine has 6GB of memory, GTX 1060) and the GPU utilization is almost 0%. Registration iteration didn't really start, because the last characters displayed in the console are: Initialization of all components (before registration) took: 27 ms. What's worse is that the “elastix.exe”program can't exit.

Software environment:win10 | cuda 8.0 I use ITK 4.13.1 compiled with Windows 64 bit binaries | MSVS 2013 | CMake 3.7.2 : Module_ITKReview: ON ITK_USE_64BITS_IDS: ON ITK_LEGACY_REMOVE: ON ITK_USE_GPU: OFF I use elastix 4.9.0 compiled with Windows 64 bit binaries | MSVS 2013 | CMake 3.7.2 : ELASTIX_USE_OPENCL: ON USE_OpenCLFixedGenericPyramid: ON USE_OpenCLMovingGenericPyramid: ON USE_OpenCLResampler: ON OPENCL_C_VERSION_1_2: ON OPENCL_INCLUDE_DIRS: /usr/local/cuda/include OPENCL_LIBRARIES: /usr/local/cuda/lib64/libOpenCL.so OPENCL_OPTIMIZATION_MAD_ENABLE: ON OPENCL_USENVIDIA_SDK: ON USE_ALL_COMPONENTS:ON

The information about OpenCLDevice is as follows : Id: 0000000022DFF900 OpenCL version: 1.0 Version: OpenCL 1.2 CUDA Device type: GPU Vendor Id: 4318 Available: Yes Has compiler: Yes Has native kernels: No Has out of order execution: Yes Has double: Yes Has half float: No Has error correcting memory: No Has unified memory: No Compute units: 10 Clock frequency: 1341 Address bits: 64 Byte order: Little Endian Maximum work item size: OpenCLSize(1024, 1024, 64) Maximum work items per group: 1024 Maximum work items per group: 1024 Has image 2D: Yes Maximum image 2D size: OpenCLSize(16384, 32768) Has image 3D: Yes Maximum image 3D size: OpenCLSize(16384, 16384, 16384) Has writable image 3D: No Maximum samplers: 32 Maximum read images: 256 Maximum write images: 16 Preferred char vector size: 1 Preferred short vector size: 1 Preferred int vector size: 1 Preferred long vector size: 1 Preferred float vector size: 1 Preferred double vector size: 1 Preferred half float vector size: 0 Native char vector size: 1 Native short vector size: 1 Native int vector size: 1 Native long vector size: 1 Native float vector size: 1 Native double vector size: 1 Native half float vector size: 0 Float capabilities: Unknown Double capabilities: Unknown Half float capabilities: Not supported Profiling timer resolution: 1000 Maximum allocation size: 1610612736 Global memory size: 2147483648 Global memory cache type: Read write cache Global memory cache size: 163840 Global memory cache line size: 128 Local memory size: 49152 Local memory separated: Yes Maximum constant buffer size: 65536 Maximum constant arguments: 9 Default alignment: 512 Minimum alignment: 128 Maximum parameter bytes: 4352 Full profile: Yes Embedded profile: No Profile: FULL_PROFILE Driver version: 388.71 Name: GeForce GTX 1060 with Max-Q Design Vendor: NVIDIA Corporation Language Version: OpenCL C 1.2

The registration parameters associated with OpenCl are as follows: (FixedImagePyramid "OpenCLFixedGenericImagePyramid") (OpenCLFixedGenericImagePyramidUseOpenCL "true") (MovingImagePyramid "OpenCLMovingGenericImagePyramid") (OpenCLMovingGenericImagePyramidUseOpenCL "true") (Optimizer "AdaptiveStochasticGradientDescent") (Transform "BSplineTransform") (Metric "AdvancedMattesMutualInformation") (OpenCLDeviceID "1") (OpenCLDeviceType "GPU") (Resampler "OpenCLResampler") (OpenCLResamplerUseOpenCL "true")

Registration can be completed when the parameter is changed to the following. The GPU memory usage maxes at about 0.2GB and the GPU utilization is almost 0% (GPU has an instantaneous 100% utilization only at the end of the registration ). The acceleration effect of GPU doesn't seem obvious.

(FixedImagePyramid "FixedRecursiveImagePyramid") (MovingImagePyramid "MovingRecursiveImagePyramid") (Optimizer "AdaptiveStochasticGradientDescent") (Transform "BSplineTransform") (Metric "AdvancedMattesMutualInformation") (OpenCLDeviceID "1") (OpenCLDeviceType "GPU") (Resampler "OpenCLResampler") (OpenCLResamplerUseOpenCL "true")

Any insights on the issue? thanks!

ZayrX commented 6 years ago

Don't know if it is directly related but I do get these errors:

ERROR: Exception during updating GPU fixed pyramid calculation: itk::ExceptionObject (000000000012E848) Location: "unknown" File: C:\elastix\Common\OpenCL\ITKimprovements\itkGPUDataManager.cxx Line: 218 Description: CL_OUT_OF_RESOURCES

WARNING: The fixed pyramid computation with OpenCL failed due to the error. The OpenCLFixedGenericImagePyramid is switching back to CPU mode. ERROR: Exception during creating GPU input image for moving generic pyramid: itk::ExceptionObject (000000000012ED48) Location: "unknown" File: c:\elastix\common\opencl\itkimprovements\itkGPUImageDataManager.hxx Line: 146 Description: CL_OUT_OF_RESOURCES

WARNING: Unable to configure the GPU. The OpenCLMovingGenericImagePyramid is switching back to CPU mode. Preparation of the image pyramids took: 3093 ms.

Problem was I have two GPUs but one was blocked by another process. Same config as chunlc: (FixedImagePyramid "OpenCLFixedGenericImagePyramid") (OpenCLFixedGenericImagePyramidUseOpenCL "true") (MovingImagePyramid "OpenCLMovingGenericImagePyramid") (OpenCLMovingGenericImagePyramidUseOpenCL "true") (Optimizer "AdaptiveStochasticGradientDescent") (Transform "BSplineTransform") (Metric "AdvancedMattesMutualInformation") (OpenCLDeviceID "1") (OpenCLDeviceType "GPU") (Resampler "OpenCLResampler") (OpenCLResamplerUseOpenCL "true")

If have two GPUs (both Nvidia Titan X). I also tried changing the DeviceID to "0".

One other thing, is that I get a gray pop-up window called "OutputWindow" with another OpenCL error: C:\elastix\Common\OpenCL\ITKimprovements\itkOpenCLC... Error: in function : itk::opencl_context_notify Details: OpenCL error during context creation or runtime: CL_OUT_OF_RESOURCES error executing CL_COMMAND_WRITE_BUFFE...

Unfortunately the window size cuts off the full error message text (screenshot below).

errormessages

chunlc commented 6 years ago

Dear ZayrX ,The problem you encountered is basically the same as mine.Information displayed in the above console and "OutputWindow" windows certainly proved this. I tried to debug this error, unfortunately it happened when the opencl was executed, and I have no experience with opencl programming and debugging. I sincerely hope that everyone have good ideas about this issue.Thanks!

mstaring commented 5 years ago

At the moment we have no resources to work on this issue ourselves.

kaspermarstal commented 5 years ago

But as always we are happy to accept PRs!

HainBuche commented 5 years ago

Hi everyone, any update on this issue? I am encountering the same problem right now, only on a linux Ubuntu 16.04.5 system. Thanks in advance!

mstaring commented 5 years ago

Just to be clear, we are not working on this issue ourselves. We welcome PRs though :-)

vzickus commented 5 years ago

Having the same issue here. I guess this must be to do with OpenCL rather than Elastix? I am also using OPENCL_C_VERSION_1_2: ON , albeit slightly different version of CMake and on ubuntu 18.04. I wonder what would be a good way to test OpenCL by itself. Did you @HainBuche @urlicht @chunlc @ZayrX investigated this further (for those using ubuntu, what is your clinfo output)? If @mstaring @kaspermarstal have any suggestions where to start I would be happy to try and do some testing.

ZayrX commented 5 years ago

I didn't investigate the problem further. I wonder if anyone doesn't have this problem but is using OpenCL? I think it's a problem how OpenCL is used in the elastix code but I am no OpenCL expert...

vzickus commented 5 years ago

@ZayrX good point, would be useful to know who was successful using OpenCL. I guess one option would be to try something with native ITK (my superficial understanding is that the Elastix team based their OpenCL implementation on what was available in ITK already?).

vzickus commented 5 years ago

Upgraded my nvidia drivers to 418.xx from 380.xx but still getting the same issue.

HainBuche commented 5 years ago

Maybe one of the developers could also post all the flags used for ITK and Elastix for the build that worked for them and which programming enviroment etc. they used? Thanks in advance!

dpshamonin commented 5 years ago

I have the same experience with this CL_OUT_OF_RESOURCES error. I tried many times, many hours to understand it and never get close. This error is not reproducible with unit test for the pyramids we are running in elastix (itkGPUGenericMultiResolutionPyramidImageFilterTest.cxx), it works just fine both on CPU and GPU. I also tried to debug elastix with Intel OpenCL on CPU, the error is the same. What does CL_OUT_OF_RESOURCES means on the 64Gb machine running OpenCL on CPU where even 2Gb are not used? I have no idea. Inside the elastix the same principles are used to run the OpenCL pyramid code as in that unit test. The unit test works, but pyramids in elastix does not. On both CPU and GPU that happens without clear explanation. My theory is that OpenCL context is created to early in elaxtix CreateOpenCLContext(), and somehow falls off later. Should be more close to the unit test execution (create-run-close), and not (create-wait-do other staff-run (CL_OUT_OF_RESOURCES)). I hope later Intel OpenCL debugger could give more feedback or there is a way run the OpenCL driver in debug mode. We could try to contact Intel OpenCL engineers to run it and give us some more feedback about this issue. They have great responsive support and from my experience they care about OpenCL. From other hand OpenCL errors are extremely difficult to interpret, they are very often misleading. In addition the NVidia sabotage of the implementation OpenCL standard, debugger, tools and etc. does not make it easy to work with. If you ever find the course of this mystery, keep me posted.

jiangliMED commented 5 years ago

I have the same experience with this CL_OUT_OF_RESOURCES error. I tried many times, many hours to understand it and never get close. This error is not reproducible with unit test for the pyramids we are running in elastix (itkGPUGenericMultiResolutionPyramidImageFilterTest.cxx), it works just fine both on CPU and GPU. I also tried to debug elastix with Intel OpenCL on CPU, the error is the same. What does CL_OUT_OF_RESOURCES means on the 64Gb machine running OpenCL on CPU where even 2Gb are not used? I have no idea. Inside the elastix the same principles are used to run the OpenCL pyramid code as in that unit test. The unit test works, but pyramids in elastix does not. On both CPU and GPU that happens without clear explanation. My theory is that OpenCL context is created to early in elaxtix CreateOpenCLContext(), and somehow falls off later. Should be more close to the unit test execution (create-run-close), and not (create-wait-do other staff-run (CL_OUT_OF_RESOURCES)). I hope later Intel OpenCL debugger could give more feedback or there is a way run the OpenCL driver in debug mode. We could try to contact Intel OpenCL engineers to run it and give us some more feedback about this issue. They have great responsive support and from my experience they care about OpenCL. From other hand OpenCL errors are extremely difficult to interpret, they are very often misleading. In addition the NVidia sabotage of the implementation OpenCL standard, debugger, tools and etc. does not make it easy to work with. If you ever find the course of this mystery, keep me posted.

I also encountered the same CL_OUT_OF_RESOURCES problem. Seems the error was thrown during the conversion from itk::Image<PixelType,Dimension> to itk::GPUImage<PixelType, Dimension>?

kaspermarstal commented 5 years ago

@Li19891208 how did you diagnose that this was the problem? Will be useful to know in order to fix it.

jiangliMED commented 5 years ago

@kaspermarstal I just simply searched where the exception was throw in the source code, and found one in function definition "BeforeGenerateData(void) " implemented in elxOpenCLFixedGenericPyramid.hxx. But i didn't go any deeper.

goldenratio1618 commented 5 years ago

I also have this error (Nvidia GTX 1070) and I dug around a bit more in the code. I think the problem isn't actually due to writing buffer to the GPU. I put a clFinish command (which causes the OpenCL kernel to execute all queued commands) right before and after each instance of writing to the buffer with clEnqueueWriteBuffer. The clFinish commands right after writing to the buffer execute successfully, but one of the clFinish commands right before a clEnqueueWriteBuffer command crashes with a different error message: CL_INVALID_COMMAND_QUEUE. It appears likely that the crash is caused by a bug in the OpenCL kernel itself rather than any commands sent to it, though I have not confirmed this yet.

chunlc commented 4 years ago

Now we really need to find a software and graphics configuration that can run correctly, and use it as a reference to compare versions. Of course, you can also test whether the hardware resources are really insufficient after replaced with an advanced graphics card(I have error (Nvidia GTX 1060), you can try Nvidia GeForce RTX series graphics card as mentioned in the topic "Building with OpenCL # 107".).thank you all.

squll1peter commented 4 years ago

I added clFinish(); after every clEnqueue*() commands in elastix source code to ensure that the queued command is executed on OpenCL device right after queuing, and run the program with cuda-memcheck utility provided by nvidia as suggested in the comment in this thread GPU memory error is thrown afterclEnqueueNDRangeKernel with Kernel CastImageFilter with following output: ========= CUDA-MEMCHECK ========= Invalid global write of size 4 ========= at 0x00000380 in CastImageFilter ========= by thread (3,3,1) in block (4,1,1) ========= Address 0x000079fc is out of bounds ========= Device Frame:CastImageFilter (CastImageFilter : 0x380) ========= Saved host backtrace up to driver entry point at kernel launch time ========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 [0x109f4f] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk12OpenCLKernel12LaunchKernelEv + 0x6aa) [0x45ac780] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk19OpenCLKernelManager12LaunchKernelEmRKNS_10OpenCLSizeES3S3 + 0xdc) [0x45b175a] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk26GPUUnaryFunctorImageFilterINS_5ImageIfLj3EEENS_8GPUImageIfLj3EEENS_7Functor7GPUCastIffEENS_15CastImageFilterIS2_S4_EEE15GPUGenerateDataEv + 0xad3) [0x2ce864f] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk18GPUCastImageFilterINS_5ImageIfLj3EEENS_8GPUImageIfLj3EEEE15GPUGenerateDataEv + 0x28e) [0x2ce7768] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk21GPUImageToImageFilterINS_5ImageIfLj3EEENS_8GPUImageIfLj3EEENS_15CastImageFilterIS2_S4_EEE12GenerateDataEv + 0x3d2) [0x2ce4dfe] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk13ProcessObject16UpdateOutputDataEPNS_10DataObjectE + 0x285) [0x43fbdc5] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk9ImageBaseILj3EE16UpdateOutputDataEv + 0x74) [0x2390f6c] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk37SmoothingRecursiveGaussianImageFilterINS_8GPUImageIfLj3EEES2_E12GenerateDataEv + 0xb43) [0x2731fbd] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk13ProcessObject16UpdateOutputDataEPNS_10DataObjectE + 0x285) [0x43fbdc5] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk9ImageBaseILj3EE16UpdateOutputDataEv + 0x74) [0x2390f6c] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk13ProcessObject16UpdateOutputDataEPNS_10DataObjectE + 0x107) [0x43fbc47] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk9ImageBaseILj3EE16UpdateOutputDataEv + 0x74) [0x2390f6c] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix [0x268bb46] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk40GenericMultiResolutionPyramidImageFilterINS_8GPUImageIfLj3EEES2_fE12GenerateDataEv + 0x14d7) [0x26a6c57] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk13ProcessObject16UpdateOutputDataEPNS_10DataObjectE + 0x285) [0x43fbdc5] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk9ImageBaseILj3EE16UpdateOutputDataEv + 0x74) [0x2390f6c] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN7elastix25OpenCLFixedGenericPyramidINS_15ElastixTemplateIN3itk5ImageIfLj3EEES4_EEE12GenerateDataEv + 0xa47) [0x26a38a1] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk13ProcessObject16UpdateOutputDataEPNS_10DataObjectE + 0x285) [0x43fbdc5] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk9ImageBaseILj3EE16UpdateOutputDataEv + 0x74) [0x2390f6c] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk39MultiResolutionImageRegistrationMethod2INS_5ImageIfLj3EEES2_E15PreparePyramidsEv + 0x961) [0x391d69b] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk39MultiResolutionImageRegistrationMethod2INS_5ImageIfLj3EEES2_E17StartRegistrationEv + 0x6d) [0x391b27f] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk39MultiResolutionImageRegistrationMethod2INS_5ImageIfLj3EEES2_E12GenerateDataEv + 0x25) [0x3928369] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk13ProcessObject16UpdateOutputDataEPNS_10DataObjectE + 0x285) [0x43fbdc5] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN3itk39MultiResolutionImageRegistrationMethod2INS_5ImageIfLj3EEES2_E17StartRegistrationEv + 0x44) [0x391b256] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN7elastix15ElastixTemplateIN3itk5ImageIfLj3EEES3_E3RunEv + 0xa07) [0x23c07bd] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN7elastix11ElastixMain3RunEv + 0x139d) [0x2376099] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (_ZN7elastix11ElastixMain3RunERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_St4lessIS7_ESaISt4pairIKS7_S7_EEE + 0x49) [0x2376b29] ========= Host Frame:/home/squll1/workspace/slicerBuild/MyProjects/elastixSBuild/bin/elastix (main + 0x1302) [0x236b809] ========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]

After digging into the source code, I found that during setting up the arguments of CastImageFilter kernel(implemented in GPUUnaryFunctorImageFilter::GPUGenerateData() ), a nullptr is passed as the pointer to output buffer instead, cuz the output Image buffer is not allocated( with BufferSize=0. The buffer size check is implemented in OpenCLKernelManager::SetKernelArgWithImage).

Digging further into the source code, I believe that the problem is cause by image grafting issue with different Image types.

In ITK, the expected behavior of the last child filter in a composite filter( that is, GPUCastImageFilter as the last child filter of SmoothingRecursiveGaussianImageFilter in our case. SmoothingRecursiveGaussianImageFilter is part of OpenCLFixedGenericPyramid function when enabling OpenCLFixedGenericPyramid in parameter file provided to elastix) is to graft (copy information and allocate output memory if needed) output Image of the parent composite filter, update (i.e. compute), and graft the computed output back to the output of the parent composite filter (see here ). That’s exactly what GPUCastImageFilter do in SmoothingRecursiveGaussianImageFilter. But when GPUCastImageFilter tries to graft output of parent filter in the situation that the type of input image is different from output image (In this case, Image as input type and GPUImage as output type), GPUCastImageFilter will give up allocating output GPU memory (implemented in itkGPUInPlaceImageFilter::AllocateOutput() ), and roll back to CPUSuperclass, which in turns calls inherited itk::ImageSource::AllocateOutput(), and calls GPUImage::Allocate(), which would also give up setting buffer when it realize that this is a grafting procedure(_mGraft=true) , thus leaving an unallocated OutputImage to the kernel.

The image type of GPUCastImageFilter is defined in itk itkSmoothingRecursiveGaussianImageFilter.h with template Rebind from itkImage.h (commited in 2018 Jan, this might explains why openCL no longer works after a certain point of time in 2018) But current opencl code in elastix has no GPU version of this part.

In my case, by adding GPUImage counterpart to the Public part of GPUImage declaration in itkGPUImage.h temporarily solve the problem, elastix no longer throw exception during pyramid preparation.

template <typename UPixelType, unsigned int NUImageDimension = VImageDimension>
  struct Rebind
    {
      using Type = itk::GPUImage<UPixelType, NUImageDimension>;
    };

  template <typename UPixelType, unsigned int NUImageDimension = VImageDimension>
    using RebindImageType = itk::GPUImage<UPixelType, NUImageDimension>;

But openCL issue is not solved yet. While rigid registration works fine, I still encounter error regarding to AdvancedMattesMutualInformationMetric during BSpline registration when OpenCL is enabled.

itk::ExceptionObject (0x55939499ea70) Location: "ElastixTemplate - Run()" File: /usr/local/include/ITK-5.0/itkImageToImageMetric.hxx Line: 319 Description: itk::ERROR: AdvancedMattesMutualInformationMetric(0x5593949360d0): FixedImageRegion does not overlap the fixed image buffered region

Since I'm not familiar with either ITK, elastix and openCL, it is very difficult for me to do further debugging, I'm not even sure if my temporary workaround actually solved CL_OUT_OF_RESOURCES issue. Any help/suggestion/Comment will be useful.

Environment: Hardware: Intel 9980XE, nvidia 2080Ti with nvidia driver version 440, ocl-icd-opencl-dev 2.2.11 CUDA-10.2 itk: v5.0.1 default build options and install elastix:Latest develop branch as of 2020/4/21, built with USE_OPENCL, OPENCL_C_VERSION_1_2, OPENCL_USE_NVIDIA_SDK and USE_ALL_COMPONENTS ON

chunlc commented 4 years ago

@squll1peter You're great. I have carefully read the content of your analysis above, and I think the way you modified is correct. So, I tried to modify itkGPUimage.h in elastix as you did, but unfortunately the error still exists during rigid registration. I seriously suspect that it was caused by my incomplete or incomplete modification. Could you please send me the modified files or show it? I look forward to your reply, thank you very much!

Environment: Hardware: nvidia GTX1060 with nvidia driver version 388, CUDA8.0 itk: v5.0.1 default build options and install elastix:Latest develop branch , built with USE_OPENCL, OPENCL_C_VERSION_1_2, OPENCL_USE_NVIDIA_SDK and USE_ALL_COMPONENTS ON

I have modified and recompiled elastix, modified as follows: 1

squll1peter commented 4 years ago

I've forked a branch here, modified it and recompiled it against a freshly complied and installed ITK v5.0.1 to make sure that changes only applies to elastix source code.

Before go deeper into the problem you encountered, I would like to address one possible solution to AdvancedMattesMutualInformationMetric problem I encountered in my post, cuz I think it might be related. After debugging, the problem is likely caused by another issue during grafting. FixedImageRegion does not overlap the fixed image buffered region is thrown when buffered region of an Image is smaller than requested region(implemented in itkImageToImageMetric.hxx ) . It is thrown during initializing Metrics for second level of resolution(resolution 1, first level is resolution 0) in the pyramid, not in the first level. In a pyramidFilter with n resolution, it will have n _outputImage_s, each corresponds to a resolution level in the pyramid. I printed out the _outputImage_s of OpenCLFixedPyramidFilter (by calling this->m_FixedImagePyramid->GetOutput(n)->PrintSelf() in itkMultiResolutionImageRegistrationMethod2::Initialize() ), and found out that only the first level output image is buffered (i.e. with non-zero BufferedRegion). As mentioned in last post, the output of a composite filter is grafted from the last child filter. In this case, elxOpenCLFixedGenericPyramid grafts its output images from child m_GPUPyramid in here. But the called OpenCLFixedGenericPyramid::GraftOutput() is inherited from ImageSource,and it only grafts the first output image when the objet has multiple output images ( strange, as I expect a function without specifying index would graft all outputs). The same issue exists in the code of elxOpenCLMovingGenericPyramid.hxx also.

I then replace this->GraftOutput( this->m_GPUPyramid->GetOutput() ); with a recursive version that goes through all outputs.

for(i=0 ; i< this->GetNumberOfLevels();i++){
      this->GraftNthOutput(i, this->m_GPUPyramid->GetOutput(i) );
}

in the elxOpenCLMovingGenericPyramid.hxx and elxOpenCLFixedGenericPyramid , and the registration could be completed on my machine without throwing error, with correct output(but it i only tested with very limited data). I've added above workarounds to my branch, you can review it as well.

Back to the problem that you encountered, I think one of the following points might be the cause:

  1. You're using a multiple resolution rigid registration, thus after the first resolution level, the image is not buffered properly, and an error is thrown when trying initiate ImageToImageMetric.
  2. You've run out of memory of GPU. It's odd, but the performance of current working OpenCL accelerated elastix code is not as good as I originally imagined: A. About 30-40% slower than CPU on my machine(though I'm having a 18 core CPU and elastix is very good at multithreading) B. Eats up a lot of GPU RAM. Peak GPU RAM usage is 6.7 GB when performing 4 resolution level BSpline registration between two 646464 isometric CT volumes. I'm not sure if I broke any memory management mechanism in my workaround.
  3. Different OpenCL behavior on different OS/CUDA/driver version.

I'm looking forward to hear good news from you! 圖片 圖片

Environment: OS: Ubuntu 18.04 Hardware: Intel 9980XE, nvidia 2080Ti with nvidia driver version 440, ocl-icd-opencl-dev 2.2.11 CUDA-10.2 itk: v5.0.1 default build options and install elastix:Latest develop branch as of 2020/4/21, built with USE_OPENCL, OPENCL_C_VERSION_1_2, OPENCL_USE_NVIDIA_SDK and USE_ALL_COMPONENTS ON

chunlc commented 4 years ago

@squll1peter Thank you for your professional and comprehensive analysis, it has benefited me a lot. Thank you sincerely and congratulations that you have solved the problem.

The reason why the error I mentioned last time did not disappear I have found out: the input image pixel type was float but was incorrectly set to short in the parameter file (I was too careless to find it in time).

Now, I have been able to run the 3D rigid registration without errors, but I did encounter a situation where the registration result is incorrect. This is a new problem: 1

As shown above, see the red marks: To facilitate the presentation of the problem, the fixed and moving images take the same image. In the registration process, the Metric value was getting worse; the accumulation of StepSize value was getting bigger and bigger. The final output of the registration result was undoubtedly wrong.

I have done some experiments and have determined the following:

  1. Because Elastix5.0 also has this problem when OpenCL-related modules are not used, it can be determined that it has nothing to do with your modified content..
  2. The registration result of Elastix4.8 is normal. This problem did not appear until Elastix4.9.

The parameter file and images used in my test are as follows, parameter file: parameters_Rigid3D.txt

image: Elastix5.0 \Testing\Baselines\3DCT_lung.mha

This problem can be easily reproduced,have you ever encountered this problem, thanks in advance!

squll1peter commented 4 years ago

I'm afraid that I cannot fully reproduce your problem, the parameter file your provided is a CPU version rigid registration, and I'm not sure what kind of modification you've done to make it an OpenCL version parameter file. I can only make a OpenCL version parameter file out of the one you provided by manually change to OpenCL pyramids and resampler. I'm not familiar with elastix, and thus I'm not familiar with the meaning of Metrics (I'll be good if you can provide some information), but I agree that seeing an "changing" trend in Metrics when registration between two identical volume is odd, and I'm observing an similar trend in both CPU and OpenCL version of implementation: (CPU, FixedShrinkingImagePyramid) 圖片

(GPU, OpenCLFixedGenericImagePyramid) 圖片

On the other hand, after a recent research trying to fix GPU memory usage issue(described below), I'm able to run a multi-resolution BSpline registration between two 512x512x400 CT volumes on my machine with peak usage of 8GB of VRAM, and the result seemed satisfactory: 圖片

(Subtracted Result shown in 3D-Slicer Up: before registrataion Down: after registration)

I use the parameter files provided in SlicerElastix extention of 3D slicer Its a two step registration combining Rigid and BSpline transformation, with following modification in both parameter files. (Resampler "OpenCLResampler") (FixedImagePyramid "OpenCLFixedGenericImagePyramid") (MovingImagePyramid "OpenCLMovingGenericImagePyramid")

Maybe you can try it out.

Continued on the issue of large GPU ram usage, I added commands to print out GPU memory usage (In my case, std::system("nvidia-smi); ) between each GPU filters, to inspect which filter occupies most of the GPU RAM. I found that :

  1. GPURecursiveGaussianImageFilter (the smoother filter of pyramid production, described in detail in the the citing journal article of elastix OpenCL implementation) occupies a lot of the memory( About 3~5GB during pyramid generation) after clEnqueueNDRange (kernel execution).
  2. When provided with more than one parameter file, memory is not released after registration is done with prior parameter file.

Regarding to first issue, I looked into the OpenCl C file, and noticed that the variable that uses most memory in this kernel are three float type array[BUFFSIZE]. BUFFSIZE is determined in itkGPURecursiveGaussianImageFilter.hxx as "One-third of local memory size", by doing so, these three arrays occupies all local memory (16 KB in original implementation, allows largest 1365 float pixels along calculating axis. ). But with modern GPU card, taking 2080 Ti for example, the local memory size is larger with 48 KB, resulted in an overkill size of array capable of storing 4096 pixels along computing axis. Although 48KB per kernel sums up to only about 200MB (48KB64CUDA cores/SM 68SMs/card) if all computing unit is occupied, it seems that the driver would require extra memory to run the kernel and context(see this thread and here for suggested tuning tips), up to several times of explicitly declared memory. In my observation, the memory required is positively related to the allocated size of local memory. Since 3D medical images that I encountered rarely has more than 512 pixel on one dimension, I've added a limit to BUFFSIZE, and video RAM usage had then dropped to a reasonable range that I can register most CT volumes in 3D slicer on my machine. Speed is still below expectation, though, but I'm happy that it beats my CPU in BSpline transformation 圖片 (Greyed steps are OpenCL accelerate steps) The actual GPU Kernel execution time is very fast: (shown by enabling OPENCL_PROFILING during cmake) 圖片 So I'm wondering if most of time time is consumed in memory transfer and itk kernel codes that join filters altogether.

Regarding to issue 2, I haven't got time to work on it.

urlicht commented 4 years ago

Wow I honestly thought this wasn't going to be resolved! Thanks everyone for the contributions 👍

dennis000-wq commented 4 years ago

I have no idea. I think this requires debugging.

How do we go about that i am also having the same issue on my desktop.....

mstaring commented 4 years ago

https://github.com/SuperElastix/elastix/blob/e868fcc45703ab110722c6498b30248335ad5b3c/Common/OpenCL/Filters/itkGPURecursiveGaussianImageFilter.hxx#L51

@dpshamonin Can we modify the code like this:

this->m_DeviceLocalMemorySize = (localMemSize / 3) / sizeof(float); --> this->m_DeviceLocalMemorySize = std::min( (localMemSize / 3) / sizeof(float), 1024 );

And add a check somewhere with a proper error message in case a larger image than 1024 in any dimension is used?

jakob1379 commented 3 years ago

I have come by this thread as I have encountered the same problems. I beg for the developers who made the testing with OpenCL to publish their functioning environment (version and CMake parameters for OpenCL, ITK and Elastix) as this will be able to help us understand the difference between then and now.

1989HD commented 3 years ago

Dear all,

Just as @jakob1379 I am wondering if there are currently any (and if so, which) versions of ITK & Elastix that allow the use of OpenCL Pyramids and Resampler?

I've been going back to multiple earlier versions and setting multiple different cmake flags, but up to now I can either not build the solution, or the build is successful but I'm not able to use the pyramids or resampler due to _'out_ofresources' of 'not installed' issues (even though the cmake flag was definitely ON).

Any help or pointers would be highly appreciated. Thanks in advance!

Hans

ntatsisk commented 2 years ago

Thanks everyone for the comments in this issue, and special thanks to @urlicht for bringing this up and, of course, to @squll1peter for providing the original solution for the OpenCL pyramids! With the help of @N-Dekker and @dpshamonin, we have merged the two recent PRs (#734 & #741) in the main branch that fix the OpenCL pyramids and resampler respectively.

Is it working for everyone now? @urlicht, @chunlc @ZayrX @vzickus @HainBuche @jiangliMED @dennis000-wq @jakob1379 @1989HD

urlicht commented 2 years ago

Thanks @ntatsisk and everyone for the updates! I just built the new code (main) and tried it out. However, I saw no performance improvement compared to the CPU run. Just replied with more details own here https://github.com/SuperElastix/elastix/issues/226#issuecomment-1322518523