SuperElastix / elastix

Official elastix repository
http://elastix.dev
Apache License 2.0

No speedup from OpenCL #226

Open hbraunDSP opened 4 years ago

hbraunDSP commented 4 years ago

I recently successfully compiled Elastix 5.0.0 with GPU support, but it doesn't appear to speed up anything relative to the CPU version. I can use the (Resampler "OpenCLResampler") option but there is no improvement in speed. The GPU maxes out at 2% utilization.

Also, the (FixedImagePyramid "OpenCLFixedGenericImagePyramid") and (MovingImagePyramid "OpenCLMovingGenericImagePyramid") options give me CL_OUT_OF_RESOURCES errors if I try to use them. They also cause the OpenCLResampler to fail with the same error if they are used.

My setup:
OS: CentOS Linux release 7.7.1908 (Core)
GPU: Nvidia GeForce GTX 1080 Ti
Cuda version: 10.2

Relevant compile settings from CMakeCache.txt:

ELASTIX_USE_OPENCL:BOOL=ON
USE_OpenCLFixedGenericPyramid:BOOL=ON
USE_OpenCLMovingGenericPyramid:BOOL=ON
USE_OpenCLResampler:BOOL=ON

I would expect GPU support to significantly reduce execution time. What can I do to speed up Elastix? Also, why are the pyramids failing to run on GPU?

NHPatterson commented 4 years ago

The pyramids likely fail because you need more VRAM on the GPU. You may be able to get them to fit in memory if you change FixedInternalImagePixelType and MovingInternalImagePixelType, which may default to float in the elastix parameter file. The pyramids require a lot of memory for large images and many registration resolutions, and the requirement grows with high bit depths such as float.
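To put rough numbers on this, here is a minimal Python sketch that estimates the size of the output buffers of one pyramid for a given pixel type. It assumes that every level is downsampled by its schedule factor along each axis, and it does not count smoothing scratch buffers or OpenCL runtime overhead, so treat the result as a lower bound rather than what elastix actually allocates. The image size and schedule are the ones reported later in this thread; `pyramid_bytes` is a hypothetical helper, not an elastix function.

```python
from math import ceil

# Bytes per voxel for the internal pixel types discussed in this thread.
BYTES_PER_PIXEL = {"short": 2, "float": 4, "double": 8}


def pyramid_bytes(size, schedule, pixel_type):
    """Return the buffer size in bytes of each pyramid level.

    Assumption: a level with schedule factors (fx, fy, fz) holds
    ceil(nx/fx) * ceil(ny/fy) * ceil(nz/fz) voxels.
    """
    bytes_per_pixel = BYTES_PER_PIXEL[pixel_type]
    levels = []
    for factors in schedule:
        voxels = 1
        for n, f in zip(size, factors):
            voxels *= ceil(n / f)
        levels.append(voxels * bytes_per_pixel)
    return levels


size = (640, 640, 225)  # image size reported later in this thread
schedule = [(8, 8, 8), (4, 4, 4), (2, 2, 2), (1, 1, 1)]

for pixel_type in ("short", "float"):
    total = sum(pyramid_bytes(size, schedule, pixel_type))
    print(f"{pixel_type}: {total / 2**20:.0f} MiB for one pyramid")
```

Under these assumptions, switching from float to short halves the estimate, since only the bytes per voxel change.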

As for the speed issues, I have no insight.

MiHess commented 4 years ago

I have the same problem. I have compiled Elastix 5.0.0 with GPU support but registration does not appear to be faster compared to the CPU version.

Since I have read somewhere that some people are experiencing performance gains, I wonder if it has to do with specifics of the parameter files, i.e. specific combinations of ImageSampler, ResampleInterpolator, etc.

One thing I notice is that in both cases, (1) (Resampler "DefaultResampler") and (2) (Resampler "OpenCLResampler") with (OpenCLResamplerUseOpenCL "true"), only 137 MB are allocated on the GPU. This seems odd to me: for (1) I would expect no memory to be allocated at all, and for (2) it is certainly not enough, which suggests that the computation is not actually happening on the GPU (and might explain the lack of performance gain).

Setup info:
GPU: Nvidia GeForce GTX 1080 Ti
OS: Linux 4.4.0-116-generic (x64) with 128828 MB memory, and 16 cores @ 1200 MHz.
Cuda version: 10.1

Image sizes of fixed and moving are 640 640 225

The following is the config I am using for an affine transformation, but I also don't observe any gains for more expensive calculations. (Also ignoring the OpenCL pyramid implementations for now, since they cause the CL_OUT_OF_RESOURCES errors):

// Input images.
(FixedInternalImagePixelType "short")
(FixedImageDimension 3)

(MovingInternalImagePixelType "short")
(MovingImageDimension 3)

(OpenCLDeviceID "3")
(OpenCLDeviceType "GPU")

// Image sampler.
//(ImageSampler "RandomSparseMask")
(ImageSampler "Random")

(ErodeMask "false")
(WriteResultImage "false")
(AutomaticTransformInitialization "true")
(DefaultPixelValue 0)

(HowToCombineTransforms "Compose")

// Components.
(Registration "MultiResolutionRegistration")

(FixedImagePyramid "FixedRecursiveImagePyramid")
//(FixedImagePyramid "OpenCLFixedGenericImagePyramid")
//(OpenCLFixedGenericImagePyramidUseOpenCL "true")

(MovingImagePyramid "MovingRecursiveImagePyramid")
//(MovingImagePyramid "OpenCLMovingGenericImagePyramid")
//(OpenCLMovingGenericImagePyramidUseOpenCL "true")

(Interpolator "BSplineInterpolator")
(Metric "AdvancedMattesMutualInformation")
(Optimizer "StandardGradientDescent")
(ResampleInterpolator "FinalBSplineInterpolator")

//(Resampler "DefaultResampler")
(Resampler "OpenCLResampler")
(OpenCLResamplerUseOpenCL "true")

// Metric parameters.
(NumberOfSpatialSamples 4096)
(NumberOfHistogramBins 32)

// Kind of transform.
(Transform "AffineTransform")

(NumberOfResolutions 4)
(MaximumNumberOfIterations 256)
(MaximumNumberOfSamplingAttempts 10)
(NewSamplesEveryIteration "true")
(UseAllPixels "false")

(BSplineInterpolationOrder 1)
(FinalBSplineInterpolationOrder 1)

(SP_a 500.0)
(SP_A 50.0)
(SP_alpha 0.602)

(AutomaticScalesEstimation "true")

(ImagePyramidSchedule   8 8 8   4 4 4   2 2 2   1 1 1)  // XYZ per resolution.

Any pointers would be much appreciated!

dyliu2016 commented 3 years ago

Hi, do you know how to release GPU memory during debugging?

ntatsisk commented 2 years ago

Hi @hbraunDSP, @MiHess, @dyliu2016, thanks for the discussion here! The CL_OUT_OF_RESOURCES error (issue #70) was fixed by PR #734, and a bug related to the OpenCL resampler was fixed by PR #741. @dpshamonin ran some extensive benchmarks for the resampler, and we saw a considerable speed improvement: it was orders of magnitude faster, especially for larger images! Could you re-run your code using the latest main branch and let us know if you still observe the same behavior?

urlicht commented 1 year ago

Hi @ntatsisk and @dpshamonin: thanks for all the bug fixes/updates. I just built the latest main and no longer get the CL_OUT_OF_RESOURCES error. However, I see no speed up when using the GPU compared to the CPU version.

It seems like it's correctly using the GPU since I get the following lines in the log: e.g.

  Fixed pyramid was computed by NVIDIA RTX A4000 from NVIDIA Corporation.
  Moving pyramid was computed by NVIDIA RTX A4000 from NVIDIA Corporation.
  Preparation of the image pyramids took: 6 ms.

I see that the GPU memory allocation goes up to about 200 MB, but the GPU utilization stays at 0% throughout the registration. The total time it takes is about the same as on the CPU, and there's no error in the log file.
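When comparing CPU and GPU runs, it can help to pull these lines out of elastix.log automatically. The following Python sketch extracts which device computed each pyramid and the reported preparation time. The regular expressions are assumptions based only on the log lines quoted in this thread, and `summarize_pyramid_log` is a hypothetical helper; other elastix versions may word these lines differently.

```python
import re

# Patterns modelled on the log lines quoted in this thread (an assumption,
# not a documented elastix log format).
DEVICE_RE = re.compile(r"(Fixed|Moving) pyramid was computed by (.+?) from ")
TIME_RE = re.compile(r"Preparation of the image pyramids took: (\d+) ms")


def summarize_pyramid_log(log_text):
    """Return ({'fixed': device, 'moving': device}, preparation time in ms).

    Devices are only listed for pyramids the log says were computed on a
    named device; the time is None if the line is missing.
    """
    devices = {
        which.lower(): name.strip()
        for which, name in DEVICE_RE.findall(log_text)
    }
    match = TIME_RE.search(log_text)
    prep_ms = int(match.group(1)) if match else None
    return devices, prep_ms


example = (
    "Fixed pyramid was computed by NVIDIA RTX A4000 from NVIDIA Corporation.\n"
    "Moving pyramid was computed by NVIDIA RTX A4000 from NVIDIA Corporation.\n"
    "Preparation of the image pyramids took: 6 ms.\n"
)
print(summarize_pyramid_log(example))
```

Note that this only covers pyramid preparation; the total registration time has to be compared separately.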

These are the modifications I made to the parameter file to use the GPU:

(OpenCLDeviceID "1")
(OpenCLDeviceType "GPU")
(Resampler "OpenCLResampler")
(OpenCLResamplerUseOpenCL "true")
(FixedImagePyramid "OpenCLFixedGenericImagePyramid")
(OpenCLFixedGenericImagePyramidUseOpenCL "true")
(MovingImagePyramid "OpenCLMovingGenericImagePyramid")
(OpenCLMovingGenericImagePyramidUseOpenCL "true")
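For A/B testing against the CPU components, the same overrides can be applied to an existing parameter file with a small script instead of editing by hand. This is a minimal Python sketch: `opencl_overrides` and `apply_overrides` are hypothetical helpers, the parameter names are the ones listed above, and the device ID is machine-specific. It replaces active `(Key value)` lines, leaves `//` comment lines untouched, and appends any key that was not already present.

```python
import re


def opencl_overrides(device_id):
    """Parameter overrides as used in this thread; device_id is machine-specific."""
    return {
        "OpenCLDeviceID": f'"{device_id}"',
        "OpenCLDeviceType": '"GPU"',
        "Resampler": '"OpenCLResampler"',
        "OpenCLResamplerUseOpenCL": '"true"',
        "FixedImagePyramid": '"OpenCLFixedGenericImagePyramid"',
        "OpenCLFixedGenericImagePyramidUseOpenCL": '"true"',
        "MovingImagePyramid": '"OpenCLMovingGenericImagePyramid"',
        "OpenCLMovingGenericImagePyramidUseOpenCL": '"true"',
    }


def apply_overrides(param_text, overrides):
    """Replace or append (Key value) entries in elastix parameter-file text."""
    remaining = dict(overrides)
    out = []
    for line in param_text.splitlines():
        # Only active entries start with "("; commented lines start with "//".
        match = re.match(r"\s*\((\w+)\s", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            out.append(f"({key} {remaining.pop(key)})")
        else:
            out.append(line)
    # Keys that were not present in the file are appended at the end.
    out.extend(f"({key} {value})" for key, value in remaining.items())
    return "\n".join(out) + "\n"


base = (
    '(Resampler "DefaultResampler")\n'
    '//(Resampler "OpenCLResampler")\n'
    "(NumberOfResolutions 4)\n"
)
print(apply_overrides(base, opencl_overrides("0")))
```

Writing the result to a second file keeps the CPU and OpenCL variants side by side, so both can be run on the same images.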

Did I miss anything?

Also @ntatsisk or @dpshamonin: you mentioned getting multiple orders of magnitude speed improvement. Would you be able to share a test case (e.g. fixed img, moving img, parameter file) so that we could test and benchmark the GPU/OpenCL functionalities of elastix?

UPDATE: It turns out I still get the CL_OUT_OF_RESOURCES error if I set the pyramid schedule differently. The maximum allocation only goes up to about 3 GB, although we have a 16 GB GPU. I also tried it on a machine with much more GPU memory (48 GB), and it did not work either.

The build is straight from the GitHub repo main branch, so I'm not sure what's going on. Getting some example/test case (images and parameters) would be tremendously helpful!

error: in function: opencl_context_notify
Details: OpenCL error during context creation or runtime:
CL_OUT_OF_RESOURCES error executing CL_COMMAND_WRITE_BUFFER on NVIDIA RTX A4000 (Device 0).

MiHess commented 1 year ago

First of all thanks for all the work on this @ntatsisk and @dpshamonin.

I have built the latest version and did some experimenting, but unfortunately I am still getting the CL_OUT_OF_RESOURCES errors. Just like @urlicht described, also for me the GPU memory allocation goes up to about 3GB (of 12GB total).

I also tried with images and a modified parameter file from the ITKElastix examples to facilitate reproducing/tracking down the error. Source: https://github.com/InsightSoftwareConsortium/ITKElastix/tree/main/examples/data

From the elastix.log file:

which elastix:   elastix
  elastix version: 5.0.1
  Git revision SHA: b581b9242157cabc3e029bd9eeeef987479ed195
  Git revision date: Thu Dec 22 21:27:19 2022 +0100
  Build date: Dec 29 2022 18:49:16
  Compiler: GCC version 11.3.0
  Memory address size: 64-bit
  CMake version: 3.25.0-rc1
  ITK version: 5.3.0
ELASTIX version: 5.0.1
Command line options from ElastixBase:
-f        data/CT_3D_lung_fixed.mha
-m        data/CT_3D_lung_moving.mha
-fMask    data/CT_3D_lung_fixed_mask.mha
-mMask    data/CT_3D_lung_moving_mask.mha
-out      itkelastix_example/
-p        data/registration/parameters.3D.NC.affine.ASGD.001.txt
-threads  unspecified, so all available threads are used
Command line options from TransformBase:
-t0       unspecified, so no initial transform used

The only changes made to parameters.3D.NC.affine.ASGD.001.txt are the following:

(OpenCLDeviceID "0")
(OpenCLDeviceType "GPU")

(FixedImagePyramid "OpenCLFixedGenericImagePyramid")
(OpenCLFixedGenericImagePyramidUseOpenCL "true")

(MovingImagePyramid "OpenCLMovingGenericImagePyramid")
(OpenCLMovingGenericImagePyramidUseOpenCL "true")

(Resampler "OpenCLResampler")
(OpenCLResamplerUseOpenCL "true")

//(FixedImagePyramid "FixedRecursiveImagePyramid")
//(MovingImagePyramid "MovingRecursiveImagePyramid")
//(Resampler "DefaultResampler")

And this results in the following errors:

ERROR: Exception during updating GPU fixed pyramid calculation:
itk::ExceptionObject (0x557312702d00)
Location: "unknown"
File: /home/mirco/elastix/elastix/Common/OpenCL/ITKimprovements/itkGPUDataManager.cxx
Line: 240
Description: CL_OUT_OF_RESOURCES

WARNING: The fixed pyramid computation with OpenCL failed due to the error.
  The OpenCLFixedGenericImagePyramid is switching back to CPU mode.

ERROR: Exception during creating GPU input image for moving generic pyramid:
itk::ExceptionObject (0x557312a87b40)
Location: "unknown"
File: /home/mirco/elastix/elastix/Common/OpenCL/ITKimprovements/itkGPUImageDataManager.hxx
Line: 167
Description: CL_OUT_OF_RESOURCES

WARNING: Unable to configure the GPU.
  The OpenCLMovingGenericImagePyramid is switching back to CPU mode.
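Since the log shows elastix carrying on in CPU mode after these warnings, a run can finish without the GPU having been used. A minimal Python sketch for flagging this in a log file follows. The patterns are assumptions taken from the messages quoted above, and `summarize_opencl_failures` is a hypothetical helper, not part of elastix.

```python
import re
from collections import Counter

# Patterns modelled on the warnings and errors quoted in this thread.
FALLBACK_RE = re.compile(r"The (\w+) is switching back to CPU mode")
ERROR_RE = re.compile(r"Description: (CL_\w+)")


def summarize_opencl_failures(log_text):
    """Return (components that fell back to CPU, Counter of CL_* error codes)."""
    fallbacks = sorted(set(FALLBACK_RE.findall(log_text)))
    errors = Counter(ERROR_RE.findall(log_text))
    return fallbacks, errors


example = """\
ERROR: Exception during updating GPU fixed pyramid calculation:
Description: CL_OUT_OF_RESOURCES

WARNING: The fixed pyramid computation with OpenCL failed due to the error.
  The OpenCLFixedGenericImagePyramid is switching back to CPU mode.

ERROR: Exception during creating GPU input image for moving generic pyramid:
Description: CL_OUT_OF_RESOURCES

WARNING: Unable to configure the GPU.
  The OpenCLMovingGenericImagePyramid is switching back to CPU mode.
"""
fallbacks, errors = summarize_opencl_failures(example)
print("Fell back to CPU:", fallbacks)
print("OpenCL errors:", dict(errors))
```

An empty fallback list does not prove the GPU did useful work; it only means none of these specific warnings were found.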

I noticed that there are additional errors shown in the console output:

/home/mirco/elastix/elastix/Common/OpenCL/ITKimprovements/itkOpenCLContext.cxx(165): itkOpenCL generic error.
Error: in function: opencl_context_notify
Details: OpenCL error during context creation or runtime:
CL_OUT_OF_RESOURCES error executing CL_COMMAND_WRITE_BUFFER on NVIDIA GeForce GTX 1080 Ti (Device 0).

/home/mirco/elastix/elastix/Common/OpenCL/ITKimprovements/itkOpenCLContext.cxx(165): itkOpenCL generic error.
Error: in function: opencl_context_notify
Details: OpenCL error during context creation or runtime:
Unknown error executing clFlush on NVIDIA GeForce GTX 1080 Ti (Device 0).

ERROR: Exception during updating GPU fixed pyramid calculation:
itk::ExceptionObject (0x557312702d00)
Location: "unknown"
File: /home/mirco/elastix/elastix/Common/OpenCL/ITKimprovements/itkGPUDataManager.cxx
Line: 240
Description: CL_OUT_OF_RESOURCES

WARNING: The fixed pyramid computation with OpenCL failed due to the error.
  The OpenCLFixedGenericImagePyramid is switching back to CPU mode.
/home/mirco/elastix/elastix/Common/OpenCL/ITKimprovements/itkOpenCLContext.cxx(165): itkOpenCL generic error.
Error: in function: opencl_context_notify
Details: OpenCL error during context creation or runtime:
CL_OUT_OF_RESOURCES error executing CL_COMMAND_WRITE_BUFFER on NVIDIA GeForce GTX 1080 Ti (Device 0).

ERROR: Exception during creating GPU input image for moving generic pyramid:
itk::ExceptionObject (0x557312a87b40)
Location: "unknown"
File: /home/mirco/elastix/elastix/Common/OpenCL/ITKimprovements/itkGPUImageDataManager.hxx
/home/mirco/elastix/elastix/Common/OpenCL/ITKimprovements/itkOpenCLContext.cxx(165): itkOpenCL generic error.
Error: in function: opencl_context_notify
Details: OpenCL error during context creation or runtime:
Unknown error executing clFlush on NVIDIA GeForce GTX 1080 Ti (Device 0).

Line: 167
Description: CL_OUT_OF_RESOURCES

WARNING: Unable to configure the GPU.
  The OpenCLMovingGenericImagePyramid is switching back to CPU mode.
Preparation of the image pyramids took: 128 ms.

@ntatsisk and @dpshamonin, do you see where it might go wrong? Are you not getting those errors anymore on the example above? Or could you please share a set of images and parameter file for further testing?

Thanks in advance!

ntatsisk commented 1 year ago

Apologies for the late reply. I tried to reproduce the error but couldn't. I used the setup that @MiHess shared, with the data from https://github.com/InsightSoftwareConsortium/ITKElastix/tree/main/examples/data ("CT_3D_lung"), and I attach the exact parameter file. It is the same as @MiHess's, except that I changed to (WriteResultImage "true") so that the resampler is also triggered. I tried it on both a Windows and an Ubuntu machine, and I attach the corresponding logs. The logs are from the executables, but I also tested the library versions and again got no error.

Here are the files: parameters.3D.NC.affine.ASGD.001_OpenCL.txt, log_executable_windows.log, log_executable_ubuntu.log

Fixed pyramid was computed by NVIDIA GeForce RTX 3090 from NVIDIA Corporation .
Moving pyramid was computed by NVIDIA GeForce RTX 3090 from NVIDIA Corporation .
Preparation of the image pyramids took: 334 ms.

As you can see the GPU was used normally.

The (windows) setup:

which elastix:   C:\Users\kntatsis\work\opencl-error-test\elastix.exe
  elastix version: 5.1.0
  Git revision SHA: d652938573e5f193955908eba225a854b31ce36a
  Git revision date: Thu Jan 12 14:20:18 2023 +0100
  Build date: Feb 14 2023 14:34:23
  Compiler: Visual C++ version 193331630.0
  Memory address size: 64-bit
  CMake version: 3.25.0-rc1
  ITK version: 5.3.0

Note that I am using elastix version 5.1.0, which was released recently. It shouldn't be different from the commit that you used, but it is easier to reference.

@urlicht Can you share the exact pyramid setup that triggered the error in your case?

Looking forward to your replies. I will try to be more responsive so that we get to solve this issue after all this time ;)