InsightSoftwareConsortium / ITKCLEsperanto

ITK filters accelerated with OpenCL via [clEsperanto](https://clesperanto.github.io/).
Apache License 2.0

OpenCL crashes for big images and high execution times per kernel #9

Open IbaiGenuaGamio opened 4 years ago

IbaiGenuaGamio commented 4 years ago

Description

OpenCL execution crashes when a single kernel performs a very large amount of computation. If a kernel contains a triple nested loop and each loop iterates over a large number of values, the execution time rises sharply, and the kernel can even crash if a watchdog timer for the graphics card is active (this is my hypothesis). This graphics card limitation is not an ITK bug in itself, but I think ITK should warn about possible crashes for big images and long execution times per kernel.

This bug was tested with a synthetic image and the itkGPUMeanImageFilter class. For 3D images, this class has 3 nested loops per kernel. For each pixel, the surrounding pixel values are read and their mean is calculated. The surrounding pixels are determined by the specified radius: the higher the radius, the more pixels are read for each mean. High radius values therefore lead to many iterations in each of the three nested loops.

If a high radius value is combined with a large image buffer (big image), the graphics card crashes and OpenCL cannot be accessed until a new context is created (stop execution and rerun). The error and exception thrown by OpenCL is CL_INVALID_COMMAND_QUEUE. After this error is raised, any call to OpenCL leads to a CL_OUT_OF_RESOURCES error. This error is probably caused by a watchdog timer that stops the kernel because it runs for too long. On computers where this watchdog is not active, the execution finishes successfully, but only after a very long time. Tests with dummy calculations on computers with the watchdog timer deactivated gave these execution times:

  1. 3 nested loops with 30 iterations per loop: 0.4461s.
  2. 3 nested loops with 32 iterations per loop: 0.4482s.
  3. 3 nested loops with 34 iterations per loop: 257.08s.
  4. 3 nested loops with 36 iterations per loop: 307.92s.

Computers with the watchdog timer activated crash the program before reaching such long execution times.
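The jump between items 2 and 3 is much larger than the loop counts alone would predict. The following is a minimal sketch, assuming the cost is proportional to the total number of inner-loop iterations (iterations per loop cubed); the function names are illustrative and not part of ITK. Under that assumption, going from 32 to 34 iterations per loop should cost about 1.2 times more, whereas the measured times differ by a factor of roughly 570. This suggests a threshold effect rather than smooth growth.

```cpp
#include <cassert>

// Expected cost ratio when each of the three nested loops grows from
// `fromIterations` to `toIterations`, assuming the cost is proportional
// to the total number of inner-loop iterations (iterations cubed).
double
ExpectedCubicRatio(double fromIterations, double toIterations)
{
  assert(fromIterations > 0.0 && toIterations > 0.0);
  const double ratio = toIterations / fromIterations;
  return ratio * ratio * ratio;
}

// Ratio of two measured execution times, e.g. 257.08 s over 0.4482 s
// for items 3 and 2 of the list above.
double
MeasuredRatio(double fromSeconds, double toSeconds)
{
  assert(fromSeconds > 0.0);
  return toSeconds / fromSeconds;
}
```

Comparing `ExpectedCubicRatio(32, 34)` with `MeasuredRatio(0.4482, 257.08)` reproduces the two factors quoted above.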

Steps to Reproduce

To reproduce this, create a big image and apply itkGPUMeanImageFilter with a high radius value. The code posted below works for radius values less than or equal to four, but it crashes (or takes far too long) for values greater than or equal to five.

#include <iostream>

#include <itkGPUImage.h>
#include <itkGPUMeanImageFilter.h>
#include <itkImageRegionIteratorWithIndex.h>
#include <itkTimeProbe.h>

using PixelType = float;
using ImageType = itk::GPUImage<PixelType, 3>;
using MeanFilterType = itk::GPUMeanImageFilter<ImageType, ImageType>;

int
main()
{
  // Create synthetic image
  typename ImageType::RegionType imageRegion;

  typename ImageType::IndexType imageIndex;
  imageIndex[0] = 0;
  imageIndex[1] = 0;
  imageIndex[2] = 0;
  typename ImageType::SizeType imageSize;
  imageSize[0] = 500;
  imageSize[1] = 500;
  imageSize[2] = 200;

  imageRegion.SetIndex(imageIndex);
  imageRegion.SetSize(imageSize);

  typename ImageType::Pointer volume = ImageType::New();
  volume->SetRegions(imageRegion);
  volume->Allocate();

  using VolumeIteratorType = itk::ImageRegionIteratorWithIndex<ImageType>;
  VolumeIteratorType it(volume, imageRegion);

  for (it.GoToBegin(); !it.IsAtEnd(); ++it)
  {
    it.Set(1000);
  }

  // Measure the execution time of the filter. The execution time reveals the GPU limitation
  // in cases where the watchdog timer does not crash the program.
  MeanFilterType::Pointer mean = MeanFilterType::New();
  mean->SetInput(volume);
  mean->SetRadius(3); // normal execution
  // mean->SetRadius(6); // abnormal execution
  itk::TimeProbe timer;
  timer.Start();
  mean->Update();
  timer.Stop();
  std::cout << "Elapsed time in mean computation: " << timer.GetMean() << '\n';
}
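For comparison on small volumes, the per-voxel work described in this report can be written as a plain CPU reference. This is only a sketch under stated assumptions: it is not the ITK kernel, the names `MeanResult` and `BoxMean3D` are hypothetical, voxels are assumed to be stored x-fastest, and neighbors outside the volume are simply skipped, which may differ from the border handling of the GPU filter. It also counts the neighbor reads, which makes the growth with the radius visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct MeanResult
{
  std::vector<float> values; // filtered volume
  std::size_t        reads;  // total number of neighbor reads performed
};

// CPU reference for a 3D box mean. Voxels are stored x-fastest.
// Neighbors outside the volume are skipped (an assumption of this sketch).
MeanResult
BoxMean3D(const std::vector<float> & in, int nx, int ny, int nz, int radius)
{
  assert(radius >= 0);
  assert(in.size() == static_cast<std::size_t>(nx) * ny * nz);

  MeanResult result{ std::vector<float>(in.size()), 0 };
  for (int z = 0; z < nz; ++z)
    for (int y = 0; y < ny; ++y)
      for (int x = 0; x < nx; ++x)
      {
        double      sum = 0.0;
        std::size_t count = 0;
        // Three nested loops per voxel, each running up to 2 * radius + 1 times.
        for (int dz = -radius; dz <= radius; ++dz)
          for (int dy = -radius; dy <= radius; ++dy)
            for (int dx = -radius; dx <= radius; ++dx)
            {
              const int px = x + dx, py = y + dy, pz = z + dz;
              if (px < 0 || px >= nx || py < 0 || py >= ny || pz < 0 || pz >= nz)
                continue;
              sum += in[(static_cast<std::size_t>(pz) * ny + py) * nx + px];
              ++count;
            }
        // count is at least 1 because the center voxel is always inside.
        result.values[(static_cast<std::size_t>(z) * ny + y) * nx + x] = static_cast<float>(sum / count);
        result.reads += count;
      }
  return result;
}
```

On a constant input such as the synthetic image above, every output value should equal the input value, which gives a quick sanity check for any radius.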

Expected behavior

Ideally, I would expect this code to run successfully for any radius value. I would expect higher execution times for higher radius values, but those times should grow consistently with the radius value, the image size and the hardware characteristics. I would not expect such abrupt rises in execution time from small increments of the kernel radius. There should not be a limit above which execution times grow by far more than 100% after a minor change (e.g. in the dummy test, 32 iterations per loop take 0.4s while 34 take 257s). It may be useful to warn the user about the maximum kernel radius allowed for convolutional filters, based on the characteristics of the image and the detected hardware. To this end, a clarification on how to compute the algorithm limits may be useful.

Actual behavior

For every image size, there is a radius limit. Below this limit, execution is fast. Above it, execution is extremely slow and the graphics card crashes if the watchdog timer is activated. The radius limit depends on the image size:

  1. Size = (600, 600, 300) Radius limit = 3 – 4
  2. Size = (500, 500, 200) Radius limit = 4 – 5
  3. Size = (400, 400, 100) Radius limit = 7
  4. Size = (200, 200, 50) Radius limit = 15

In our experiments, cases 3 and 4 behave differently across repeated executions of the same code (success or crash).
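One way to read these limits is through the total work per filter run. The sketch below assumes the work is roughly the number of voxels times (2 * radius + 1)^3 and ignores border effects; the function names and the budget value are hypothetical, not part of ITK or OpenCL. Taking the upper value of each reported range, the four cases give about 7.9e10, 6.7e10, 5.4e10 and 6.0e10 neighbor reads, all of the same order of magnitude. That would be consistent with a cutoff on total kernel time, such as a watchdog, although this is an interpretation and not a measurement.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on neighbor reads for one run of a box mean filter:
// voxels * (2 * radius + 1)^3. Border effects are ignored.
std::uint64_t
NeighborReads(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz, std::uint64_t radius)
{
  const std::uint64_t width = 2 * radius + 1;
  return nx * ny * nz * width * width * width;
}

// Largest radius whose estimated read count stays within `budget`.
// `budget` is a hypothetical, device-specific value that would need calibration.
// Returns -1 if even radius 0 exceeds the budget.
int
MaxRadiusWithinBudget(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz, std::uint64_t budget)
{
  assert(nx > 0 && ny > 0 && nz > 0);
  // Arbitrary search cap; it keeps the estimate away from 64-bit overflow
  // for image sizes like the ones in this report.
  const int maxSearchRadius = 1024;
  int       radius = -1;
  while (radius < maxSearchRadius &&
         NeighborReads(nx, ny, nz, static_cast<std::uint64_t>(radius + 1)) <= budget)
  {
    ++radius;
  }
  return radius;
}
```

With a budget of 6e10 reads, chosen here only because it fits the numbers in this report, `MaxRadiusWithinBudget` returns 3, 4, 7 and 15 for the four image sizes above. A warning in the filter could compare the requested radius against such an estimate, provided the budget is calibrated for the detected hardware.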

Reproducibility

This error happens every time.

Versions

This test was executed using ITK 4.13 and 5.0. The problem is not in ITK itself but in how the graphics card is managed; still, it may be useful to warn the user about it.

Environment

This test was performed on Windows with a Quadro P1000 graphics card, and it crashed for radius values above the limit. It was also tested on Windows with a Quadro P2000 graphics card: the same limits remained, but the execution didn't crash. It took longer than it should, but it did not crash, as we believe the watchdog timer was not activated. The test was also executed on Linux with a Quadro P2000 graphics card, with the same behavior as in the second case.