Execution time speed compared to ArrayFire C++

altaybrusan commented 8 years ago

Hi, I have developed an ArrayFire image processing algorithm in C++. The execution time was around 78-80 ms. Then I re-implemented the algorithm via the .NET wrapper. Now the execution time is around 120 ms. Is this normal?

PS: I also noticed that the .NET wrapper uses only about 50% of the GPU, whereas the original C++ version uses 100%. One more thing: the .NET version has no "warm-up" time, whereas the original C++ version needs about 1 min to execute the first time. I am not sure yet whether that is normal or not!

shehzan10 commented 8 years ago

I don't think any of us have benchmarked the performance of the .NET wrapper compared to C++. But I would certainly believe the C++ Wrapper would be faster.

One of the main reasons for this is that the .NET wrapper uses the unified backend to load DLLs, whereas the native libraries can be called directly and do not spend time loading DLLs (not sure if you are counting this towards your time).

Ideally, to benchmark, you should run the main part of your code in a loop and then take the average time.
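A minimal sketch of such a benchmark loop, using only `System.Diagnostics.Stopwatch`. The `Benchmark` class and its `AverageMilliseconds` helper are hypothetical names, not part of the wrapper, and the `Thread.Sleep` call is only a placeholder for the real workload. When timing ArrayFire code, the workload would presumably need to end with a device sync (such as the `ArrayFire.Device.Sync()` call used elsewhere in this thread) so that queued GPU work is included in the measurement.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper; not part of the ArrayFire .NET wrapper.
static class Benchmark
{
    // Runs `work` a few times untimed (warm-up), then returns the mean
    // wall-clock time per timed iteration in milliseconds.
    public static double AverageMilliseconds(Action work, int warmup, int iterations)
    {
        if (work == null) throw new ArgumentNullException(nameof(work));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        // Warm-up runs absorb one-time costs such as DLL loading and JIT compilation.
        for (int i = 0; i < warmup; i++) work();

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) work();
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    static void Main()
    {
        // Placeholder workload; replace with the filter call plus a device sync.
        double ms = AverageMilliseconds(() => Thread.Sleep(5), 3, 20);
        Console.WriteLine("average: " + ms.ToString("F2") + " ms");
    }
}
```

Keeping the warm-up iterations out of the timed region separates one-time start-up cost from the steady-state cost per frame, which is the number worth comparing against the C++ version.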

Once the wrapper is feature complete, we can look into optimizing it further.

pavanky commented 8 years ago

@altaybrusan can you show us the output of af::info() from C++ and the equivalent from .net ?

altaybrusan commented 8 years ago

First, the project is a WinForms application, and I call the filter function AFTER I call setBackend. The output of the wrapper is: Quadro_600 CUDA v7.5 2.1

This matches the output of the C++ version: [0] Quadro 600, 1024 MB, CUDA Compute 2.1

Do you need any further details or anything else? Finally, thanks to the ArrayFire team, especially you, Pavanky.

altaybrusan commented 8 years ago

Actually, the problem I have is how to send a bitmap object to ArrayFire. There is a loadMem function, but I could not find a way to pass images into it. Do you have any idea how I can send a bitmap object directly to the native C++ functions? @shehzan10 I had already tested it that way, but I still get the same results.

altaybrusan commented 8 years ago

I also monitored NVIDIA Control Panel > Manage GPU Utilization > GPU Utilization Graph. The original C++ functions utilize the GPU completely, whereas the wrapper reaches just 50%. Do you have any similar experiences or suggestions?

royalstream commented 8 years ago

@altaybrusan is the image loading process part of the loop? What code is inside the timed loop and what code is outside? I just want to make sure we're all on the same page. PS: I don't know if this helps but I can also add the loadImage function to the wrapper tomorrow.

altaybrusan commented 8 years ago

@pavanky
Q: Is the image loading process part of the loop?

No, the image loading is done outside.

Q: What code is inside the timed loop and what code is outside?

Here is the code inside the filter function:

// The logic of the stack is as follows:
// Incoming images are stored in an image stack with a fixed depth; the stack depth (i.e. stackDepth) is a
// class-level parameter.
// The stack content is then convolved with a kernel (smoothingKernel); the output is a smooth image.
// The smooth image is then convolved with a Laplacian kernel to get a sharp image.
// The sharp image is then trimmed to improve contrast.

 public ArrayFire.Array FluoroFilter(ArrayFire.Array image)
        {
            try
            {

                // STEP ONE: Find the location of the new image in stack and put it in 
                imageIndext = imageIndext % stackDepth;
                imageStack[ArrayFire.Util.Span, ArrayFire.Util.Span, imageIndext] = image;

               // To check the performance start a timer
                Stopwatch sw = new Stopwatch();
                sw.Start();

                // STEP TWO: convolve with smoothingKernel to get smooth image
                smoothImage = ArrayFire.SignalProcessing.convolve3(imageStack, smoothingKernel);
                // STEP THREE: convolve with laplacian to get sharp image.
                sharpImage = ArrayFire.SignalProcessing.convolve(smoothImage, laplacian)[Util.Span, Util.Span, 3];

                // STEP FOUR: improve the contrast
                result= ArrayFire.Arith.Pow(sharpImage/255,gam)*Cgam;                

                // STEP FIVE: advance the index by one
                imageIndext++;

                 // Report the total elapsed time    
                ArrayFire.Device.Sync();
                sw.Stop();
                Debug.WriteLine(sw.ElapsedMilliseconds.ToString());
                sw.Reset();
                // FINALLY: rescale the image from [0,1] to [0,255] to turn into a bitmap    
                result = ArrayFire.Arith.Floor(result * 256);
                return result;
            }
            catch (Exception e)
            {
                Debug.WriteLine(e);
                throw;
            }
        }

        #endregion
    }

Here is the code outside the filter function:

// Make an object from filter class and set its parameters
  FluoroscopeFilter filter = new FluoroscopeFilter();
   const int depth = 4;
   const int wLen = 5;
   const int camLen = 1024; 
filter.SetFilterParameters(wLen, depth, 1024);

// At the moment the images are saved on disk and simply loaded.
// However, in the final release the images will come from a camera!
// The steps to get an image from disk into an ArrayFire array object are:
// 1- make a buffer.
// 2- load the image into a byte[]
// 3- turn the byte array into an int[]
// 4- make an array out of the int[]
// I think this approach is rather naive, and I was wondering whether there is a better way.

 int[,] imageBuffer;

// There are 100 images; the file paths to them are loaded in List<string> imagesFiles
// For each image file, do the filtering operations
 for (int idx = 0; idx < imagesFiles.Count; idx++)
     {
            // Load File gets a string to an image file location then turn it to a 2D int[]
            imageBuffer= FileOperations.LoadFile(imagesFiles[idx]);

              // create array from 2D int[,]
             ArrayFire.Array arr = Data.CreateArray(imageBuffer);

             //Apply filter over array object
             arr = filter.FluoroFilter(arr);

             // the output of the filter function is of type double before making an image turn it to int-type             
             ArrayFire.Array arr2 = Data.Cast<int>(arr);
             int[,] imageBuffer2 = Data.GetData2D<int>(arr2);

             // The images are black and white; as the final step, flatten the int[,] (2D int array) into a byte[] (1D byte array)
             int rowCount = imageBuffer2.GetLength(0);
             int colCount = imageBuffer2.GetLength(1);
             byte[] buffer = new byte[rowCount * colCount];

             for (int index = 0; index < buffer.Length; index++)
                {
                    // Row-major flattening: row and column are both derived from the column count.
                    int a = index / colCount;
                    int b = index % colCount;

                    // The original read from an undeclared `bimg`; imageBuffer2 is assumed to be the intended source.
                    buffer[index] = (byte)imageBuffer2[a, b];
                }

                // At the final step make bitmap!
                var columns = 1024;
                var rows = 1024;
                var stride = columns;                                                
                var im = new Bitmap(columns, rows, stride,
                         PixelFormat.Format8bppIndexed,
                         Marshal.UnsafeAddrOfPinnedArrayElement(buffer, 0));                

                // Set the palette for gray shades
                ColorPalette pal = im.Palette;
                for (int i = 0; i < pal.Entries.Length; i++)
                    pal.Entries[i] = Color.FromArgb(i, i, i);
                im.Palette = pal;

                // Display image on the screen!
                pictureBox.Image = im;
                pictureBox.Refresh();

altaybrusan commented 8 years ago

@pavanky As you can see, the process of sending an image to ArrayFire from C# is really cumbersome! To send: Bitmap -> int[,]. To get: array -> double[,] -> int[,] -> byte[,] -> Bitmap!

It would be good to shorten this whole process.

altaybrusan commented 8 years ago

@pavanky
In case you want to know how I load bitmap objects:

    public static int[,] LoadFile(string path)
    {
        try
        {
            image = new Bitmap(path);
            bmpData = image.LockBits(new Rectangle(new Point(), image.Size), ImageLockMode.ReadOnly, PixelFormat.Format8bppIndexed);

            // initiate temporary variables.
            int width = bmpData.Stride;
            int hight = bmpData.Height;
            buffer = new byte[width * hight];

            //copy bitmap data into buffer.
            Marshal.Copy(bmpData.Scan0, buffer, 0, width * hight);
            image.UnlockBits(bmpData);

            // convert byte array to integer array.
            ibuffer = buffer.Select((x) => (int)x).ToArray();

            // convert one dimension array to two dimension.
            ibuffer2D = new int[width, hight];

            for (int i = 0; i < width; i++)
            {
                for (int j = 0; j < hight; j++)
                {
                    ibuffer2D[i, j] = buffer[i * width + j];
                }
            }

            return ibuffer2D;

        }
        catch(Exception ex)
        {
            throw new Exception("Invalid argument", ex);
        }

    }

pavanky commented 8 years ago

@altaybrusan I am not really familiar with Windows or .NET. I was just trying to make sure the versions being picked up are consistent with each other.

I'll let @royalstream (who is doing a great job developing this project) and @shehzan10 (who's helping him out) take care of this.

shehzan10 commented 8 years ago

@altaybrusan is there any reason why you are going from C++ -> .NET? Most people would start with the wrapper (because they are most comfortable with a wrapper language) and then move to C++ to get to higher performance.

altaybrusan commented 8 years ago

@royalstream @shehzan10 The main body of the project (major packages and components) is in C#. This project is going to be a real application of ArrayFire in medical image processing (fluoroscopy). Rather than going back to C++, I think that finding a way to process the video/image stream through the wrapper would be good evidence that ArrayFire in general, and its .NET wrapper specifically, is applicable to other image/video processing tasks. As I said, the bottleneck is streaming between the managed .NET environment and the unmanaged C++ environment. Do you have any ideas or suggestions on how I can get around this bottleneck? That would serve as a proof of concept for C++ ArrayFire / .NET ArrayFire in video/image processing. (One solution I am considering is to put the filter in C++ ArrayFire: when a new frame is received, save it as a bitmap file on the HDD and send the file path to the filter. The filter then loads the image internally using the load command and processes it.)

royalstream commented 8 years ago

I noticed the code inside the loop has many operations, some of them involving slicing, some of them involving convolutions, etc. In theory all the .Net to C++ marshalling shouldn't be heavy because all we're passing around are pointers to af_array objects and small objects like af_seq. Maybe you can take the milliseconds at different positions in the loop and do the same on C++, hopefully one of the operations is to blame for most of the additional delay. It could be the slicing (all those temporary af_seq objects that I need to create in .Net) but it could be something else. I would also make sure the Release build has all the possible optimizations enabled.

pavanky commented 8 years ago

@royalstream

ArrayFire performs copies on assignment if the LHS array of the assignment operation has more than one reference. Because of RAII in C++ this happens fairly rarely and copies are performed only occasionally.

For garbage collected languages such as C#, any temporary references are not cleared until the garbage collector is called. As far as arrayfire knows, those references are still in play. So it performs copies every time you do slice + assignment operations.

One hack around this would be to call the garbage collector in the wrapper right before any af_assign_* function is called. This might slow down other parts of the C# program, but it would speed up the ArrayFire parts.
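A rough sketch of that hack, under stated assumptions: `FakeArrayRef` is a hypothetical stand-in for a wrapper object whose finalizer is assumed to release its native reference, and `CollectThenAssign` is an illustrative name rather than an existing wrapper method. Only `GC.Collect` and `GC.WaitForPendingFinalizers` are real .NET calls here.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical stand-in for a wrapper object that owns one native array reference.
sealed class FakeArrayRef
{
    public static int Live;   // references that have not been released yet

    public FakeArrayRef() { Interlocked.Increment(ref Live); }

    // In a real wrapper the finalizer is assumed to call af_release_array.
    ~FakeArrayRef() { Interlocked.Decrement(ref Live); }
}

static class AssignHelper
{
    // Collects unreachable wrappers and runs their finalizers, so that stale
    // references are dropped before the assignment is performed.
    public static void CollectThenAssign(Action nativeAssign)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        nativeAssign();
    }

    // Creates wrappers that become garbage as soon as the method returns.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void MakeTemporaries(int count)
    {
        for (int i = 0; i < count; i++) new FakeArrayRef();
    }

    static void Main()
    {
        MakeTemporaries(10);
        CollectThenAssign(() =>
            Console.WriteLine("live refs at assign time: " + FakeArrayRef.Live));
    }
}
```

A forced collection is expensive, so doing this before every assignment trades general C# throughput for fewer array copies, as noted above.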

pavanky commented 8 years ago

@unbornchikken brought up this issue a few days ago when talking about the arrayfire-js wrapper.

royalstream commented 8 years ago

@pavanky could we accomplish the same with a call to af_release_array?

If that's the case another option would be calling .Dispose() on every ArrayFire.Array object created inside the loop (which will in turn call af_release_array) or with using blocks (C#'s syntactic sugar to call .Dispose() automatically).
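A small sketch of the `using` approach, assuming a hypothetical `NativeArray` class in place of `ArrayFire.Array`; the `Released` counter stands in for the call to `af_release_array`, and `Convolve` is a placeholder rather than a wrapper function.

```csharp
using System;

// Hypothetical stand-in for the wrapper's array class.
sealed class NativeArray : IDisposable
{
    public static int Released;   // counts simulated af_release_array calls
    private bool disposed;

    public void Dispose()
    {
        if (disposed) return;     // a second Dispose call is harmless
        disposed = true;
        Released++;               // a real wrapper would call af_release_array here
    }
}

static class DisposeDemo
{
    // Placeholder for an operation that returns a new temporary array.
    static NativeArray Convolve(NativeArray input) { return new NativeArray(); }

    public static void ProcessFrame(NativeArray frame)
    {
        // Each using block calls Dispose when control leaves it,
        // even if an exception is thrown inside.
        using (NativeArray smooth = Convolve(frame))
        using (NativeArray sharp = Convolve(smooth))
        {
            // work with `sharp` here
        }
    }

    static void Main()
    {
        using (NativeArray frame = new NativeArray())
        {
            ProcessFrame(frame);
        }
        Console.WriteLine("released: " + NativeArray.Released);
    }
}
```

The point of the pattern is that the native reference is dropped at a known place in the code instead of whenever the garbage collector happens to run.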

pavanky commented 8 years ago

@royalstream yes, calling af_release_array / .Dispose() on temporary arrays should work.

On a related note I am assuming the Finalize() method also calls af_release_array?

pavanky commented 8 years ago

@royalstream btw, calling Dispose still does not solve the issue of temporary arrays created during function chains.

For example when you do

c = a + b - 3;

The output of a + b is a temporary array whose memory is not cleared immediately after - 3 is performed.

This will not slow down assigns, but if enough of these arrays exist, it increases the number of buffers in arrayfire's memory manager and can eventually slow down array creation.

This may even result in out of memory errors (at which point you could call the garbage collector), but I recommend calling the garbage collector from time to time based on some criteria.

For arrayfire-r, I just keep track of the amount of memory and the number of buffers allocated, and call the GC whenever 1 GB of memory is allocated or when the number of buffers in ArrayFire's memory manager is greater than 50. These numbers are arbitrary and can be changed to suit the user's needs.
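A sketch of that bookkeeping in C#, not taken from arrayfire-r or from the .NET wrapper. `AllocationTracker` and its members are hypothetical names; the 1 GB and 50-buffer limits are the arbitrary values quoted above, and the collection action is passed in so it can be replaced.

```csharp
using System;

// Hypothetical bookkeeping helper; not part of any ArrayFire wrapper.
sealed class AllocationTracker
{
    private readonly object gate = new object();
    private readonly long maxBytes;
    private readonly int maxBuffers;
    private readonly Action collect;
    private long bytes;
    private int buffers;

    public int CollectionsTriggered { get; private set; }

    public AllocationTracker(long maxBytes, int maxBuffers, Action collect)
    {
        this.maxBytes = maxBytes;
        this.maxBuffers = maxBuffers;
        this.collect = collect;
    }

    // Call when a native buffer of `size` bytes is created.
    public void OnAllocate(long size)
    {
        bool overLimit;
        lock (gate)
        {
            bytes += size;
            buffers++;
            overLimit = bytes >= maxBytes || buffers > maxBuffers;
        }
        if (!overLimit) return;

        // Collect outside the lock: finalizers may call OnRelease on another thread.
        CollectionsTriggered++;
        collect();
    }

    // Call when a native buffer is released, for example from Dispose or a finalizer.
    public void OnRelease(long size)
    {
        lock (gate)
        {
            bytes -= size;
            buffers--;
        }
    }
}

static class TrackerDemo
{
    static void Main()
    {
        AllocationTracker tracker = new AllocationTracker(1L << 30, 50, () =>
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        });

        for (int i = 0; i < 60; i++)
            tracker.OnAllocate(8L * 1024 * 1024);   // pretend each buffer is 8 MB

        Console.WriteLine("collections triggered: " + tracker.CollectionsTriggered);
    }
}
```

In this demo nothing ever calls `OnRelease`, so the tracker keeps firing once the limit is passed; in a real wrapper the finalizers or `Dispose` would report releases and bring the counts back down.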

royalstream commented 8 years ago

@pavanky yes, that's exactly what the .net wrapper is doing, I took the R wrapper as a reference. Also, invoking the garbage collector explicitly does free the temporary objects, I did some basic testing.

altaybrusan commented 8 years ago

@royalstream I have tested a simple project to see the performance of the wrapper

double val = (1 / (double)(5 * 5 * 4));
ArrayFire.Array smoothingKernel = Data.Constant<double>(val, new int[] { 5, 5, 4 });                        
ArrayFire.Array arr1,arr2;
for (int i = 0; i < 100; i++) 
{
  arr1 = Data.RandNormal<double>(1024, 1024,4 );
  arr2 = ArrayFire.SignalProcessing.convolve3(arr1, smoothingKernel);               
}

I receive this error: An unhandled exception of type 'ArrayFire.ArrayFireException' occurred in ArrayFire.dll Additional information: Device out of memory

pavanky commented 8 years ago

@royalstream @altaybrusan Should we close this issue and create a new issue for this ?

@altaybrusan BTW is there any reason you are using doubles for convolution ? The performance is going to be fairly bad for double precision on GPUs.

altaybrusan commented 8 years ago

But in C++ I get excellent performance even with the double data type!

pavanky commented 8 years ago

@altaybrusan I mean single precision performance on GPUs is usually 10x better on newer GPUs. But on older GPUs such as yours it is close to 2.5x better.

altaybrusan commented 8 years ago

I'll check it. Thanks.

royalstream commented 8 years ago

@altaybrusan I think pavanky is right on the money with his comment about memory allocation/deallocation affecting performance. I would like to help: can you please share your definition of ArrayFire.SignalProcessing.convolve3? That's not part of my wrapper, so I want to define it exactly the same way you do. Or, even better, if you have a simple test project you can share it and we can help, ideally with your C++ code included too.

altaybrusan commented 8 years ago

@royalstream here are the codes:

 public static class SignalProcessing
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static Array convolve3(Array signal, Array filter)
        {
            IntPtr ptr;
            Internal.VERIFY(ArrayFire.Interop.AFSignal.af_convolve3(out ptr, signal._ptr, filter._ptr, Interop.af_conv_mode.AF_CONV_DEFAULT, Interop.af_conv_domain.AF_CONV_AUTO));
            return new Array(ptr);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static Array convolve(Array signal, Array filter)
        {
            IntPtr ptr;            
            Internal.VERIFY(ArrayFire.Interop.AFSignal.af_convolve2(out ptr,signal._ptr , filter._ptr, Interop.af_conv_mode.AF_CONV_DEFAULT,Interop.af_conv_domain.AF_CONV_AUTO));
            return new Array(ptr);
        }
    }

and

[DllImport(af_config.dll, ExactSpelling = true, SetLastError = false, CallingConvention = CallingConvention.Cdecl)]
        public static extern af_err af_convolve2(out IntPtr array_out, IntPtr array_signal, IntPtr array_filter, af_conv_mode mode, af_conv_domain domain);

        [DllImport(af_config.dll, ExactSpelling = true, SetLastError = false, CallingConvention = CallingConvention.Cdecl)]
        public static extern af_err af_convolve3(out IntPtr array_out, IntPtr array_signal, IntPtr array_filter, af_conv_mode mode, af_conv_domain domain);

altaybrusan commented 8 years ago

I just implemented the convolve methods the same way you had done for the others.

altaybrusan commented 8 years ago

@royalstream I think the speed problem is due to the heavy operations involved in loading images and then converting them into float[,]. Do you have any opinion on how I can make an array out of a bitmap?

pavanky commented 8 years ago

@altaybrusan if you are doing the same thing in C++ it shouldn't matter. The problem is very likely with GC not killing off temporary references.

unbornchikken commented 8 years ago

Possible explanation: https://gitter.im/arrayfire/arrayfire?at=56fa881976b6f9de194c219d

Unfortunately, unless a C++-like, deterministic RAII scoping mechanism is implemented, the performance will be eaten up by unnecessary copy operations to or from the garbage.

One possible implementation could be an explicit scoping construct that I have implemented in my unofficial Julia wrapper prototype :

https://github.com/unbornchikken/julia-ml-proto/blob/master/ArrayFire/FreeList.jl https://github.com/unbornchikken/julia-ml-proto/blob/master/ArrayFire/AFArray.jl#L27

https://github.com/unbornchikken/julia-ml-proto/blob/master/ArrayFire/examples/ML/ANN.jl#L59

Or in the Node.js wrapper:

https://github.com/arrayfire/arrayfire-js/blob/master/src/arraywrapper.cpp#L337 https://github.com/arrayfire/arrayfire-js/blob/master/src/arraywrapper.cpp#L444 https://github.com/arrayfire/arrayfire-js/blob/master/lib/es6/scope.js

https://github.com/arrayfire/arrayfire-js/blob/master/examples/es6/machine-learning/ann.js#L64

Actually, with this scoping stuff, my Julia frontend's performance is slightly better than the official C++ one's.
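For C#, an explicit scope in the same spirit could be sketched as below. This is not a port of the linked Julia or Node.js code; `ArrayScope`, `Track`, and `Temp` are hypothetical names, with `Temp` standing in for a wrapper array whose `Dispose` would release the native reference.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical scope that owns temporaries and releases them all at once.
sealed class ArrayScope : IDisposable
{
    private readonly List<IDisposable> owned = new List<IDisposable>();

    // Registers a temporary with the scope and returns it for further use.
    public T Track<T>(T item) where T : IDisposable
    {
        owned.Add(item);
        return item;
    }

    // Disposes everything registered, newest first.
    public void Dispose()
    {
        for (int i = owned.Count - 1; i >= 0; i--) owned[i].Dispose();
        owned.Clear();
    }
}

// Stand-in for a wrapper array; Dispose would release the native reference.
sealed class Temp : IDisposable
{
    public bool IsDisposed { get; private set; }
    public void Dispose() { IsDisposed = true; }
}

static class ScopeDemo
{
    static void Main()
    {
        Temp result;
        using (ArrayScope scope = new ArrayScope())
        {
            Temp sum = scope.Track(new Temp());    // e.g. the output of a + b
            Temp diff = scope.Track(new Temp());   // e.g. the output of (a + b) - 3
            result = new Temp();                   // not tracked, so it outlives the scope
            Console.WriteLine("inside scope, sum disposed: " + sum.IsDisposed);
        }
        Console.WriteLine("after scope, result disposed: " + result.IsDisposed);
    }
}
```

Compared with one `using` block per temporary, a scope lets intermediate results from a chain of operations be registered as they are created and released together at a deterministic point.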

royalstream commented 8 years ago

@altaybrusan I agree with @pavanky and @unbornchikken. I would try adding a call to .Dispose() for every temporary array you create inside the loop. Alternatively, you can use C#'s using(...) syntax, which is just syntactic sugar for try ... finally .Dispose(). As @pavanky said, if you have chained operations like a = b + c + d you need to split them, but looking at your code I didn't see any. As I said a few days ago, if you provide a small test project (images included, so we know we're testing the same thing) I can help with this. I would add some scoping like @unbornchikken did and see what we get out of it.

unbornchikken commented 8 years ago

Also consider that P/Invoke is a very slow approach to calling native code from .NET. You have to use a C++/CLI binding to expect the same performance, see: https://msdn.microsoft.com/en-us/library/ky8kkddw.aspx But unfortunately C++/CLI is a barely supported mess that every sane-minded developer avoids at all costs, so I can stand by your decision to go with P/Invoke. Just don't expect too much from it.

royalstream commented 8 years ago

@unbornchikken from personal experience I've never obtained any performance gains that would justify using C++/CLI. I've actually dropped entire implementations done in C++/CLI because it's barely supported and it just wasn't worth it. Data still has to be marshaled from the managed domain into the native heap and if all the parameters are simple, blittable types (like double or double[]) the performance gain is zero. In the past I've obtained wonderful performance wrapping Intel MKL's using P/Invoke, keeping the arrays in the native heap (without copying them to .NET until I really need to) and obviously keeping allocations/deallocations to a minimum. For medium/large matrices the overhead was negligible. I'm going to create a toy example using ArrayFire and calling Dispose() and share the execution times, it's probably worth it.

arrayfire / arrayfire-dotnet

Execution time speed compared to ArrayFire C++ #8