arrayfire / arrayfire-python

Python bindings for ArrayFire: A general purpose GPU library.
https://arrayfire.com
BSD 3-Clause "New" or "Revised" License

Poor slicing performance compared to NumPy #84

Open pavanky opened 8 years ago

pavanky commented 8 years ago

Reported by @floopcz over here: https://github.com/arrayfire/arrayfire/issues/1428

ArrayFire slicing seems to suffer from a performance issue. Consider the following Python code, which benchmarks matrix multiplication on full matrices and on column and row slices:

#!/usr/bin/env python3
from time import time
import arrayfire as af
import numpy as np

af.set_backend('cpu')
af.info()

iters = 1000
n = 512

af_A = af.randu(n, n)
af_B = af.randu(n, n)

np_A = np.random.rand(n, n).astype(np.float32)
np_B = np.random.rand(n, n).astype(np.float32)

start = time()
for t in range(iters):
    np_C = np.dot(np_A, np_B)
print('numpy - dot: {}'.format(time() - start))

af.sync()
start = time()
for t in range(iters):
    af_C = af.matmul(af_A, af_B)
af.sync()
print('arrayfire - matmul: {}'.format(time() - start))

start = time()
for t in range(iters):
    for i in range(np_B.shape[1]):
        np_C = np.dot(np_A, np_B[:, i])
print('numpy - sliced dot - column major: {}'.format(time() - start))

af.sync()
start = time()
for t in range(iters):
    for i in range(af_B.shape[1]):
        af_C = af.matmul(af_A, af_B[:, i])
af.sync()
print('arrayfire - sliced matmul - column major: {}'.format(time() - start))

start = time()
for t in range(iters):
    for i in range(np_B.shape[0]):
        np_C = np.dot(np_B[i, :], np_A)
print('numpy - sliced dot - row major: {}'.format(time() - start))

af.sync()
start = time()
for t in range(iters):
    for i in range(af_B.shape[0]):
        af_C = af.matmul(af_B[i, :], af_A)
af.sync()
print('arrayfire - sliced matmul - row major: {}'.format(time() - start))

The results are as follows:

ArrayFire v3.3.2 (CPU, 64-bit Linux, build f65dd97)
[0] Unknown: Unknown, 15880 MB, Max threads(1) 
numpy - dot: 1.3848536014556885
arrayfire - matmul: 1.325775146484375
numpy - sliced dot - column major: 7.156768798828125
arrayfire - sliced matmul - column major: 38.87605834007263
numpy - sliced dot - row major: 7.6784679889678955
arrayfire - sliced matmul - row major: 41.27544379234314

The results suggest that with slicing, ArrayFire performance is significantly degraded compared to NumPy. I observed similarly distributed results with the GPU backend as well. Both NumPy and ArrayFire are linked against Intel MKL.

Am I doing something "illegal" or is it an inefficiency of the library? Thanks.

pavanky commented 8 years ago

@floopcz Thanks for bringing this to our attention. It looks like the issue might be with the af.Index class. I will look into it.
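
One quick way to check that hypothesis (a minimal sketch, not part of the original report) is to time the slice on its own, without the matmul, so that only the Python-side indexing machinery is exercised:

#!/usr/bin/env python3
# Sketch only: time af_B[:, i] by itself to isolate the indexing overhead.
from time import time
import arrayfire as af

af.set_backend('cpu')
iters = 1000
n = 512
af_B = af.randu(n, n)

af.sync()
start = time()
for t in range(iters):
    for i in range(n):
        col = af_B[:, i]   # slice only; the result is discarded
af.sync()
print('arrayfire - slice only - column major: {}'.format(time() - start))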

FloopCZ commented 8 years ago

@pavanky Thank you for such a swift response! I am sorry for reporting in the wrong repository. I have performed a similar experiment in C++ instead of Python, and the core library indeed seems to be unaffected:

ArrayFire v3.3.2 (CPU, 64-bit Linux, build f65dd97)
[0] Unknown: Unknown, 15880 MB, Max threads(1)
arrayfire - matmul: 1.27784
arrayfire - sliced matmul - column major: 8.0823
arrayfire - sliced matmul - row major: 5.24005

I have also profiled the NumPy part and the ArrayFire part of the script separately, in case it is of any help:

numpy - dot: 0.7875990867614746
numpy - sliced dot - column major: 4.465392827987671
numpy - sliced dot - row major: 7.043829441070557

 Performance counter stats for 'python3 -u ./bench_blas_np.py':

      49300.544661      task-clock (msec)         #    3.978 CPUs utilized          
               204      context-switches          #    0.004 K/sec                  
                23      cpu-migrations            #    0.000 K/sec                  
             6,569      page-faults               #    0.133 K/sec                  
   182,023,933,777      cycles                    #    3.692 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
   135,952,155,036      instructions              #    0.75  insns per cycle        
     9,960,177,375      branches                  #  202.030 M/sec                  
        54,945,481      branch-misses             #    0.55% of all branches        
    40,006,920,413      cache-references          #  811.490 M/sec                  
         3,669,375      cache-misses              #    0.009 % of all cache refs    
             6,569      faults                    #    0.133 K/sec                  
                23      migrations                #    0.000 K/sec                  

      12.394036037 seconds time elapsed

ArrayFire v3.3.2 (CPU, 64-bit Linux, build f65dd97)
[0] Unknown: Unknown, 15880 MB, Max threads(1) 
arrayfire - matmul: 0.6884534358978271
arrayfire - sliced matmul - column major: 42.06879711151123
arrayfire - sliced matmul - row major: 41.07923150062561

 Performance counter stats for 'python3 -u ./bench_blas_af.py':

     352124.665215      task-clock (msec)         #    4.192 CPUs utilized          
         1,072,306      context-switches          #    0.003 M/sec                  
                24      cpu-migrations            #    0.000 K/sec                  
           128,239      page-faults               #    0.364 K/sec                  
 1,294,553,131,202      cycles                    #    3.676 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
   536,611,177,149      instructions              #    0.41  insns per cycle        
    98,101,407,999      branches                  #  278.599 M/sec                  
       973,886,254      branch-misses             #    0.99% of all branches        
    78,021,194,595      cache-references          #  221.573 M/sec                  
        54,132,824      cache-misses              #    0.069 % of all cache refs    
           128,239      faults                    #    0.364 K/sec                  
                24      migrations                #    0.000 K/sec                  

      83.992813939 seconds time elapsed

FloopCZ commented 8 years ago

One more thing to consider. When I run the benchmark written in C++ on my GPU, the slicing performance is also very poor. C++ source: https://gist.github.com/FloopCZ/33d764d5f1c45505e2d8716d519e5e9c

ArrayFire v3.3.2 (CUDA, 64-bit Linux, build f65dd97)
Platform: CUDA Toolkit 7.5, Driver: 352.93
[0] Quadro 2000M, 2048 MB, CUDA Compute 2.1
arrayfire - matmul: 1.43566
arrayfire - sliced matmul - column major: 29.3293
arrayfire - sliced matmul - row major: 124.245

However, my GPU is fairly outdated, so this might be expected behavior.

pavanky commented 8 years ago

@FloopCZ slicing along the first dimension is going to be slow in ArrayFire because ArrayFire is column major. The column-major time comes out to be about 0.07 ms per iteration, which is probably as low as you can go on a GPU. The time will scale better if you use a larger size along the first dimension.
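
A minimal sketch of that advice (an illustration added here, not from the thread): transposing once up front turns the strided row slices into contiguous column slices, since the row product af_B[i, :] x af_A is just the transpose of af_A^T x af_B^T[:, i]:

#!/usr/bin/env python3
# Sketch only: convert row slices into column slices by transposing once.
import arrayfire as af

af.set_backend('cpu')
n = 512
af_A = af.randu(n, n)
af_B = af.randu(n, n)

af_At = af.transpose(af_A)   # pay the transpose cost once, outside the loop
af_Bt = af.transpose(af_B)

for i in range(n):
    # (af_B[i, :] x af_A)^T == af_A^T x af_B^T[:, i]; the slice is now a
    # contiguous column in ArrayFire's column-major layout.
    af_C = af.matmul(af_At, af_Bt[:, i])
af.sync()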

FloopCZ commented 8 years ago

Thank you, I will prefer to keep the last dimension fixed when slicing, whenever it is possible.

FloopCZ commented 8 years ago

Just a little more analysis. Consider the following code, which again computes the matrix product column by column:

#!/usr/bin/env python3
from time import time
import arrayfire as af

af.set_backend('cpu')
af.info()

iters = 1000
n = 512

af_A = af.randu(n, n)
af_B = af.randu(n, n)

af.sync()
start = time()
for t in range(iters):
    for i in range(af_B.shape[1]):
        af_C = af.matmul(af_A, af_B[:, i])
af.sync()
print('arrayfire - sliced matmul - column major: {}'.format(time() - start))

Let us profile its function calls:

python -m cProfile -s tottime bench_blas_af_slice.py

The results are as follows:

arrayfire - sliced matmul - column major: 105.72026777267456
         47183954 function calls (47181436 primitive calls) in 106.075 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1024002    9.978    0.000   23.331    0.000 array.py:55(_create_empty_array)
   512000    8.989    0.000   27.807    0.000 blas.py:17(matmul)
  1536000    8.860    0.000   21.603    0.000 index.py:189(__init__)
  1023999    7.905    0.000    8.322    0.000 array.py:478(__del__)
   512000    7.581    0.000   63.616    0.000 array.py:1005(__getitem__)
  1024004    6.343    0.000    6.343    0.000 util.py:13(dim4)
  1024002    6.240    0.000   31.222    0.000 array.py:382(__init__)
  ...

The largest amount of time was spent in the _create_empty_array function, which was called 1,024,002 times. That count checks out: 2 initial random arrays + 512 columns * 1000 iterations * 2 calls per inner iteration (one for the slice result, one for the matmul result). However, each of these calls only creates a handle to an array of dimensions [0 1 1 1] (the array is filled later by the respective function), so a single call should take almost no time at all. The af.Index initializer, the array destructor and even dim4() also took a considerable amount of time, even though they do almost nothing.

So let us call af.util.dim4() 1,024,000 times:

#!/usr/bin/env python3
from time import time
import arrayfire as af

start = time()
for k in range(1024000):
    af.util.dim4()
print('dim4 loop: {}'.format(time() - start))

The results are:

dim4 loop: 4.396022319793701  

What I wanted to demonstrate is that the degraded slicing performance may not be caused by an inefficiency of the ArrayFire library; it may be an inefficiency of Python itself.
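
As an aside (a sketch added here, not from the thread), for this particular benchmark the per-slice Python overhead can be avoided altogether, because column i of the full product equals the product with column i:

# Sketch only: one full matmul computes every column product at once.
af_C_all = af.matmul(af_A, af_B)   # column i == af.matmul(af_A, af_B[:, i])
af_c = af_C_all[:, 3]              # slice out a result column only when needed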

pavanky commented 8 years ago

@FloopCZ Thanks for digging into this! There are a few helper functions that I implemented in Python before they were available in the upstream library. Maybe some of this overhead can be reduced by using the upstream versions. I also need to look into improving the performance of the Python code itself where needed.

FloopCZ commented 8 years ago

This one might be related: arrayfire/arrayfire#1437.

pavanky commented 8 years ago

@FloopCZ can you test after pulling in https://github.com/arrayfire/arrayfire/pull/1438 ?

FloopCZ commented 8 years ago

Unfortunately, even after arrayfire/arrayfire#1440, the results stay the same.