JuliaGPU / ArrayFire.jl

Julia wrapper for the ArrayFire library

Linear Algebra routines are slow #12

ranjanan closed this issue 7 years ago

ranjanan commented 8 years ago

Some linear algebra routines are slow on ArrayFire.

a = rand(1000, 1000) #Generate double precision random values
ad = AFArray(a) #Transfer to GPU
@time svd(a); # CPU
0.487003 seconds (43 allocations: 53.529 MB, 0.69% gc time)
@time svd(ad); # GPU
5.481788 seconds (14 allocations: 336 bytes)
@time lu(a); #CPU
0.023986 seconds (38 allocations: 22.905 MB, 14.06% gc time)
@time lu(ad); # GPU
0.057869 seconds (17 allocations: 384 bytes)
@time qr(a); # CPU
0.113873 seconds (46 allocations: 31.068 MB, 2.55% gc time)
@time qr(ad); #GPU
0.891739 seconds (14 allocations: 336 bytes)
ViralBShah commented 8 years ago

Can you post chol also? It seems strange that chol is so much faster, while the others are all slower. Something seems amiss.

Cc @andreasnoack

pavanky commented 8 years ago

Is this on CUDA?

pavanky commented 8 years ago

Keep in mind that many consumer GPUs have terrible double precision performance (5-10x slower than single precision).

On top of that, a couple of routines (QR and SVD) are slow in CUDA 7.5 for square matrices. Apparently cuSolver has been optimized for tall, thin matrices, and this will be fixed in later releases.
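For reference, a rough sketch of how the square vs. tall-and-thin claim could be checked from Julia (the sizes and the warm-up call are arbitrary choices, not from this thread):

for dims in ((1000, 1000), (100_000, 10))
    a  = rand(Float32, dims...)
    ad = AFArray(a)
    qr(ad); sync()            # warm-up so one-time kernel compilation is not timed
    println("size = $dims")
    @time qr(a)               # CPU
    @time (qr(ad); sync())    # GPU; sync() waits for the queued work to finish
end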

ViralBShah commented 8 years ago

Makes sense to optimize for the skinny case. This is on Amazon GPU instances, I believe.

pavanky commented 8 years ago

The Amazon GPU instance is a virtualized GRID GPU based on GK104. I can't find the specific numbers for that particular GPU, but I am pretty confident that it is a rebranded NVIDIA Tesla K10, which is optimized for single precision performance.

According to this spec sheet, the single precision performance is 4.58 TFLOPS, whereas the double precision performance is 0.19 TFLOPS.

Would it be possible to run these numbers again in single precision?

ranjanan commented 8 years ago

Cholesky factorization is actually fast on ArrayFire

a = rand(2000,2000)
a = a * a'
ad = AFArray(a)
@time chol(a) #CPU
0.054437 seconds (14 allocations: 30.518 MB, 1.99% gc time)
@time chol(ad) #GPU
0.001835 seconds (8 allocations: 224 bytes)
ranjanan commented 8 years ago

@pavanky I'm still getting bad performance on single precision

a = rand(Float32, 1000, 1000)
ad = AFArray(a)
@time svd(a); # CPU
  0.411077 seconds (43 allocations: 26.796 MB, 15.15% gc time)
@time svd(ad); # GPU
  4.875213 seconds (14 allocations: 336 bytes)
@time lu(a); #CPU
  0.018612 seconds (38 allocations: 11.460 MB, 10.89% gc time)
@time lu(ad); # GPU
  0.053923 seconds (17 allocations: 384 bytes)
@time qr(a); # CPU
  0.073434 seconds (46 allocations: 15.535 MB)
@time qr(ad); #GPU
  0.805024 seconds (14 allocations: 336 bytes)

The GPU in question is GRID K520.

pavanky commented 8 years ago

Btw, you shouldn't be checking Cholesky factorization on a random matrix. Cholesky factorization only succeeds for symmetric, positive definite matrices. It could just be that ArrayFire is bailing out early on a failure case.

A good (but not guaranteed) way to create an input for Cholesky would be something like the following:

a = rand(n, n)
a = n * eye(n) + a + a'    # symmetrize and shift the diagonal, so the result is very likely positive definite
ranjanan commented 8 years ago

@pavanky Sorry, I had originally done a = a * a' to make it symmetric positive definite, but didn't put that in the comment. I've edited the comment so that it reflects this. I've also verified that the answers from the CPU and GPU are the same, so I don't think ArrayFire is erroring out early. It seems to be genuinely faster in the case of Cholesky.
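A sketch of that kind of check, assuming the GPU factor can be copied back to the host with Array() and that both backends return the factor with the same (upper/lower) convention:

a  = rand(1000, 1000)
a  = a * a'                    # symmetric positive (semi)definite
ad = AFArray(a)

c_cpu = chol(a)                # CPU factor
c_gpu = Array(chol(ad))        # GPU factor, copied back to the host
isapprox(c_cpu, c_gpu; rtol = 1e-5)   # should hold if neither side bailed out early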

pavanky commented 8 years ago

The QR and SVD numbers look similar to what I am seeing on a K20. The LU numbers are slightly better than what you are seeing here.

That said, the GPU's LU performance scales better with size than the CPU's.
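A sketch that would show the scaling (the sizes are arbitrary; each GPU timing is preceded by a warm-up call and followed by sync() so the queued work has actually finished):

for n in (1000, 2000, 4000)
    a  = rand(Float32, n, n)
    ad = AFArray(a)
    lu(ad); sync()               # warm-up
    println("n = $n")
    @time lu(a)                  # CPU
    @time (lu(ad); sync())       # GPU
end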

ViralBShah commented 8 years ago

I thought GPUs would have caught up on double precision performance by now. Or is this particular one on AWS an old one?

Seems this issue can be closed.

pavanky commented 8 years ago

@ViralBShah NVIDIA targets double precision in only a few (Tesla / Quadro) GPUs, and this particular one on AWS is especially bad at it. The K40, for example, can get >1 TFLOPS in double precision.

ViralBShah commented 8 years ago

Cc @alanedelman

davidavdav commented 8 years ago

I have trouble showing any performance gain using ArrayFire on a new (but consumer) NVIDIA GTX 1080 in Float32, even for things as simple as a matrix multiplication. For a test function like

function ata(a, n)
    # Repeatedly form a' * a; with an AFArray the work is queued on the GPU,
    # so sync() at the end waits for it to actually finish.
    for i in 1:n
        b = a' * a
    end
    sync()
end

the CPU is, in my test, several times faster than the GPU, for various sizes of a and values of n.
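For reference, a sketch of how such a comparison might be run (the size, iteration count, and warm-up call are arbitrary choices):

a  = rand(Float32, 1000, 1000)
af = AFArray(a)

ata(af, 1)             # warm-up; the first call may trigger kernel compilation
@time ata(a, 100)      # CPU
@time ata(af, 100)     # GPU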

pavanky commented 8 years ago

@davidavdav which version of ArrayFire are you using? Only a recent commit on ArrayFire's devel branch fixed issues on compute 6.1 GPUs (of which the 1080 is one), and this only works with the CUDA 8.0 release candidate.

If you are using an older version of ArrayFire, it is likely that the kernels are being compiled on the fly for the GTX 1080, causing a lot of delay.

davidavdav commented 8 years ago

I compiled ArrayFire at commit 3ceb2f9, which is 12 days old. The CUDA release is indeed 8.0-rc.

The ArrayFire tests in build/test pass, but run orders of magnitude slower on the GPU than on the CPU.

The benchmarks in build/examples/benchmark are odd: where I get 25 GFLOPS on the CPU, I get 8000 GFLOPS on CUDA, but it takes several minutes (at 100% CPU) before the process output starts, during which nothing spectacular shows up in nvidia-smi except the varying memory usage of blas_cuda. Then the output gives some version info:

ArrayFire v3.4.0 (CUDA, 64-bit Linux, build 3ceb2f9)
Platform: CUDA Toolkit 8, Driver: 367.27
[0] GeForce GTX 1080, 8114 MB, CUDA Compute 6.1
pavanky commented 8 years ago

Yeah, those symptoms are indicative of the kernels being generated on the fly for compute 6.1. There was a bug in our build system that did not properly enable CUDA compute 6.1. You will need the latest devel (last updated 2 days ago) to fix the issue.

davidavdav commented 8 years ago

Hi,

I upgraded ArrayFire to commit 072d507, compiled and reinstalled it, and checked out the latest ArrayFire from master. It looks like ArrayFire's tests in build/test and the benchmark in build/examples/benchmarks/blas_cuda no longer have the weird start-up delay, and blas_cuda shows 8000+ GFLOPS.

But the timing of ata above stays the same, with the CPU being about 5x faster than the GPU for Float32 arrays. Also, and maybe this is indicative of something, I don't see any process running on the GPU in nvidia-smi. Maybe I am missing an init function of some kind?

pavanky commented 8 years ago

@davidavdav can you check the GPU performance by putting gc() inside the for loop?

davidavdav commented 8 years ago

julia> function ata(a, n)
           for i in 1:n
               gc(); b = a' * a
           end
           sync()
       end

Like that?

julia> @time ata(af, 100)
  4.750789 seconds (408 allocations: 12.719 KB, 67.91% gc time)
true

julia> @time ata(a, 100)
  3.810594 seconds (508 allocations: 381.482 MB, 86.68% gc time)

julia> typeof(af)
ArrayFire.AFArray{Float32,2}

julia> size(af)
(1000,1000)
ivfiev commented 8 years ago

Using Windows 10, CUDA 8, and a GTX 1080, transposing and multiplying seems to be quite slow in some cases:

X = AFArray(rand(Float32, 10000, 10000))

tic()
X' * X
sync()
toc() # elapsed time: 1.19689447 seconds

tic()
X * X'
sync()
toc() # elapsed time: 0.264948944 seconds
ViralBShah commented 8 years ago

Are these upstream issues?

pavanky commented 8 years ago

Hmm, that doesn't seem right. @ivfiev can you try running this multiple times?

ivfiev commented 8 years ago

@pavanky same issue when running multiple times

  1.680996 seconds (64.49 k allocations: 2.884 MB)
  0.259543 seconds (739 allocations: 42.430 KB)
  1.222253 seconds (320 allocations: 18.584 KB)
  0.279180 seconds (307 allocations: 18.162 KB)
  1.192312 seconds (320 allocations: 18.584 KB)
  0.271068 seconds (307 allocations: 18.162 KB)

Tried with different versions of CUDA 8 (RC and current); same problem.

This works fine though:

XT = X'
XT * X
# 0.27 seconds

So it happens only when we transpose and multiply in the same expression.
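A small sketch comparing the two forms side by side, using the same tic()/sync()/toc() pattern as above (the matrix size is the one from the earlier snippet):

X = AFArray(rand(Float32, 10000, 10000))

# Fused form: transpose and multiply in one expression (the slow path seen above).
tic(); X' * X; sync(); toc()

# Workaround: materialize the transpose once, then multiply.
XT = X'; sync()
tic(); XT * X; sync(); toc()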