AccelerateHS / accelerate

Embedded language for high-performance array computations
https://www.acceleratehs.org

No speedup on CUDA #376

Closed: idontgetoutmuch closed this issue 7 years ago

idontgetoutmuch commented 7 years ago

I am writing this as a ticket because I find Google Groups makes following threads difficult.

The following code is supposed to generate 384 "paths" of 10,000 random numbers each, in parallel, 384 being the number of cores on my GPU.

./.stack-work/install/x86_64-osx/lts-8.0/8.0.2/bin/nvidia-device-query 
CUDA device query (Driver API, statically linked)
CUDA driver version 7.5
Detected 1 CUDA capable device

Device 0: GeForce GT 650M
  CUDA capability:                    3.0
  CUDA cores:                         384 cores in 2 multiprocessors (192 cores/MP)

{-# LANGUAGE TypeOperators #-}

{-# OPTIONS_GHC -Wall #-}

module Main ( main ) where

import Data.Array.Accelerate              as A
import Data.Array.Accelerate.LLVM.Native  as CPU
import Data.Array.Accelerate.LLVM.PTX     as GPU
import Prelude hiding ((^))
import qualified Prelude                  as P

-- Linear congruential generator in the style of BSD rand(): the next seed is
-- (seed * 1103515245 + 12345) `mod` 2^31, and the Double component is the
-- incoming seed scaled into [0,1).
bsd :: Exp (Int, Double) -> Exp (Int, Double)
bsd n = let
  m = (A.fst n * 1103515245 + 12345) `mod` 2^(lift (31 :: Int))
  y = A.fromIntegral (A.fst n)
  x = y / 2147483648.0
  in
  lift $
  (m, x)

main :: IO ()
main = mapM_ putStrLn $
       P.map show $
       P.map (P.sum . toList . CPU.run . fold (+) 0 . A.map A.snd . rngsFull)
             (P.take 10 [1, 385..])

rngsFull :: Int -> Acc (Array (DIM1 :. Int) (Int, Double))
rngsFull p = A.scanl (\s x -> bsd (lift (A.fst x + A.fst s, A.snd s)))
                     (lift (1 :: Int, 0.5 :: Double))
                     (seedPair p 384 10000)

seeds :: Int -> Int -> Int -> Acc (Array DIM2 Int)
seeds p m n = transpose $ use $
              A.fromList (Z :. n :. m)
                         (([p..m+p-1]) P.++ (P.replicate (m * (n - 1)) 0))

seedPair :: Int -> Int -> Int -> Acc (Array DIM2 (Int, Double))
seedPair p m n = A.zip xs ys
  where
    xs = seeds p m n
    ys = use $ A.fromList (Z :. m :. n) (P.replicate (m * n) 0.0)

When I run this on the CPU I get

 $ time ./Gillespie +RTS -N1 > out.txt
time ./Gillespie +RTS -N1 > out.txt
Gillespie: forkOS_entry: interrupted
Gillespie: forkOS_entry: interrupted

real    0m5.195s
user    0m5.315s
sys 0m0.305s
 $ time ./Gillespie +RTS -N2 > out.txt
time ./Gillespie +RTS -N2 > out.txt
Gillespie: forkOS_entry: interrupted
Gillespie: forkOS_entry: interrupted

real    0m3.698s
user    0m5.583s
sys 0m0.553s
 $ time ./Gillespie +RTS -N4 > out.txt
time ./Gillespie +RTS -N4 > out.txt
Gillespie: Gillespie: forkOS_entry: interruptedforkOS_entry: interrupted

real    0m2.903s
user    0m6.798s
sys 0m0.977s

So we see some parallelism (NB: using +RTS -s gave some confusing information, which I won't reproduce here).

But using the GPU, where I think I should see something like a 384x speedup, I get

 $ time ./Gillespie > out.txt
time ./Gillespie > out.txt

real    0m4.450s
user    0m3.773s
sys 0m0.315s

tmcdonell commented 7 years ago

There are a couple of things going on here, so I'll try to unpack them. Feel free to ask for clarification etc. on each point.

$ ./issue376-cpu +ACC -ddump-phases -ddump-exec -ACC +RTS -N -RTS
[   2.318] phase sharing-recovery: 482.701 ms (wall), 2.184 s (cpu)
[   2.318] phase rewrite-segment-offset: 482.960 ms (wall), 2.184 s (cpu)
[   2.321] phase array-fusion: 490.809 ms (wall), 2.194 s (cpu)
[   2.399] phase compile: 73.323 ms (wall), 77.878 ms (cpu)
[   2.857] exec: scanS 63.226 ms (wall), 456.956 ms (cpu), 7.23 x speedup
[   2.867] exec: fold 1.656 ms (wall), 9.938 ms (cpu), 6.00 x speedup
[   2.867] phase execute: 65.339 ms (wall), 467.349 ms (cpu), 7.15 x speedup
1920659.9522743225
...

$ ./issue376-ptx +ACC -ddump-phases -ddump-exec -ACC
[   0.433] phase sharing-recovery: 301.485 ms (wall), 301.405 ms (cpu)
[   0.433] phase rewrite-segment-offset: 301.767 ms (wall), 301.687 ms (cpu)
[   0.434] phase array-fusion: 302.846 ms (wall), 302.762 ms (cpu)
[   0.721] phase compile: 253.823 ms (wall), 254.397 ms (cpu)
[   0.953] exec: scan <<< 4, 992, 24304 >>> 50.000 µs (wall), 49.000 µs (cpu), 79.796 ms (gpu)
[   1.029] exec: fold <<< 4, 1024, 12544 >>> 22.000 µs (wall), 32.000 µs (cpu), 35.622 ms (gpu)
[   1.029] phase execute: 116.638 ms (wall), 307.647 ms (cpu)
3972646.614979394

I rewrote your program to use run1, which is usually Super Important™:

{-# LANGUAGE RebindableSyntax #-}
{-# LANGUAGE TypeOperators    #-}
{-# OPTIONS_GHC -Wall #-}

module Main ( main ) where

import Data.Array.Accelerate              as A
import Data.Array.Accelerate.LLVM.Native  as CPU
import Data.Array.Accelerate.LLVM.PTX     as GPU
import Prelude                            hiding ((^), (==))
import qualified Prelude                  as P

bsd :: Exp (Int, Double) -> Exp (Int, Double)
bsd n = let
  m = (A.fst n * 1103515245 + 12345) `mod` 2^(lift (31 :: Int))
  y = A.fromIntegral (A.fst n)
  x = y / 2147483648.0
  in
  lift $
  (m, x)

main :: IO ()
main
  = mapM_ putStrLn
  $ P.map (show . flip indexArray Z)
  $ P.map (go . singleton)
  $ P.take 10 [1, 385..]
  where
    go          = GPU.run1 (A.sum . A.map A.snd . rngsFull)
    singleton x = fromList Z [x]

rngsFull :: Acc (Scalar Int) -> Acc (Array DIM2 (Int,Double))
rngsFull p
  = A.scanl (\s x -> bsd (lift (A.fst x + A.fst s, A.snd s)))
            (constant (1, 0.5))
            (seedPair (constant (Z:.384:.10000)) p)

seedPair :: Exp DIM2 -> Acc (Scalar Int) -> Acc (Array DIM2 (Int,Double))
seedPair sh p
  = A.zip (seeds sh p) (A.fill sh 0)

-- Build the seed matrix directly on the device: entry (m, 0) holds m + p and
-- every other entry is zero, replacing the host-side fromList and transpose.
seeds :: Exp DIM2 -> Acc (Scalar Int) -> Acc (Array DIM2 Int)
seeds sh p
  = A.generate sh
  $ \ix -> let Z :. m :. n = unlift ix :: Z :. Exp Int :. Exp Int
           in  if n == 0
                 then m + the p
                 else 0

(Also, I wasn't sure why you were doing that sum on the host, so I just moved that into Accelerate land as well.)
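For context on why run1 matters: run pushes the whole program through sharing recovery, fusion, code generation and compilation on every call, so the original P.map over ten seeds was paying at least part of that cost on every iteration, whereas run1 compiles the function once and each application only executes the cached kernels. A minimal sketch of the idiom, with a made-up step kernel purely for illustration:

import Data.Array.Accelerate             as A
import Data.Array.Accelerate.LLVM.PTX    as GPU

-- Stand-in kernel; any function of type (Acc a -> Acc b) works the same way.
step :: Acc (Vector Double) -> Acc (Vector Double)
step = A.map (* 2)

main :: IO ()
main = do
  let go = GPU.run1 step                    -- compiled once, outside the loop
  mapM_ (print . go . A.fromList (Z :. 5))  -- each call only executes the
        [ [1..5], [6..10], [11..15] ]       --   already-compiled kernels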

The trace looks much nicer now:

$ ./issue376-ptx +ACC -ddump-phases -ddump-exec -ACC
[   0.136] phase sharing-recovery: 4.452 ms (wall), 4.453 ms (cpu)
[   0.136] phase rewrite-segment-offset: 4.752 ms (wall), 4.752 ms (cpu)
[   0.137] phase array-fusion: 5.824 ms (wall), 5.823 ms (cpu)
[   0.396] phase compile: 259.468 ms (wall), 258.600 ms (cpu)
[   0.663] exec: scan <<< 4, 992, 24304 >>> 46.000 µs (wall), 43.000 µs (cpu), 60.248 ms (gpu)
[   0.827] exec: foldAllM1 <<< 4, 1024, 12544 >>> 24.000 µs (wall), 24.000 µs (cpu), 36.905 ms (gpu)
[   0.832] phase execute: 98.598 ms (wall), 436.020 ms (cpu)
3971731.962091446
[   0.833] exec: foldAllM2 <<< 1, 1024, 12544 >>> 24.000 µs (wall), 21.000 µs (cpu), 17.152 µs (gpu)
[   0.832] exec: foldAllM2 <<< 4, 1024, 12544 >>> 20.000 µs (wall), 19.000 µs (cpu), 55.456 µs (gpu)
[   1.073] exec: scan <<< 4, 992, 24304 >>> 34.000 µs (wall), 34.000 µs (cpu), 52.764 ms (gpu)
[   1.150] exec: foldAllM1 <<< 4, 1024, 12544 >>> 20.000 µs (wall), 20.000 µs (cpu), 17.741 ms (gpu)
[   1.160] exec: foldAllM2 <<< 1, 1024, 12544 >>> 26.000 µs (wall), 24.000 µs (cpu), 10.528 µs (gpu)
[   1.160] phase execute: 72.459 ms (wall), 327.291 ms (cpu)
[   1.160] exec: foldAllM2 <<< 4, 1024, 12544 >>> 17.000 µs (wall), 16.000 µs (cpu), 37.856 µs (gpu)
3963692.4895572662
[   1.273] exec: scan <<< 4, 992, 24304 >>> 34.000 µs (wall), 32.000 µs (cpu), 28.340 ms (gpu)
[   1.357] exec: foldAllM1 <<< 4, 1024, 12544 >>> 18.000 µs (wall), 17.000 µs (cpu), 17.745 ms (gpu)
[   1.367] phase execute: 46.852 ms (wall), 206.667 ms (cpu)
3975233.0170230865
[   1.367] exec: foldAllM2 <<< 1, 1024, 12544 >>> 65.000 µs (wall), 183.000 µs (cpu), 30.176 µs (gpu)
[   1.367] exec: foldAllM2 <<< 4, 1024, 12544 >>> 22.000 µs (wall), 19.000 µs (cpu), 47.904 µs (gpu)
[   1.504] exec: scan <<< 4, 992, 24304 >>> 51.000 µs (wall), 49.000 µs (cpu), 28.916 ms (gpu)
[   1.577] exec: foldAllM1 <<< 4, 1024, 12544 >>> 24.000 µs (wall), 26.000 µs (cpu), 16.811 ms (gpu)
[   1.578] phase execute: 76.214 ms (wall), 210.035 ms (cpu)
[   1.578] exec: foldAllM2 <<< 1, 1024, 12544 >>> 34.000 µs (wall), 32.000 µs (cpu), 25.952 µs (gpu)
3975913.544488907
[   1.578] exec: foldAllM2 <<< 4, 1024, 12544 >>> 24.000 µs (wall), 122.000 µs (cpu), 48.128 µs (gpu)
[   1.716] exec: scan <<< 4, 992, 24304 >>> 32.000 µs (wall), 30.000 µs (cpu), 28.497 ms (gpu)
[   1.791] phase execute: 48.998 ms (wall), 212.365 ms (cpu)
[   1.791] exec: foldAllM2 <<< 1, 1024, 12544 >>> 29.000 µs (wall), 27.000 µs (cpu), 8.064 µs (gpu)
[   1.791] exec: foldAllM2 <<< 4, 1024, 12544 >>> 26.000 µs (wall), 64.000 µs (cpu), 28.608 µs (gpu)
[   1.791] exec: foldAllM1 <<< 4, 1024, 12544 >>> 18.000 µs (wall), 16.000 µs (cpu), 18.613 ms (gpu)
3968544.071954727
[   1.922] exec: scan <<< 4, 992, 24304 >>> 31.000 µs (wall), 30.000 µs (cpu), 29.025 ms (gpu)
[   1.976] exec: foldAllM1 <<< 4, 1024, 12544 >>> 18.000 µs (wall), 17.000 µs (cpu), 15.804 ms (gpu)
[   1.995] phase execute: 45.632 ms (wall), 202.783 ms (cpu)
3970245.5994205475
[   1.995] exec: foldAllM2 <<< 1, 1024, 12544 >>> 26.000 µs (wall), 24.000 µs (cpu), 9.856 µs (gpu)
[   1.994] exec: foldAllM2 <<< 4, 1024, 12544 >>> 35.000 µs (wall), 188.000 µs (cpu), 49.632 µs (gpu)
[   2.127] exec: scan <<< 4, 992, 24304 >>> 29.000 µs (wall), 28.000 µs (cpu), 27.093 ms (gpu)
[   2.199] phase execute: 45.236 ms (wall), 203.632 ms (cpu)
[   2.199] exec: foldAllM2 <<< 4, 1024, 12544 >>> 28.000 µs (wall), 27.000 µs (cpu), 25.728 µs (gpu)
[   2.199] exec: foldAllM2 <<< 1, 1024, 12544 >>> 25.000 µs (wall), 23.000 µs (cpu), 8.576 µs (gpu)
[   2.194] exec: foldAllM1 <<< 4, 1024, 12544 >>> 20.000 µs (wall), 20.000 µs (cpu), 16.369 ms (gpu)
3970222.126886368
[   2.311] exec: scan <<< 4, 992, 24304 >>> 32.000 µs (wall), 32.000 µs (cpu), 25.716 ms (gpu)
[   2.389] phase execute: 42.420 ms (wall), 189.125 ms (cpu)
[   2.389] exec: foldAllM2 <<< 1, 1024, 12544 >>> 31.000 µs (wall), 28.000 µs (cpu), 25.408 µs (gpu)
[   2.388] exec: foldAllM2 <<< 4, 1024, 12544 >>> 23.000 µs (wall), 22.000 µs (cpu), 46.816 µs (gpu)
3968068.654352188
[   2.380] exec: foldAllM1 <<< 4, 1024, 12544 >>> 25.000 µs (wall), 24.000 µs (cpu), 15.523 ms (gpu)
[   2.513] exec: scan <<< 4, 992, 24304 >>> 30.000 µs (wall), 29.000 µs (cpu), 26.172 ms (gpu)
[   2.581] exec: foldAllM1 <<< 4, 1024, 12544 >>> 25.000 µs (wall), 24.000 µs (cpu), 15.490 ms (gpu)
[   2.586] phase execute: 43.491 ms (wall), 197.177 ms (cpu)
[   2.587] exec: foldAllM2 <<< 1, 1024, 12544 >>> 25.000 µs (wall), 24.000 µs (cpu), 7.616 µs (gpu)
[   2.586] exec: foldAllM2 <<< 4, 1024, 12544 >>> 21.000 µs (wall), 20.000 µs (cpu), 47.968 µs (gpu)
3975511.1818180084
[   2.707] exec: scan <<< 4, 992, 24304 >>> 44.000 µs (wall), 43.000 µs (cpu), 26.011 ms (gpu)
[   2.761] exec: foldAllM1 <<< 4, 1024, 12544 >>> 18.000 µs (wall), 17.000 µs (cpu), 14.546 ms (gpu)
[   2.772] phase execute: 41.112 ms (wall), 184.758 ms (cpu)
3973543.7092838287
[   2.772] exec: foldAllM2 <<< 1, 1024, 12544 >>> 26.000 µs (wall), 24.000 µs (cpu), 29.792 µs (gpu)
[   2.772] exec: foldAllM2 <<< 4, 1024, 12544 >>> 29.000 µs (wall), 28.000 µs (cpu), 43.136 µs (gpu)

or...

$ time ./issue376-cpu +RTS -N1 -RTS
...
        1.03 real         0.86 user         0.15 sys

$ time ./issue376-cpu +RTS -N8 -RTS
...
        0.50 real         1.60 user         0.31 sys

$ time ./issue376-ptx +RTS -N8 -RTS
...
        1.02 real         0.88 user         0.19 sys

Hope that clears up a bit of what is going on.

idontgetoutmuch commented 7 years ago

Thanks very much for such a comprehensive reply. Maybe it's because I've just woken up, but the PTX version still takes longer than the CPU version with -N8, and is about the same as the CPU version with -N1. I was hoping for a 100x speedup.

Am I right in understanding that a GPU core is roughly 100x less powerful than a CPU core, so that even though I have 384 cores that's not going to give an overall improvement? So if I want a speedup, I need something with far more GPU cores?

I can rent an Amazon machine with 36 CPU cores (not sure how they compare to my MacBook's cores, but roughly equivalent), and without going to e.g. Wuxi or Oak Ridge I am unlikely to be able to access more. I wonder what GPU device I would need to beat 36 CPU cores? And how much would it cost?

tmcdonell commented 7 years ago

It depends on your application.

GPUs have lots of memory bandwidth available, which CPUs can only match when the working set is small and fits in cache. This is what we're seeing here. Once the data grows a bit larger than cache, performance on the CPU will drop significantly. This is one reason GPUs are particularly suited to applications where there is lots of data to process (graphics, image processing, machine learning...).

There is also the question of price/performance ratio. An E7-8867 v4 (18 cores) has an RRP of $4600, whereas the Titan Xp is $1200. Each of these has its own strengths and weaknesses.

So yeah, your MBP may have a CPU and GPU of roughly equivalent power (in certain situations), but a different way to look at it is that you can now use both the CPU and the GPU at the same time, with the same Accelerate code.

tmcdonell commented 7 years ago

FWIW, Intel ran into essentially the same problem you have now with the Xeon Phi. Anyway, I guess the question is answered now, so closing this. Feel free to reopen or start a new ticket if you still have questions.

idontgetoutmuch commented 7 years ago

Thanks very much for all your help. I don't follow what you mean by Intel having the same problem with the Xeon Phi.

At the moment I'd like the full paths, not just the sum of the observations along each path (the sum was only there to avoid printing lots of output while still forcing the computation). Maybe scanl is killing the parallelism? I'll carry on playing and see what I can do.
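For reference, getting the full paths back should just be a matter of returning the scan result itself rather than reducing it on the device; a minimal sketch, reusing rngsFull from the run1 version above (the wrapper name paths is made up here):

-- Sketch only: return the whole matrix of paths instead of summing it.
-- 'rngsFull' is the definition from the run1 version above; each row comes
-- back with 10001 entries because scanl also includes the initial element.
paths :: Int -> Array DIM2 (Int, Double)
paths = GPU.run1 rngsFull . singleton
  where
    singleton x = fromList Z [x]

As for scanl: the multidimensional scanl is applied independently along each innermost row, so the 384 paths should remain independent pieces of work; the sequential dependency is only within a single path.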