fpco / inline-c


Using threaded runtime destroys performance of C calls #115

Open · l29ah opened this issue 4 years ago

l29ah commented 4 years ago
% ghc -O2 -threaded --make inline-c-crit.hs && ./inline-c-crit +RTS -N
Linking inline-c-crit ...
benchmarking haskell +
time                 7.679 ns   (7.570 ns .. 7.774 ns)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 7.594 ns   (7.495 ns .. 7.733 ns)
std dev              397.9 ps   (300.3 ps .. 528.6 ps)
variance introduced by outliers: 76% (severely inflated)

benchmarking c +
time                 182.9 ns   (182.0 ns .. 184.6 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 182.7 ns   (182.1 ns .. 183.8 ns)
std dev              2.572 ns   (1.813 ns .. 3.817 ns)
variance introduced by outliers: 15% (moderately inflated)

benchmarking cu +
time                 78.18 ns   (76.60 ns .. 79.59 ns)
                     0.998 R²   (0.998 R² .. 1.000 R²)
mean                 77.06 ns   (76.43 ns .. 78.05 ns)
std dev              2.792 ns   (1.559 ns .. 4.588 ns)
variance introduced by outliers: 56% (severely inflated)

benchmarking c block +
time                 195.8 ns   (188.0 ns .. 203.0 ns)
                     0.993 R²   (0.990 R² .. 0.999 R²)
mean                 189.3 ns   (186.6 ns .. 193.3 ns)
std dev              11.25 ns   (7.601 ns .. 14.98 ns)
variance introduced by outliers: 76% (severely inflated)

benchmarking cu block +
time                 76.09 ns   (75.35 ns .. 76.99 ns)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 78.40 ns   (76.93 ns .. 81.21 ns)
std dev              6.560 ns   (2.949 ns .. 10.09 ns)
variance introduced by outliers: 88% (severely inflated)

A 76 ns call overhead on a modern 3.5 GHz i7 is just insane: 76 ns × 3.5 GHz ≈ 266 cycles per call. Disabling -N reduces it considerably (though it's still slower than the Haskell version), but then I can't meaningfully use threads in the application.

Code:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE QuasiQuotes #-}
import Criterion.Main
import qualified Language.C.Inline as C
import qualified Language.C.Inline.Unsafe as CU
import Data.Word
import System.IO.Unsafe

C.include "<stdint.h>"

type Fun = Word32 -> Word32 -> Word32

fun :: Fun
fun = (+)

cfun :: Fun
cfun x y = [C.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]

cufun :: Fun
cufun x y = [CU.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]

cblockfun :: Fun
cblockfun x y = unsafePerformIO [C.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

cublockfun :: Fun
cublockfun x y = unsafePerformIO [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

main = defaultMain
    [ bgroup ""
        [ bench "haskell +" $ nf (fun 1) 1
        , bench "c +" $ nf (cfun 1) 1
        , bench "cu +" $ nf (cufun 1) 1
        , bench "c block +" $ nf (cblockfun 1) 1
        , bench "cu block +" $ nf (cublockfun 1) 1
        ]
    ]
Ofenhed commented 4 years ago

Interesting tests. It seems to be partly related to lazy evaluation. I expanded your test a bit with variants that don't use lazy evaluation:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE QuasiQuotes #-}
import Criterion.Main
import qualified Language.C.Inline as C
import qualified Language.C.Inline.Unsafe as CU
import Data.Word
import System.IO.Unsafe

C.include "<stdint.h>"

type Fun = Word32 -> Word32 -> Word32
type IOFun = Word32 -> Word32 -> IO Word32

fun :: Fun
fun = (+)

cfun :: Fun
cfun x y = [C.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]

cufun :: Fun
cufun x y = [CU.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]

cblockfun :: Fun
cblockfun x y = unsafePerformIO [C.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

cublockfun :: Fun
cublockfun x y = unsafePerformIO [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

cblockiofun :: IOFun
cblockiofun x y = [C.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

cublockiofun :: IOFun
cublockiofun x y = [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]

main = defaultMain
    [ bgroup ""
        [ bench "haskell +" $ nf (fun 1) 1
        , bench "c +" $ nf (cfun 1) 1
        , bench "cu +" $ nf (cufun 1) 1
        , bench "unsafe c block +" $ nf (cblockfun 1) 1
        , bench "unsafe cu block +" $ nf (cublockfun 1) 1
        , bench "c IO block +" $ nfIO (cblockiofun 1 1)
        , bench "cu IO block +" $ nfIO (cublockiofun 1 1)
        ]
    ]

I get roughly the same kind of results as you (though not as bad for the unsafe functions, on an old i7-2620M), but removing the unsafePerformIO gives me Haskell-like performance for the CU.block.

benchmarking haskell +
time                 16.57 ns   (16.49 ns .. 16.68 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 16.73 ns   (16.61 ns .. 17.01 ns)
std dev              554.8 ps   (303.3 ps .. 995.0 ps)
variance introduced by outliers: 54% (severely inflated)

benchmarking c +
time                 395.9 ns   (391.7 ns .. 401.5 ns)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 399.1 ns   (395.1 ns .. 406.4 ns)
std dev              18.28 ns   (11.36 ns .. 31.15 ns)
variance introduced by outliers: 64% (severely inflated)

benchmarking cu +
time                 27.76 ns   (26.76 ns .. 29.92 ns)
                     0.979 R²   (0.948 R² .. 1.000 R²)
mean                 27.48 ns   (26.99 ns .. 28.96 ns)
std dev              2.789 ns   (607.3 ps .. 5.285 ns)
variance introduced by outliers: 92% (severely inflated)

benchmarking unsafe c block +
time                 396.2 ns   (393.8 ns .. 399.7 ns)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 395.8 ns   (394.1 ns .. 398.1 ns)
std dev              6.578 ns   (5.015 ns .. 8.649 ns)
variance introduced by outliers: 19% (moderately inflated)

benchmarking unsafe cu block +
time                 26.70 ns   (26.61 ns .. 26.82 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 27.08 ns   (26.90 ns .. 27.31 ns)
std dev              712.1 ps   (563.6 ps .. 949.6 ps)
variance introduced by outliers: 42% (moderately inflated)

benchmarking c IO block +
time                 397.2 ns   (394.4 ns .. 401.5 ns)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 402.7 ns   (398.4 ns .. 414.5 ns)
std dev              22.17 ns   (10.67 ns .. 40.93 ns)
variance introduced by outliers: 72% (severely inflated)

benchmarking cu IO block +
time                 17.18 ns   (17.13 ns .. 17.23 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.29 ns   (17.22 ns .. 17.37 ns)
std dev              255.5 ps   (200.3 ps .. 341.7 ps)
variance introduced by outliers: 19% (moderately inflated)
l29ah commented 4 years ago

Thanks for your findings! With unsafeDupablePerformIO used instead of unsafePerformIO, the performance is much better (though still slower than Haskell). Maybe pure or its unsafe friend should use unsafeDupablePerformIO under the hood, then?
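
For reference, a minimal sketch of the kind of wrapper being suggested, using the same quasiquoters as the benchmarks above (the name cuAdd is just for illustration, not part of inline-c):

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE QuasiQuotes #-}
import Data.Word
import System.IO.Unsafe (unsafeDupablePerformIO)
import qualified Language.C.Inline as C
import qualified Language.C.Inline.Unsafe as CU

C.include "<stdint.h>"

-- An unsafe inline-c block wrapped in unsafeDupablePerformIO, which skips
-- the duplication protection that unsafePerformIO adds and is cheaper.
cuAdd :: Word32 -> Word32 -> Word32
cuAdd x y = unsafeDupablePerformIO
    [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]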

bitonic commented 3 years ago

@l29ah I've merged #117; it's now released as 0.9.1.1 (thanks @Ofenhed!). Do you think we can close this?

Ofenhed commented 3 years ago

It improves the performance of pure slightly, but the overhead for safe functions is still very high. That might just be the inherent cost of safe calls, I don't know, but I wouldn't say this question is put to rest yet.
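
One way to check whether this is just the baseline cost of GHC's safe FFI calls (independent of inline-c) would be to benchmark plain safe and unsafe foreign imports against the numbers above. A rough sketch, assuming a separately compiled C file providing c_add (hypothetical, not from this thread):

{-# LANGUAGE ForeignFunctionInterface #-}
import Criterion.Main
import Data.Word

-- Assumes a C definition like: uint32_t c_add(uint32_t x, uint32_t y) { return x + y; }
foreign import ccall safe   "c_add" cAddSafe   :: Word32 -> Word32 -> IO Word32
foreign import ccall unsafe "c_add" cAddUnsafe :: Word32 -> Word32 -> IO Word32

main :: IO ()
main = defaultMain
    [ bench "safe foreign import +"   $ nfIO (cAddSafe 1 1)
    , bench "unsafe foreign import +" $ nfIO (cAddUnsafe 1 1)
    ]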

bitonic commented 3 years ago

OK, let's leave it open then -- I don't think I will have time to work on this but contributions are very welcome.