Open l29ah opened 4 years ago
Interesting tests. It seems to be partly related to lazy evaluation. I expanded your test a bit with tests that don't use lazy evaluation:
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE QuasiQuotes #-}
import Criterion.Main
import qualified Language.C.Inline as C
import qualified Language.C.Inline.Unsafe as CU
import Data.Word
import System.IO.Unsafe
C.include "<stdint.h>"
type Fun = Word32 -> Word32 -> Word32
type IOFun = Word32 -> Word32 -> IO Word32
fun :: Fun
fun = (+)
cfun :: Fun
cfun x y = [C.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]
cufun :: Fun
cufun x y = [CU.pure| uint32_t { $(uint32_t x) + $(uint32_t y) }|]
cblockfun :: Fun
cblockfun x y = unsafePerformIO [C.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]
cublockfun :: Fun
cublockfun x y = unsafePerformIO [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]
cblockiofun :: IOFun
cblockiofun x y = [C.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]
cublockiofun :: IOFun
cublockiofun x y = [CU.block| uint32_t { return $(uint32_t x) + $(uint32_t y); }|]
main = defaultMain
  [ bgroup ""
    [ bench "haskell +" $ nf (fun 1) 1
    , bench "c +" $ nf (cfun 1) 1
    , bench "cu +" $ nf (cufun 1) 1
    , bench "unsafe c block +" $ nf (cblockfun 1) 1
    , bench "unsafe cu block +" $ nf (cublockfun 1) 1
    , bench "c IO block +" $ nfIO (cblockiofun 1 1)
    , bench "cu IO block +" $ nfIO (cublockiofun 1 1)
    ]
  ]
I get roughly the same pattern of results as you (though not as bad for the unsafe functions, on an old i7-2620M), but removing the unsafePerformIO gives me Haskell-like performance for CU.block.
benchmarking haskell +
time 16.57 ns (16.49 ns .. 16.68 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 16.73 ns (16.61 ns .. 17.01 ns)
std dev 554.8 ps (303.3 ps .. 995.0 ps)
variance introduced by outliers: 54% (severely inflated)
benchmarking c +
time 395.9 ns (391.7 ns .. 401.5 ns)
0.998 R² (0.996 R² .. 1.000 R²)
mean 399.1 ns (395.1 ns .. 406.4 ns)
std dev 18.28 ns (11.36 ns .. 31.15 ns)
variance introduced by outliers: 64% (severely inflated)
benchmarking cu +
time 27.76 ns (26.76 ns .. 29.92 ns)
0.979 R² (0.948 R² .. 1.000 R²)
mean 27.48 ns (26.99 ns .. 28.96 ns)
std dev 2.789 ns (607.3 ps .. 5.285 ns)
variance introduced by outliers: 92% (severely inflated)
benchmarking unsafe c block +
time 396.2 ns (393.8 ns .. 399.7 ns)
0.998 R² (0.997 R² .. 0.999 R²)
mean 395.8 ns (394.1 ns .. 398.1 ns)
std dev 6.578 ns (5.015 ns .. 8.649 ns)
variance introduced by outliers: 19% (moderately inflated)
benchmarking unsafe cu block +
time 26.70 ns (26.61 ns .. 26.82 ns)
1.000 R² (0.999 R² .. 1.000 R²)
mean 27.08 ns (26.90 ns .. 27.31 ns)
std dev 712.1 ps (563.6 ps .. 949.6 ps)
variance introduced by outliers: 42% (moderately inflated)
benchmarking c IO block +
time 397.2 ns (394.4 ns .. 401.5 ns)
0.998 R² (0.995 R² .. 1.000 R²)
mean 402.7 ns (398.4 ns .. 414.5 ns)
std dev 22.17 ns (10.67 ns .. 40.93 ns)
variance introduced by outliers: 72% (severely inflated)
benchmarking cu IO block +
time 17.18 ns (17.13 ns .. 17.23 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 17.29 ns (17.22 ns .. 17.37 ns)
std dev 255.5 ps (200.3 ps .. 341.7 ps)
variance introduced by outliers: 19% (moderately inflated)
Thanks for your findings! With unsafeDupablePerformIO used instead of unsafePerformIO, the performance is much better (though still slower than Haskell). Maybe pure or its unsafe friend should use unsafeDupablePerformIO under the hood then?
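For reference, here is a minimal sketch of the two wrappers being compared, written against the plain FFI instead of inline-c so it only needs base. The names c_abs_io, absLocked and absDupable are made up for illustration, and libc's abs stands in for the uint32_t addition from the benchmark. In base, unsafePerformIO runs noDuplicate before the action, while unsafeDupablePerformIO skips that check, so the action may occasionally run more than once if two threads force the same thunk. That is harmless for a side-effect-free call like this one.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CInt (..))
import System.IO.Unsafe (unsafeDupablePerformIO, unsafePerformIO)

-- Hypothetical binding: libc's abs imported as an IO action, standing in
-- for the IO action that a [CU.block| ... |] quasi-quote produces.
foreign import ccall unsafe "stdlib.h abs"
  c_abs_io :: CInt -> IO CInt

-- Wrapper using unsafePerformIO, which pays for noDuplicate on every call.
absLocked :: CInt -> CInt
absLocked x = unsafePerformIO (c_abs_io x)

-- Wrapper using unsafeDupablePerformIO, which skips the duplication check.
absDupable :: CInt -> CInt
absDupable x = unsafeDupablePerformIO (c_abs_io x)

main :: IO ()
main = print (absLocked (-3), absDupable (-3))
```

Both wrappers return the same value; only the per-call bookkeeping differs, which is why the choice shows up in a benchmark of a function this small.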
@l29ah I've merged #117, it is now released as 0.9.1.1 (thanks @Ofenhed !). Do you think we can close this?
It increases the performance of pure slightly, but the overhead of safe functions is still very high. That might be the cost of safe calls, I don't know, but I wouldn't say that this question is put to rest yet.
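To make the safe/unsafe distinction concrete: as far as I understand it, the Language.C.Inline quasi-quoters generate safe foreign calls and the Language.C.Inline.Unsafe ones generate unsafe calls. The sketch below shows the same difference with plain FFI imports. The names absSafe and absUnsafe are hypothetical, and libc's abs is used only because it is always available. A safe call lets the C code call back into Haskell and lets other Haskell threads keep running, at the price of extra work in the runtime around each call. An unsafe call skips that work, so it must not call back into Haskell and should return quickly.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CInt (..))

-- Safe import: the runtime does extra bookkeeping around the call.
foreign import ccall safe "stdlib.h abs"
  absSafe :: CInt -> CInt

-- Unsafe import: close to a plain C function call, but the C code must
-- not call back into Haskell.
foreign import ccall unsafe "stdlib.h abs"
  absUnsafe :: CInt -> CInt

main :: IO ()
main = print (absSafe (-7), absUnsafe (-7))
```

Swapping these two imports into the Criterion benchmark above would be one way to check how much of the gap between the C and CU variants comes from the safe call itself.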
OK, let's leave it open then -- I don't think I will have time to work on this but contributions are very welcome.
76 µs call overhead on a modern 3.5 GHz i7 is just insane. That's 266,000 cycles! Disabling -N makes it much less (but still more than Haskell), but then I can't meaningfully use threads in the application.
Code: