composewell / streamly

High performance, concurrent functional programming abstractions
https://streamly.composewell.com
Other
856 stars 64 forks source link

Bad performance on `Array a` and `MutArray a` while using as `Ptr` #2589

Open TheKK opened 11 months ago

TheKK commented 11 months ago

Since I observed write performance issue in my program, I found that it was caused by Array Word8 accidentally.

Here's the benchmark that could reproduce it. https://gist.github.com/TheKK/251fc1fe24600165ce1b2db922f1ac2d

I tried implementing asUnsafePtr without MonadIO m constraint as a reference. Below are the results from my laptop:

± cabal run -O2 bench:cdc -- +RTS -s -T -RTS --regress cycles:iters --regress mutatorCpuSeconds:iters --regress gcCpuSeconds:iters --regress allocated:iters
Up to date
benchmarking write/Array
time                 260.8 μs   (255.9 μs .. 266.8 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 257.1 μs   (255.1 μs .. 259.5 μs)
std dev              7.116 μs   (5.600 μs .. 10.51 μs)
cycles:              0.998 R²   (0.997 R² .. 0.999 R²)
  iters              572419.159 (562076.184 .. 585264.385)
  y                  -1883493.983 (-3370725.280 .. -642113.082)
mutatorCpuSeconds:   0.998 R²   (0.997 R² .. 0.999 R²)
  iters              2.571e-4   (2.521e-4 .. 2.635e-4)
  y                  -7.855e-4  (-1.474e-3 .. -1.975e-4)
gcCpuSeconds:        0.991 R²   (0.984 R² .. 0.995 R²)
  iters              9.898e-6   (9.618e-6 .. 1.014e-5)
  y                  -3.252e-5  (-7.490e-5 .. 1.191e-5)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              693884.116 (693882.854 .. 693885.515)
  y                  2282.946   (1758.990 .. 2816.548)
variance introduced by outliers: 21% (moderately inflated)

benchmarking write/ByteString
time                 146.7 μs   (142.4 μs .. 152.9 μs)
                     0.992 R²   (0.988 R² .. 0.998 R²)
mean                 146.1 μs   (143.6 μs .. 149.5 μs)
std dev              9.761 μs   (7.391 μs .. 12.28 μs)
cycles:              0.992 R²   (0.988 R² .. 0.998 R²)
  iters              321951.335 (312722.537 .. 334991.437)
  y                  -1406715.340 (-3578381.007 .. 136008.990)
mutatorCpuSeconds:   0.993 R²   (0.989 R² .. 0.998 R²)
  iters              1.457e-4   (1.417e-4 .. 1.517e-4)
  y                  -5.871e-4  (-1.524e-3 .. 7.918e-5)
gcCpuSeconds:        0.910 R²   (0.880 R² .. 0.945 R²)
  iters              4.400e-6   (3.879e-6 .. 5.019e-6)
  y                  -6.004e-5  (-1.586e-4 .. 2.950e-5)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              373790.274 (373789.508 .. 373791.086)
  y                  2574.756   (2152.690 .. 2998.360)
variance introduced by outliers: 65% (severely inflated)

benchmarking asPtr/Array
time                 118.8 ns   (116.9 ns .. 120.8 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 118.0 ns   (116.6 ns .. 119.2 ns)
std dev              4.218 ns   (3.329 ns .. 5.294 ns)
cycles:              0.999 R²   (0.998 R² .. 0.999 R²)
  iters              260.828    (256.445 .. 265.217)
  y                  -225886.094 (-556091.679 .. 75566.247)
mutatorCpuSeconds:   0.999 R²   (0.997 R² .. 0.999 R²)
  iters              1.184e-7   (1.164e-7 .. 1.203e-7)
  y                  -8.632e-5  (-2.252e-4 .. 5.290e-5)
gcCpuSeconds:        0.991 R²   (0.986 R² .. 0.997 R²)
  iters              6.501e-10  (6.253e-10 .. 6.798e-10)
  y                  -1.155e-6  (-3.425e-6 .. 8.469e-7)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              312.000    (312.000 .. 312.001)
  y                  2983.914   (2796.086 .. 3182.479)
variance introduced by outliers: 55% (severely inflated)

benchmarking asPtr/Array'
time                 10.91 ns   (10.83 ns .. 10.98 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 10.85 ns   (10.72 ns .. 10.94 ns)
std dev              346.9 ps   (245.6 ps .. 444.4 ps)
cycles:              1.000 R²   (0.999 R² .. 1.000 R²)
  iters              23.951     (23.792 .. 24.105)
  y                  -94395.670 (-226486.084 .. 24880.470)
mutatorCpuSeconds:   1.000 R²   (0.999 R² .. 1.000 R²)
  iters              1.090e-8   (1.082e-8 .. 1.097e-8)
  y                  -3.215e-5  (-8.781e-5 .. 2.575e-5)
gcCpuSeconds:        0.993 R²   (0.989 R² .. 0.997 R²)
  iters              3.418e-11  (3.331e-11 .. 3.534e-11)
  y                  -5.555e-7  (-1.316e-6 .. 1.362e-7)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              16.000     (16.000 .. 16.000)
  y                  2989.826   (2810.229 .. 3178.715)
variance introduced by outliers: 53% (severely inflated)

benchmarking asPtr/ByteString
time                 10.16 ns   (10.07 ns .. 10.26 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 10.10 ns   (10.01 ns .. 10.21 ns)
std dev              330.1 ps   (251.7 ps .. 417.1 ps)
cycles:              0.999 R²   (0.998 R² .. 0.999 R²)
  iters              22.290     (22.099 .. 22.558)
  y                  65680.770  (-130834.465 .. 281810.287)
mutatorCpuSeconds:   0.999 R²   (0.998 R² .. 0.999 R²)
  iters              1.016e-8   (1.007e-8 .. 1.028e-8)
  y                  3.956e-5   (-5.135e-5 .. 1.400e-4)
gcCpuSeconds:        NaN R²     (NaN R² .. NaN R²)
  iters              0.000      (0.000 .. 0.000)
  y                  0.000      (0.000 .. 0.000)
allocated:           0.000 R²   (0.000 R² .. 0.017 R²)
  iters              3.631e-7   (-4.365e-5 .. 4.944e-5)
  y                  2991.437   (2807.870 .. 3177.496)
variance introduced by outliers: 55% (severely inflated)

Marking arr_asPtrUnsafe and ma_asPtrUnsafe as NOINLINE makes benchmark, "asPtr/Array", drop to 15 ns. Still pretty fast comparing to "asPtr/Array" which is 118 ns.

By looking at the CORE, it's obvious that Array.asPtrUnsafe was called by worker wrapper which means one level of indirection.

Main.$wf1
  = \ (ww_sa1F
         :: ghc-prim:GHC.Prim.MutableByteArray# ghc-prim:GHC.Prim.RealWorld)
      (ww1_sa1G :: ghc-prim:GHC.Prim.Int#) ->
      Streamly.Internal.Data.Array.Mut.Type.$wasPtrUnsafe     <== THIS
        @IO
        @Word8
        @()
        Control.Monad.IO.Class.$fMonadIOIO                       <== THIS
        ww_sa1F
        ww1_sa1G
        (Main.main25
         `cast` (<Ptr Word8>_R
                 %<'Many>_N ->_R Sym (ghc-prim:GHC.Types.N:IO[0] <()>_R)
                 :: (Ptr Word8
                     -> ghc-prim:GHC.Prim.State# ghc-prim:GHC.Prim.RealWorld
                     -> (# ghc-prim:GHC.Prim.State# ghc-prim:GHC.Prim.RealWorld, () #))
                    ~R# (Ptr Word8 -> IO ())))

So this should be a real issue that affect all operations of Array relates to its Ptr interface.

harendra-kumar commented 11 months ago

Thanks @TheKK for looking into this deeply.

I am wondering if this gets resolved if we write asPtrUnsafe in MutArray/Type.hs as follows:

asPtrUnsafe :: MonadIO m => MutArray a -> (Ptr a -> IO b) -> m b
asPtrUnsafe arr f = liftIO $ do
  contents <- Unboxed.pin $ arrContents arr
  let !ptr = Ptr (byteArrayContents#
                     (unsafeCoerce# (getMutableByteArray# contents)))

  r <- f (ptr `plusPtr` arrStart arr)
  touch contents
  return r             

Basically, just change the function f type to use IO and move liftIO at the top level of this function. If that works, it is a minimal change. Or maybe we can just return IO and remove the MonadIO constraint entirely. Because there not much point using liftIO in this API now, if the user wants they can use it themsleves.

We can also add a benchmark for this.

TheKK commented 11 months ago

The solution you suggested sound good to me. I think we could keep MonadIO constraint there so the interface of asPtrUnsafe is intact even though it's pre-release API.

harendra-kumar commented 11 months ago

I would appreciate if you can find time to create a PR about this.

TheKK commented 11 months ago

Sure!

harendra-kumar commented 1 month ago

We can add new APIs unsafeAsPtrIO to use IO monad directly, as we already have unsafeAsPtr.