Mikolaj / horde-ad

Higher Order Reverse Derivatives Efficiently - Automatic Differentiation library based on the paper "Provably correct, asymptotically efficient, higher-order reverse-mode automatic differentiation"
BSD 3-Clause "New" or "Revised" License
33 stars 6 forks source link

[Blocker] Performance has broken down after the last two rewrites #20

Closed Mikolaj closed 1 year ago

Mikolaj commented 2 years ago

In short, while we couldn't monitor performance, because the code only worked with 9.2.2 (#13) and 9.2.2 (and HEAD) had (has) degraded and unpredictable performance (#14), our code got much slower. Neither of the three kinds of heap profiling (-h, -p, -evenlog) show much and bisecting git history to pinpoint the offending changes is impossible, because the code didn't work with good GHC versions.

From branch ghc-report-specialize to today's master (snapshot on branch issue-20) our BENCH2 test from #14 got twice slower and with three times more allocation on GHC 8.10.7 and 9.0.2, surpassing the previous slowdown in GHC 9.2.2, which got somewhat slower and more bloated still. Test BENCH1 slowdown, meanwhile, which was previously worked around with spurious SPECIALIZE pragmas, is not worked around on GHC 9.0.2 (only) any more, though in a less pronounced way, with only 4 times slowdown and 50 times higher allocation.

This time. HEAD can't be used for master branch, because the type-level helper plugins we use don't compile even with head.hackage.

I do hope all this gets fixed once #14 is fixed and, ideally, the fix gets backported to 9.2.3. But I wonder how to proceed till then. Any work on optimizations we discussed on meetings during last two weeks is pointless when they can't be measured. Any other work lengthens the trail of commits that need to be benchmarked once we have the situation under control, to verify their performance impact and undo or modify some design decisions (if the impact is significant) and then rebase any further work on that.

I tried some extra proxies and SPECIALIZE pragmas in https://github.com/Mikolaj/horde-ad/tree/extra-pragmas-BENCH2 but to no avail for BENCH1 and with limited success for BENCH2 (and no success at all in 9.2.2). How else one can work around lack of specialization (most likely due to worker-wrapper, but disabling worker-wrapper kills performance almost as badly, see #14)?

Edit: TODO after all unreliable performance issues are fixed: track down and remove any Proxy that was introduced to force specialization via pragma or, if any, via just supplying a concrete type.

awf commented 2 years ago

@simonpj and @goldfirere this may need your input?

Mikolaj commented 2 years ago

According to what we planned during our meeting today, I've started a two-fold investigation of the performance drop:

Some observations: the allocation is unchanged except for four functions that allocate all the extra heap space, the biggest three of which are easy to pinpoint, because they are the right hand sides of pattern matching in the case expression in the three last lines of this snippet:

https://github.com/Mikolaj/horde-ad/blob/b3753621399e9d0baaf87e7d1e23ac866c67577e/src/HordeAd/Internal/Delta.hs#L380-L385

Regarding the slowdown, it seems to just follow the blowup, because the same functions seem to be called the same number of times, just taking proportionally more time if they are entangled in the spurious allocations. E.g., here's the function corresponding to pattern Add0 in profiles before and after:

COST CENTRE                   MODULE                          SRC                                                          no.      entries  %time %alloc   %time %alloc (inherited)

buildVectors.eval0.\          HordeAd.Core.Delta              src/HordeAd/Core/Delta.hs:174:26-47                          1751   272510500    3.8    0.0     3.8    0.0

buildFinMaps.eval0.\          HordeAd.Internal.Delta          src/HordeAd/Internal/Delta.hs:385:21-42                      1837   272510500   12.3   22.6    27.7   38.1

Profiling remarks from 2 weeks ago in the middle section of https://github.com/Mikolaj/horde-ad/issues/14#issuecomment-1062250358 show the same pattern, so this is not caused by the recent overhaul of Dual class typing, but has been introduced at some point between four and two weeks ago. If that's closer to four weeks ago, it may be possible to bisect and pinpoint the offending commit (I think I only checked performance with 9.2.2 at that time, so I missed it), if closer to two weeks ago, bisecting wouldn't be possible, because we couldn't use 8.10.7 and 9.0.2 during this period and 9.2.2 was slow both before and after.

Mikolaj commented 2 years ago

I've looked at Core code obtained via {-# OPTIONS_GHC -ddump-stg #-} from file Delta.hs. Perhaps it's not the right thing to look at, because it's not yet specialized. Given that the core for the Add0 case looks the same, the slowdown probably comes from specialization or simplification when this function is called elsewhere and so there's no important difference in the core below. We'd need to obtain core code from where the function is used. Anyway, here's the first step.

After the code overhaul (when it became slow) it's (note ww85_s2eHQ)

...
        case ww6_s2eGr of {
        Internal.Numeric.C:Product ww75_s2eHG [Occ=Once1!]
                                   _ [Occ=Dead]
                                   _ [Occ=Dead]
                                   _ [Occ=Dead]
                                   _ [Occ=Dead]
                                   _ [Occ=Dead]
                                   _ [Occ=Dead] ->
        case ww75_s2eHG of {
        GHC.Num.C:Num ww83_s2eHO
                      _ [Occ=Dead]
                      ww85_s2eHQ
                      _ [Occ=Dead]
                      _ [Occ=Dead]
                      _ [Occ=Dead]
                      ww89_s2eHU ->
...
          eval0_s2eIa [Occ=LoopBreaker]
            :: r_aWpP
               -> HordeAd.Internal.Delta.Delta0 r_aWpP
               -> GHC.ST.STRep GHC.Prim.RealWorld ()
          [LclId, Arity=3, Str=<S,1*U><S,1*U><L,U>, Unf=OtherCon []] =
              \r [r1_s2gk5 eta6_s2gk6 eta7_s2gk7]
                  case r1_s2gk5 of r2_s2gk8 {
                  __DEFAULT ->
                  case eta6_s2gk6 of {
                    HordeAd.Internal.Delta.Zero0 -> (#,#) [eta7_s2gk7 GHC.Tuple.()];
                    HordeAd.Internal.Delta.Scale0 k_s2gka [Occ=Once1]
                                                  d_s2gkb [Occ=Once1] ->
                        case ww85_s2eHQ k_s2gka r2_s2gk8 of sat_s2gkc [Occ=Once1] {
                        __DEFAULT -> eval0_s2eIa sat_s2gkc d_s2gkb eta7_s2gk7;
                        };
                    HordeAd.Internal.Delta.Add0 d_s2gkd [Occ=Once1]
                                                e_s2gke [Occ=Once1] ->
                        case eval0_s2eIa r2_s2gk8 d_s2gkd eta7_s2gk7 of {
                        (#,#) ipv2_s2gkg [Occ=Once1] _ [Occ=Dead] ->
                        eval0_s2eIa r2_s2gk8 e_s2gke ipv2_s2gkg;
                        };

while before the code overhaul (when it was still fast) it's (note GHC.Num.* instead of ww85_s2eHQ; other than that it seems the same)

eval0_spgW [Occ=LoopBreaker]
  :: r_agec
     -> HordeAd.Core.Delta.DeltaScalar
          r_agec
     -> GHC.ST.STRep
          GHC.Prim.RealWorld
          ()
[LclId,
 Arity=3,
 Str=<S,1*U><S,1*U><L,U>,
 Unf=OtherCon []] =
    \r [r1_sptN
        eta5_sptO
        eta6_sptP]
        case
            r1_sptN
        of
        r2_sptQ
        {
        __DEFAULT ->
        case
            eta5_sptO
        of
        { HordeAd.Core.Delta.ZeroScalar ->
              (#,#) [eta6_sptP
                     GHC.Tuple.()];
          HordeAd.Core.Delta.ScaleScalar k_sptS [Occ=Once1]
                                         d_sptT [Occ=Once1] ->
              case
                  GHC.Num.*
                      $dNum1_spgf
                      k_sptS
                      r2_sptQ
              of
              sat_sptU [Occ=Once1]
              {
              __DEFAULT ->
              eval0_spgW
                  sat_sptU
                  d_sptT
                  eta6_sptP;
              };
          HordeAd.Core.Delta.AddScalar d_sptV [Occ=Once1]
                                       e_sptW [Occ=Once1] ->
              case
                  eval0_spgW
                      r2_sptQ
                      d_sptV
                      eta6_sptP
              of
              {
              (#,#) ipv8_sptY [Occ=Once1]
                    _ [Occ=Dead] ->
              eval0_spgW
                  r2_sptQ
                  e_sptW
                  ipv8_sptY;
              };

For completeness, below are all occurences of eval0.

Before (fast) it's

``` eval0_spgW [Occ=LoopBreaker] :: r_agec -> HordeAd.Core.Delta.DeltaScalar r_agec -> GHC.ST.STRep GHC.Prim.RealWorld () [LclId, Arity=3, Str=, Unf=OtherCon []] = \r [r1_sptN eta5_sptO eta6_sptP] case r1_sptN of r2_sptQ { __DEFAULT -> case eta5_sptO of { HordeAd.Core.Delta.ZeroScalar -> (#,#) [eta6_sptP GHC.Tuple.()]; HordeAd.Core.Delta.ScaleScalar k_sptS [Occ=Once1] d_sptT [Occ=Once1] -> case GHC.Num.* $dNum1_spgf k_sptS r2_sptQ of sat_sptU [Occ=Once1] { __DEFAULT -> eval0_spgW sat_sptU d_sptT eta6_sptP; }; HordeAd.Core.Delta.AddScalar d_sptV [Occ=Once1] e_sptW [Occ=Once1] -> case eval0_spgW r2_sptQ d_sptV eta6_sptP of { (#,#) ipv8_sptY [Occ=Once1] _ [Occ=Dead] -> eval0_spgW r2_sptQ e_sptW ipv8_sptY; }; ... eval0_ssP6 [Occ=LoopBreaker] :: r_agec -> HordeAd.Core.Delta.DeltaScalar r_agec -> GHC.ST.STRep GHC.Prim.RealWorld () [LclId, Arity=3, Str=, Unf=OtherCon []] = \r [r1_st1X eta5_st1Y eta6_st1Z] case r1_st1X of r2_st20 { __DEFAULT -> case eta5_st1Y of { HordeAd.Core.Delta.ZeroScalar -> (#,#) [eta6_st1Z GHC.Tuple.()]; HordeAd.Core.Delta.ScaleScalar k_st22 [Occ=Once1] d_st23 [Occ=Once1] -> case GHC.Num.* $dNum1_ssOp k_st22 r2_st20 of sat_st24 [Occ=Once1] { __DEFAULT -> eval0_ssP6 sat_st24 d_st23 eta6_st1Z; }; HordeAd.Core.Delta.AddScalar d_st25 [Occ=Once1] e_st26 [Occ=Once1] -> case eval0_ssP6 r2_st20 d_st25 eta6_st1Z of { (#,#) ipv8_st28 [Occ=Once1] _ [Occ=Dead] -> eval0_ssP6 r2_st20 e_st26 ipv8_st28; }; ```

while after (slow, with lots of SPECIALIZE to Double in order to eliminate the possibility that the BENCH1 bug is the culprit; perhaps we can ignore the last two of four eval0 versions that come from that) it's

``` eval0_s2eIa [Occ=LoopBreaker] :: r_aWpP -> HordeAd.Internal.Delta.Delta0 r_aWpP -> GHC.ST.STRep GHC.Prim.RealWorld () [LclId, Arity=3, Str=, Unf=OtherCon []] = \r [r1_s2gk5 eta6_s2gk6 eta7_s2gk7] case r1_s2gk5 of r2_s2gk8 { __DEFAULT -> case eta6_s2gk6 of { HordeAd.Internal.Delta.Zero0 -> (#,#) [eta7_s2gk7 GHC.Tuple.()]; HordeAd.Internal.Delta.Scale0 k_s2gka [Occ=Once1] d_s2gkb [Occ=Once1] -> case ww85_s2eHQ k_s2gka r2_s2gk8 of sat_s2gkc [Occ=Once1] { __DEFAULT -> eval0_s2eIa sat_s2gkc d_s2gkb eta7_s2gk7; }; HordeAd.Internal.Delta.Add0 d_s2gkd [Occ=Once1] e_s2gke [Occ=Once1] -> case eval0_s2eIa r2_s2gk8 d_s2gkd eta7_s2gk7 of { (#,#) ipv2_s2gkg [Occ=Once1] _ [Occ=Dead] -> eval0_s2eIa r2_s2gk8 e_s2gke ipv2_s2gkg; }; ... eval0_s2FR3 [Occ=LoopBreaker] :: r_aWpP -> HordeAd.Internal.Delta.Delta0 r_aWpP -> GHC.ST.STRep GHC.Prim.RealWorld () [LclId, Arity=3, Str=, Unf=OtherCon []] = \r [r1_s2HsY eta6_s2HsZ eta7_s2Ht0] case r1_s2HsY of r2_s2Ht1 { __DEFAULT -> case eta6_s2HsZ of { HordeAd.Internal.Delta.Zero0 -> (#,#) [eta7_s2Ht0 GHC.Tuple.()]; HordeAd.Internal.Delta.Scale0 k_s2Ht3 [Occ=Once1] d_s2Ht4 [Occ=Once1] -> case ww85_s2FQJ k_s2Ht3 r2_s2Ht1 of sat_s2Ht5 [Occ=Once1] { __DEFAULT -> eval0_s2FR3 sat_s2Ht5 d_s2Ht4 eta7_s2Ht0; }; HordeAd.Internal.Delta.Add0 d_s2Ht6 [Occ=Once1] e_s2Ht7 [Occ=Once1] -> case eval0_s2FR3 r2_s2Ht1 d_s2Ht6 eta7_s2Ht0 of { (#,#) ipv2_s2Ht9 [Occ=Once1] _ [Occ=Dead] -> eval0_s2FR3 r2_s2Ht1 e_s2Ht7 ipv2_s2Ht9; }; ... $weval0_s2cLc [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker] :: GHC.Prim.Double# -> HordeAd.Internal.Delta.Delta0 GHC.Types.Double -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #) [LclId, Arity=3, Str=, Unf=OtherCon []] = \r [ww9_s2ea6 w9_s2ea7 w10_s2ea8] case w9_s2ea7 of { HordeAd.Internal.Delta.Zero0 -> (#,#) [w10_s2ea8 GHC.Tuple.()]; HordeAd.Internal.Delta.Scale0 k_s2eaa [Occ=Once1!] d_s2eab [Occ=Once1] -> case k_s2eaa of { GHC.Types.D# x2_s2ead [Occ=Once1] -> case *## [x2_s2ead ww9_s2ea6] of sat_s2eae [Occ=Once1] { __DEFAULT -> $weval0_s2cLc sat_s2eae d_s2eab w10_s2ea8; }; }; HordeAd.Internal.Delta.Add0 d_s2eaf [Occ=Once1] e_s2eag [Occ=Once1] -> case $weval0_s2cLc ww9_s2ea6 d_s2eaf w10_s2ea8 of { (#,#) ipv9_s2eai [Occ=Once1] _ [Occ=Dead] -> $weval0_s2cLc ww9_s2ea6 e_s2eag ipv9_s2eai; }; ... let { Rec { $weval0_s2DU5 [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker] :: GHC.Prim.Double# -> HordeAd.Internal.Delta.Delta0 GHC.Types.Double -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #) [LclId, Arity=3, Str=, Unf=OtherCon []] = \r [ww9_s2FiZ w9_s2Fj0 w10_s2Fj1] case w9_s2Fj0 of { HordeAd.Internal.Delta.Zero0 -> (#,#) [w10_s2Fj1 GHC.Tuple.()]; HordeAd.Internal.Delta.Scale0 k_s2Fj3 [Occ=Once1!] d_s2Fj4 [Occ=Once1] -> case k_s2Fj3 of { GHC.Types.D# x2_s2Fj6 [Occ=Once1] -> case *## [x2_s2Fj6 ww9_s2FiZ] of sat_s2Fj7 [Occ=Once1] { __DEFAULT -> $weval0_s2DU5 sat_s2Fj7 d_s2Fj4 w10_s2Fj1; }; }; HordeAd.Internal.Delta.Add0 d_s2Fj8 [Occ=Once1] e_s2Fj9 [Occ=Once1] -> case $weval0_s2DU5 ww9_s2FiZ d_s2Fj8 w10_s2Fj1 of { (#,#) ipv9_s2Fjb [Occ=Once1] _ [Occ=Dead] -> $weval0_s2DU5 ww9_s2FiZ e_s2Fj9 ipv9_s2Fjb; }; ```
Mikolaj commented 2 years ago

I'm now pausing the investigation until I get any feedback or the fix to #14 is available for testing. I'm sooo perplexed. The code that specializes is slow, the one that (erroneously) doesn't, is fast. Perhaps when #14 is fixed, both will be specialized (without much changed performance) and so it won't look so weird.

I've looked at Core code output of {-# OPTIONS_GHC -ddump-stg #-} in HordeAd.Core.Optimizer where eval0 is used and the code is specialized to Double in the after (slow) variant (probably because I've added the SPECIALIZE pragmas). It looks legit, with Double unboxing, etc. But it's slow.

$weval0_sLUO [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
  :: GHC.Prim.Double#
     -> HordeAd.Internal.Delta.Delta0
          (HordeAd.Core.DualClass.Primal
             (HordeAd.Internal.Delta.Delta0 GHC.Types.Double))
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
[LclId, Arity=3, Str=<L,U><S,1*U><L,U>, Unf=OtherCon []] =
    \r [ww5_sLUU w5_sLUV w6_sLUW]
        case w5_sLUV of {
          HordeAd.Internal.Delta.Zero0 -> (#,#) [w6_sLUW GHC.Tuple.()];
          HordeAd.Internal.Delta.Scale0 k_sLUY [Occ=Once1!]
                                        d_sLUZ [Occ=Once1] ->
              case k_sLUY of {
              GHC.Types.D# x_sLV1 [Occ=Once1] ->
              case *## [x_sLV1 ww5_sLUU] of sat_sLV2 [Occ=Once1] {
              __DEFAULT -> $weval0_sLUO sat_sLV2 d_sLUZ w6_sLUW;
              };
              };
          HordeAd.Internal.Delta.Add0 d_sLV3 [Occ=Once1]
                                      e_sLV4 [Occ=Once1] ->
              case $weval0_sLUO ww5_sLUU d_sLV3 w6_sLUW of {
              (#,#) ipv2_sLV6 [Occ=Once1] _ [Occ=Dead] ->
              $weval0_sLUO ww5_sLUU e_sLV4 ipv2_sLV6;
              };

The Core code output in the before (fast) version has no specialization (which is a bug), even though the required unfolding seems present in the .hi file. But it's fast.

case
    HordeAd.Core.Delta.evalBindings
        $dEq_s1n89
        $dNumeric_s1n8k
        $dNum_s1n8i
        sat_s1n9b
        sat_s1n9g
        sat_s1n9l
        st_s1n93
        d_s1n96
of

Given that, the actual core used for the before (fast) variant is probably the same as in the comment above, which seems very similar, except for the Double specialization and so unboxing. Two things stand out, but I have no idea if they matter: in the code from previous comment, there is Str=<S,1*U><S,1*U><L,U>, while in the specialized code we have Str=<L,U><S,1*U><L,U> and there is a silly type in the specialized code (HordeAd.Core.DualClass.Primal (HordeAd.Internal.Delta.Delta0 GHC.Types.Double)), which should just be Double (but compiler knows that and unboxes as expected).

Edit: but why is the Double boxed inside the Scale0 constructor? It should be unpacked! Isn't the datatype specialized to Double? Is parameterized datatype not a zero-cost abstraction? Would UNPACK pragma help? [edit: I now remember this is an old and known problem and, e.g., https://hackage.haskell.org/package/unpacked-containers works around it; I've opened #23 for that.]

Edit2: adding Strict to Delta.hs didn't change anything, so it's not strictness (StrictData is from day 1 in the .cabal file).

Mikolaj commented 2 years ago

I'm desperate, because we are over 2 months without reliable benchmarks and again today I refrained from a refactoring, because I have no way of reliably measuring if it's acceptable performance-wise (it introduces some GADTs). Could somebody please have a look at what I've found so far and offer any feedback? #14 is still not fixed, but I can't wait any longer to see if #14 is the culprit or obscures the investigation. I'm going to spend some more time on this ticket, but all this time will be wasted if I'm missing something obvious. Please help.

tomjaguarpaw commented 2 years ago

@simonpj and @goldfirere this may need your input?

I agree with @awf here. I'm afraid my GHC knowledge is insufficient to help with this.

goldfirere commented 2 years ago

This kind of thing is definitely more in @simonpj's wheelhouse than mine. But I'm happy to discuss on a call including Simon -- I can learn new things. Is that the best way forward?

Mikolaj commented 2 years ago

Results from a new series of tests. Some of these have plugins ripped off and so can include results with HEAD and with https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 applied to HEAD --- ghc-9.5.20220606 is HEAD from 20220608 and ghc-9.5.20220608 additionally has https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 applied.

Branch ghc-report-specialize, the baseline. This also exactly reproduces the measurements recorded in #14 (but note that ghc-9.3.20220316 and ghc-9.5.20220606 differ a lot):

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-8.10.7/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  29,969,255,120 bytes allocated in the heap
  38,317,259,072 bytes copied during GC
     124,406,080 bytes maximum residency (227 sample(s))
       2,922,560 bytes maximum slop
             274 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22944 colls,     0 par   11.853s  11.909s     0.0005s    0.0016s
  Gen  1       227 colls,     0 par    3.197s   3.210s     0.0141s    0.0296s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.819s  (  7.844s elapsed)
  GC      time   15.050s  ( 15.119s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   22.869s  ( 22.963s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,832,835,823 bytes per MUT second

  Productivity  34.2% of total user, 34.2% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  29,993,632,336 bytes allocated in the heap
  36,535,814,368 bytes copied during GC
     119,503,384 bytes maximum residency (253 sample(s))
       2,919,120 bytes maximum slop
             268 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22918 colls,     0 par   11.871s  11.882s     0.0005s    0.0020s
  Gen  1       253 colls,     0 par    1.945s   1.946s     0.0077s    0.0311s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.887s  (  7.885s elapsed)
  GC      time   13.816s  ( 13.828s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   21.703s  ( 21.714s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,802,719,039 bytes per MUT second

  Productivity  36.3% of total user, 36.3% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  73,220,358,784 bytes allocated in the heap
       1,808,744 bytes copied during GC
     119,593,208 bytes maximum residency (201 sample(s))
       2,928,528 bytes maximum slop
             262 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     16080 colls,     0 par   13.622s  13.636s     0.0008s    0.0027s
  Gen  1       201 colls,     0 par    2.961s   2.962s     0.0147s    0.0284s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   26.252s  ( 26.256s elapsed)
  GC      time   16.583s  ( 16.598s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   42.835s  ( 42.854s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,789,111,099 bytes per MUT second

  Productivity  61.3% of total user, 61.3% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  34,265,500,784 bytes allocated in the heap
         346,392 bytes copied during GC
     120,458,736 bytes maximum residency (190 sample(s))
       2,865,072 bytes maximum slop
             276 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6595 colls,     0 par   14.210s  14.228s     0.0022s    0.0065s
  Gen  1       190 colls,     0 par    3.122s   3.125s     0.0164s    0.0308s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.590s  (  8.585s elapsed)
  GC      time   17.332s  ( 17.354s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time   25.922s  ( 25.940s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,988,804,129 bytes per MUT second

  Productivity  33.1% of total user, 33.1% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  34,036,245,656 bytes allocated in the heap
         354,304 bytes copied during GC
     123,780,712 bytes maximum residency (185 sample(s))
       2,890,416 bytes maximum slop
             269 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6601 colls,     0 par   13.873s  13.889s     0.0021s    0.0058s
  Gen  1       185 colls,     0 par    3.077s   3.080s     0.0166s    0.0308s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.171s  (  8.164s elapsed)
  GC      time   16.950s  ( 16.969s elapsed)
  EXIT    time    0.000s  (  0.007s elapsed)
  Total   time   25.121s  ( 25.140s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,165,636,535 bytes per MUT second

  Productivity  32.5% of total user, 32.5% of total elapsed

The original branch issue-20 at which the problem has been spotted:

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-8.10.7/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  92,986,374,864 bytes allocated in the heap
  40,901,420,272 bytes copied during GC
     123,422,176 bytes maximum residency (225 sample(s))
       2,875,704 bytes maximum slop
             272 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     82454 colls,     0 par   12.759s  12.787s     0.0002s    0.0015s
  Gen  1       225 colls,     0 par    4.744s   4.749s     0.0211s    0.0323s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   28.328s  ( 28.341s elapsed)
  GC      time   17.503s  ( 17.536s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   45.830s  ( 45.877s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,282,546,926 bytes per MUT second

  Productivity  61.8% of total user, 61.8% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  93,000,436,368 bytes allocated in the heap
  40,616,029,992 bytes copied during GC
     123,421,896 bytes maximum residency (229 sample(s))
       2,879,376 bytes maximum slop
             272 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     82450 colls,     0 par   12.613s  12.648s     0.0002s    0.0014s
  Gen  1       229 colls,     0 par    4.547s   4.555s     0.0199s    0.0332s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   30.486s  ( 30.526s elapsed)
  GC      time   17.160s  ( 17.203s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   47.645s  ( 47.729s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,050,631,792 bytes per MUT second

  Productivity  64.0% of total user, 64.0% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
 106,033,830,776 bytes allocated in the heap
       2,666,720 bytes copied during GC
     123,990,232 bytes maximum residency (224 sample(s))
       2,902,184 bytes maximum slop
             285 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     23557 colls,     0 par   14.373s  14.389s     0.0006s    0.0026s
  Gen  1       224 colls,     0 par    3.833s   3.836s     0.0171s    0.0315s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   35.800s  ( 35.812s elapsed)
  GC      time   18.207s  ( 18.225s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   54.007s  ( 54.037s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,961,877,948 bytes per MUT second

  Productivity  66.3% of total user, 66.3% of total elapsed

and branch issue-20-without-plugins tweaked only just enough to run HEAD on it (!7847 leads to 20% slowdown here)

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  88,523,940,792 bytes allocated in the heap
         808,552 bytes copied during GC
     118,684,856 bytes maximum residency (213 sample(s))
       2,864,776 bytes maximum slop
             282 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     19574 colls,     0 par   14.921s  14.963s     0.0008s    0.0034s
  Gen  1       213 colls,     0 par    3.645s   3.654s     0.0172s    0.0331s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   30.808s  ( 30.826s elapsed)
  GC      time   18.566s  ( 18.616s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time   49.374s  ( 49.450s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,873,446,619 bytes per MUT second

  Productivity  62.4% of total user, 62.3% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  88,571,121,880 bytes allocated in the heap
         808,576 bytes copied during GC
     123,240,224 bytes maximum residency (209 sample(s))
       2,900,480 bytes maximum slop
             288 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     19579 colls,     0 par   15.005s  15.028s     0.0008s    0.0043s
  Gen  1       209 colls,     0 par    3.916s   3.919s     0.0188s    0.0334s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   39.076s  ( 39.060s elapsed)
  GC      time   18.921s  ( 18.947s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time   57.997s  ( 58.010s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,266,648,087 bytes per MUT second

  Productivity  67.4% of total user, 67.3% of total elapsed

The original branch extra-pragmas-BENCH2 mentioned early in his ticket that adds some likely and unlikely extra SPECIALIZE pragmas and proxies compared to branch issue-20:

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-8.10.7/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  62,690,403,808 bytes allocated in the heap
  39,888,985,560 bytes copied during GC
     123,908,360 bytes maximum residency (233 sample(s))
       2,916,184 bytes maximum slop
             278 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     53938 colls,     0 par   11.944s  11.956s     0.0002s    0.0017s
  Gen  1       233 colls,     0 par    4.230s   4.232s     0.0182s    0.0323s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   16.421s  ( 16.418s elapsed)
  GC      time   16.174s  ( 16.188s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   32.595s  ( 32.607s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,817,733,452 bytes per MUT second

  Productivity  50.4% of total user, 50.4% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  73,309,205,216 bytes allocated in the heap
  40,590,756,360 bytes copied during GC
     124,398,448 bytes maximum residency (230 sample(s))
       2,881,904 bytes maximum slop
             275 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     63941 colls,     0 par   12.315s  12.332s     0.0002s    0.0016s
  Gen  1       230 colls,     0 par    4.667s   4.670s     0.0203s    0.0329s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   18.501s  ( 18.499s elapsed)
  GC      time   16.982s  ( 17.001s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   35.483s  ( 35.500s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,962,535,916 bytes per MUT second

  Productivity  52.1% of total user, 52.1% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
 105,856,182,136 bytes allocated in the heap
       2,633,728 bytes copied during GC
     124,410,816 bytes maximum residency (220 sample(s))
       2,920,704 bytes maximum slop
             293 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     23559 colls,     0 par   14.657s  14.673s     0.0006s    0.0035s
  Gen  1       220 colls,     0 par    4.069s   4.072s     0.0185s    0.0306s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   36.184s  ( 36.197s elapsed)
  GC      time   18.727s  ( 18.745s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   54.911s  ( 54.942s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,925,461,743 bytes per MUT second

  Productivity  65.9% of total user, 65.9% of total elapsed

Branch extra-pragmas-BENCH2-HEAD that is similar to extra-pragmas-BENCH2, but has even more random proxies backing up SPECIALIZE, just in case, and the plugins gutted out to be compilable with HEAD:

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-8.10.7/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  58,827,983,808 bytes allocated in the heap
  38,493,956,792 bytes copied during GC
     124,405,240 bytes maximum residency (227 sample(s))
       2,894,304 bytes maximum slop
             285 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     49944 colls,     0 par   12.086s  12.104s     0.0002s    0.0017s
  Gen  1       227 colls,     0 par    3.613s   3.617s     0.0159s    0.0310s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   15.403s  ( 15.411s elapsed)
  GC      time   15.700s  ( 15.722s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   31.103s  ( 31.132s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,819,164,921 bytes per MUT second

  Productivity  49.5% of total user, 49.5% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  69,478,029,216 bytes allocated in the heap
  40,785,590,224 bytes copied during GC
     124,407,880 bytes maximum residency (228 sample(s))
       2,881,736 bytes maximum slop
             274 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     60443 colls,     0 par   12.557s  12.573s     0.0002s    0.0016s
  Gen  1       228 colls,     0 par    4.952s   4.955s     0.0217s    0.0329s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   17.392s  ( 17.393s elapsed)
  GC      time   17.510s  ( 17.528s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   34.902s  ( 34.921s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,994,815,350 bytes per MUT second

  Productivity  49.8% of total user, 49.8% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  58,764,870,096 bytes allocated in the heap
         574,432 bytes copied during GC
     124,410,472 bytes maximum residency (176 sample(s))
       2,879,632 bytes maximum slop
             295 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     12104 colls,     0 par   14.253s  14.264s     0.0012s    0.0053s
  Gen  1       176 colls,     0 par    4.956s   4.959s     0.0282s    0.0337s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   18.668s  ( 18.674s elapsed)
  GC      time   19.209s  ( 19.223s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   37.877s  ( 37.897s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,147,871,898 bytes per MUT second

  Productivity  49.3% of total user, 49.3% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  54,361,525,736 bytes allocated in the heap
         516,920 bytes copied during GC
     119,937,176 bytes maximum residency (176 sample(s))
       2,799,864 bytes maximum slop
             280 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     11110 colls,     0 par   13.856s  13.878s     0.0012s    0.0081s
  Gen  1       176 colls,     0 par    4.657s   4.663s     0.0265s    0.0333s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   16.279s  ( 16.278s elapsed)
  GC      time   18.513s  ( 18.541s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time   34.792s  ( 34.820s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,339,439,449 bytes per MUT second

  Productivity  46.8% of total user, 46.7% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  54,344,406,608 bytes allocated in the heap
         516,928 bytes copied during GC
     119,945,280 bytes maximum residency (176 sample(s))
       2,770,872 bytes maximum slop
             280 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     11111 colls,     0 par   13.697s  13.721s     0.0012s    0.0055s
  Gen  1       176 colls,     0 par    4.648s   4.655s     0.0264s    0.0332s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   19.119s  ( 19.127s elapsed)
  GC      time   18.344s  ( 18.375s elapsed)
  EXIT    time    0.000s  (  0.007s elapsed)
  Total   time   37.463s  ( 37.510s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,842,471,674 bytes per MUT second

  Productivity  51.0% of total user, 51.0% of total elapsed
Mikolaj commented 2 years ago

The results are inconclusive. On one hand, it seem the workaround from ghc-report-specialize was not enough for the newer code that has more complex types. The performance improved in extra-pragmas-BENCH2 thanks to adding all likely and unlikely SPECIALIZE pragmas and improved even more (especially in GHC 9.2.3) in extra-pragmas-BENCH2-HEAD by adding some extra proxies that back the pragmas. But still the performance does not match between the code versions that essentially differ only in typing and also between GHC versions.

I conjecture #14 is responsible for all the performance regained via workarounds and possibly for all or part of the remaining performance gap. I'm just not diligent enough to track down all pragmas that are needed as a workaround, so I might have missed some. Also GHC accepts pragmas that are void, because the types given do not suffice to resolve all type variables (which is why adding proxies that carry the valuation of the remaining type variables improves performance). Also, proxies are not possible in some operations, e.g., Num operations, so manual inlining and then specialization (and proxies) is needed and I did some of that, but patience failed me.

I think we should wait until #14 is resolved.

Mikolaj commented 2 years ago

I've retested BENCH2 on branch extra-pragmas-BENCH2-HEAD with

{-# INLINABLE sumTrainableInputs #-}
{-# INLINABLE sumConstantData #-}

instead of SPECIALIZE for the same functions. Memory behaviour is exactly the same, but runtime seems a bit faster.

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  54,344,406,608 bytes allocated in the heap
         516,936 bytes copied during GC
     119,945,280 bytes maximum residency (176 sample(s))
       2,770,872 bytes maximum slop
             280 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     11111 colls,     0 par   13.856s  13.868s     0.0012s    0.0065s
  Gen  1       176 colls,     0 par    4.657s   4.660s     0.0265s    0.0332s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   17.554s  ( 17.536s elapsed)
  GC      time   18.513s  ( 18.528s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time   36.067s  ( 36.070s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,095,861,693 bytes per MUT second

  Productivity  48.7% of total user, 48.6% of total elapsed

So this doesn't seem the magic bullet to align the performance of the two code snapshots (that probably differ only in typing), though I haven't tried all combinations. It may be worthwhile to at least try changing every SPECIALIZE to INLINABLE and test this straightforward variant, Additionally doing that for testMnist0 doesn't change anything, but there's a lot more SPECIALIZE on that branch and they matter for performance. [Edit: testMnist0 is not called in this benchmark, but doing that for nnMnistLoss0 in a different branch doesn't seem to affect performance either.]

Mikolaj commented 2 years ago

I've recorded three ticky profiles of the benchmark (BENCH2) with HEAD from 20220606 and https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 applied. They are with branches https://github.com/Mikolaj/horde-ad/tree/ghc-report-specialize, https://github.com/Mikolaj/horde-ad/tree/issue-20-without-plugins and https://github.com/Mikolaj/horde-ad/tree/extra-pragmas-BENCH2-HEAD as described above. The files are ghc-report-specialize.ticky.txt, issue-20-without-plugins.ticky.txt and extra-pragmas-BENCH2-HEAD.ticky.txt. The extra flag when compiling was --ghc-option=-ticky and when running was +RTS -rbranch_name.ticky.txt.

The results are quite striking: some numbers repeat exactly, because it's the same benchmark and the code (probably) differs only in types and in some extra components not used in the benchmark. Other numbers are completely different, but I don't know what they imply. E.g., of the tails calls, 100%, 88% and 60% have been to known functions, respectively, with very different absolute numbers. Returns "from entering a new constructor" are 95%, 50%, 0%, again with different absolute numbers. There are 500 times more updates in the worst case (the middle one).

In general, Delta.eval0 and and the functions it calls seems to be the culprit. The lines for that function and it's top-level host evalBindings are, respectively

  818525000  4365472000           0   2 {"type":"entCntr","subTy":"fun","fvs_c":4,"fvs":">MPi","args $weval0{v skDS} (BenchMnistTools) (fun) in rigR
        500    10704000           0   8 {"type":"entCntr","subTy":"fun","fvs_c":0,"fvs":"","args":"I BenchMnistTools.$w$s$wevalBindings1{v rigR} (fun)

  818525000 15297604000           0   2 {"type":"entCntr","subTy":"fun","fvs_c":17,"fvs":"+>IIIII.+S eval0{v sOOK} (HordeAd.Internal.Delta) (fun) in rOgs
        500    11204000           0   9 {"type":"entCntr","subTy":"fun","fvs_c":0,"fvs":"","args":"+ HordeAd.Internal.Delta.$wevalBindings{v rOgs} (fun)

  818525000 15297604000           0   2 {"type":"entCntr","subTy":"fun","fvs_c":17,"fvs":"+>IIIII.+S eval0{v sUin} (HordeAd.Internal.Delta) (fun) in rRSA
        500    11204000           0   9 {"type":"entCntr","subTy":"fun","fvs_c":0,"fvs":"","args":"+ HordeAd.Internal.Delta.$wevalBindings{v rRSA} (fun)

If I'm guessing correctly, in the fast case, the eval0 function was transformed with a worker/wrapper and somehow inlined in BenchMnistTools, unlike in the bad case and the desperately repaired bad case. The top level function evalBindings looks differently, as well, though performs similarly. Is it just that the function got bigger (it now also handles the shaped tensors, though it's dead code in the benchmark) and doesn't inline any more? Or is worker/wrapper transformation not triggered at all for some reason? I think we'd need to look at Core of BenchMnistTools, but it would help to understand the generated function names and other info we are likely to see.

Mikolaj commented 1 year ago

This is fixed on GHC 9.6.1, except perhaps for <10% discrepancies that have their GHC tickets open. Closing.