SPECIALIZE needed despite no warnings by -Wall-missed-specialisations

Mikolaj / horde-ad

Higher Order Reverse Derivatives Efficiently - Automatic Differentiation library based on the paper "Provably correct, asymptotically efficient, higher-order reverse-mode automatic differentiation"

BSD 3-Clause "New" or "Revised" License

33 stars 6 forks source link

SPECIALIZE needed despite no warnings by -Wall-missed-specialisations #14

Closed Mikolaj closed 1 year ago

Mikolaj commented 2 years ago

In short: despite -fexpose-all-unfoldings -fspecialise-aggressively, two SPECIALIZE pragmas are required to significantly speed up and lower allocation for the code. Confirmed with GHC 8.10.7, 9.0.2 and 9.2.1.

That would imply that there is a specialization that is not performed by GHC without the pragmas. However, despite -Wall-missed-specialisations, no missed specialization is reported in the code without SPECIALIZE (with 8.10.7 and 9.0.2; ~OTOH 9.2.1 spams so much that it's hard to tell; we should try 9.2.2 once it's out~ [edit: 9.2.2 repro done, see below; in short: less spam, but no relevant warning]).

It would be great to understand if these are GHC bugs that needs to be reported and whether we may have other not applied specilizations that -Wall-missed-specialisations is not telling us about. Is there a cheap way to find them? Can we fix them or rule them out somehow even without finding them individually? Any there any ideas emerging from this case about how to speed up this particular part of code (which is computing the value of the objective function, ignoring derivatives; #11)?

Reproducing BENCH1 with ghc-9.0.2:

git clone git@github.com:Mikolaj/horde-ad.git
git checkout ghc-report-abstract
cabal build --enable-benchmarks
`cabal list-bin mnist` -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s

This provides the baseline, in which things are apparently not specialized. Note that there is no warning about missing specialization. For me, the results are

 509,766,256,688 bytes allocated in the heap
  Total   time  151.546s  (151.631s elapsed)

The following only adds a commit that introduces the two SPECIALIZE pragmas. [Edit: and then two others that I added only to eliminate thunks that would obscure what's going on, No significant change of results due to their elimination.]

git checkout ghc-report-specialize
cabal build --enable-benchmarks
`cabal list-bin mnist` -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s

My results are

   2,463,534,528 bytes allocated in the heap
  Total   time    8.750s  (  8.757s elapsed)

The real story, the one recorded on master branch, is that I originally applied a workaround that apparently specialized some things without needing SPECIALIZE. It sits halfway between the two extremes above performance-wise. Probably not important, but it may indicate where the lack of specialization is located. Commit https://github.com/Mikolaj/horde-ad/commit/229a9b834c80a1a98ece769ee49a634951f25405 from master branch, while adding the two SPECIALIZE, removes the workaround (and the commit message fails to explain what's going on).

Mikolaj commented 2 years ago

Update re GHC 9.2.2:

OTOH 9.2.1 spams so much that it's hard to tell; we should try 9.2.2 once it's out).

GHC 9.2.2 seems to spam just as much, which makes -Wall-missed-specialisations unusable. I think the fix was not backported to 9.2.2, so we'd need to check master branch. The offending functions are probably mentioned in the -Wall-missed-specialisations output, but it's hard to pick up from the noise --- definitely there is a difference in size between the output with and without specialization, but the diffs are big.

Mikolaj commented 2 years ago

I've reproduced exactly the 2 scenarios with 9.2.2, with the same results as with 9.2.1.

The fix for too verbose -Wall-missed-specialisations was back-ported, after all:

https://gitlab.haskell.org/ghc/ghc/-/issues/20744#note_412503

and when I looked more closely, indeed the spam is much less. However, I could not find the problematic functions, those for which SPECIALIZE was necessary, in the log. OTOH, -Wmissed-specialisations produces no warnings at all.

simonpj commented 2 years ago

I got stuck on teh cabal build step with

Failed to build zlib-0.6.2.3. The failure occurred during the configure step.
Build log (
/home/simonpj/.cabal/logs/ghc-8.10.7/zlib-0.6.2.3-def443548e978facc1ac2e0de278eca4693b87b4f14d6e73f4b00178d6270d82.log
):
Configuring library for zlib-0.6.2.3..
cabal-3.6.2.0: Missing dependency on a foreign library:
* Missing (or bad) header file: zlib.h
* Missing (or bad) C library: z
This problem can usually be solved by installing the system package that
provides this library (you may need the "-dev" version). If the library is
already installed but in a non-standard location then you can use the flags
--extra-include-dirs= and --extra-lib-dirs= to specify where it is.If the
library file does exist, it may contain errors that are caught by the C
compiler at the preprocessing stage. In this case you can re-run configure
with the verbosity flag -v3 to see the error messages.
If the header file does exist, it may contain errors that are caught by the C
compiler at the preprocessing stage. In this case you can re-run configure
with the verbosity flag -v3 to see the error messages.

Not sure what to do next. I guess I have to install something -- but what?

tomjaguarpaw commented 2 years ago

If you're on an Debian-based system (e.g. Ubuntu) then

sudo apt install zlib1g-dev

Mikolaj commented 2 years ago

Later on there are also harder deps to fill described here: https://github.com/Mikolaj/horde-ad/blob/master/README.md#compilation-from-source

Mikolaj commented 2 years ago

BTW, the second commit (with the two SPECIALIZE) produces such crazy results (best to just do git checkout ghc-report-specialize; master branch has similar results) in BENCH2:

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
 720,813,256,744 bytes allocated in the heap
      17,976,640 bytes copied during GC
     119,619,152 bytes maximum residency (1882 sample(s))
       2,890,176 bytes maximum slop
             278 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     158435 colls,     0 par   158.464s  158.741s     0.0010s    0.0071s
  Gen  1      1882 colls,     0 par   31.475s  31.519s     0.0167s    0.0331s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time  274.205s  (274.402s elapsed)
  GC      time  189.940s  (190.261s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time  464.145s  (464.663s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,628,741,040 bytes per MUT second

  Productivity  59.1% of total user, 59.1% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
 287,734,398,208 bytes allocated in the heap
 382,310,058,648 bytes copied during GC
     124,421,656 bytes maximum residency (2157 sample(s))
       2,946,616 bytes maximum slop
             275 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     219047 colls,     0 par   122.137s  122.312s     0.0006s    0.0026s
  Gen  1      2157 colls,     0 par   32.398s  32.433s     0.0150s    0.0311s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   81.031s  ( 81.032s elapsed)
  GC      time  154.535s  (154.746s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time  235.566s  (235.778s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,550,936,491 bytes per MUT second

  Productivity  34.4% of total user, 34.4% of total elapsed

I'm just trying to unravel that. It may be related to laziness and the dirty tricks criterion does to make sure it measures only the benchmarked code, excluding the setup code. However, just adding full deepseq forcing to the benchmarked function doesn't change anything [edit: pushed to branch ghc-report-specialize and verified]. The wild differences in allocation and GC may also be related to whether hmatrix allocates stuff on the heap or somewhere in C space, or to pinned/unpinned, but I don't know of any recent changes in GHC or libraries about that.

Edit: GHC 8.10.7 gives almost the same results as GHC 9.0.2.

Edit2: and the fact that I can only use GHC 9.2 in the recent code (due to type-level pattern matching that is only in 9.2. see #13), makes it so much harder to compare perf and discover particular GHC version flukes vs. consistent perf changes.

Edit3: perhaps what's going is that GHC 9.2.2 (or some newer library versions enabled by GHC 9.2.2) fail to fuse, e.g., vectors. That can cause much more allocation (allocation done not by GC, to be clear) and so much higher mutation time (worse cache locality), despite less time spent in GC. However, why failed fusion would cause so much less time spent in GC eludes me. Unless this has an independent cause.

Edit4: this ticket is important also because our benchmarks vs other Haskell libraries are worth nothing if we have such huge unexplained anomalies. Benchmarks against Python/C libraries may also give a wrong impression if the Haskell side is handicapped for no fundamental reason.

Edit5: the -h heap profile shows a big difference in allocating the HordeAd.Core.Delta.DeltaScalar values, which are gathered in the benchmarked function, but never unused. Perhaps they are evaluated in GHC 9.2.2, because the HordeAd.Core.DualNumber.DualNumber datatype is strict (StrictData), but not in GHC 9.0.2, because they are unused? Perhaps GHC 9.2.2 has a more aggressive (more correct?) strictness analysis? However, making the relevant filed lazy

data DualNumber a = D a ~(DeltaExpression a)

makes the -s report for GHC 9.2.2 significantly worse rather than better (which is counterintuitive). So perhaps GHC 9.2.2 is worse at detecting and ignoring dead code?

Mikolaj commented 2 years ago

The result of mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s -hi -l-aug (BENCH2) from branch ghc-report-specialize with GHC 9.2.2 doesn't show much:

Profiling output (on master branch, commit 1f7926f6bf039163e67238a6bfcb3bfdbd6d9fe9, performance is similar, but profile looks quite different to branch ghc-report-specialize, after both diverged a bit lately, but that's probably just a difference in how profiling skews results) for the 9.2.2 run above (the one that is twice slower and allocates 3 times more than 9.0.2, but does almost no GC) does not confirm the failed fusion hypothesis. It assigns most of allocations to this code that should not permanently allocate anything (so perhaps there are some transient allocations? or is ST broken with GHC 9.2.2?):

https://github.com/Mikolaj/horde-ad/blob/b02684896f48c12735a11d02ed6409ae99ddfbb9/src/HordeAd/Core/Delta.hs#L341-L343

COST CENTRE                     MODULE                       SRC                                                 %time %alloc

buildVectors.eval0              HordeAd.Core.Delta           src/HordeAd/Core/Delta.hs:(339,7)-(350,41)           22.0    7.4
var                             HordeAd.Core.PairOfVectors   src/HordeAd/Core/PairOfVectors.hs:57:1-70            15.0    8.9
scale                           HordeAd.Core.DualNumber      src/HordeAd/Core/DualNumber.hs:108:1-42              12.1    5.3
sumConstantData                 HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:(47,1)-(53,33)   11.1   10.7
+                               HordeAd.Core.DualNumber      src/HordeAd/Core/DualNumber.hs:50:3-42                9.7    6.3
buildVectors.eval0.\            HordeAd.Core.Delta           src/HordeAd/Core/Delta.hs:342:21-42                   7.4   23.6
sumConstantData.f               HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:(50,7)-(52,26)    5.4    0.0
sumConstantData.f.v             HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:51:13-46          4.3    6.4
*                               HordeAd.Core.DualNumber      src/HordeAd/Core/DualNumber.hs:52:3-64                2.7    4.4
buildVectors.eval0.\            HordeAd.Core.Delta           src/HordeAd/Core/Delta.hs:343:29-52                   2.6    7.4
buildVectors.eval0.\            HordeAd.Core.Delta           src/HordeAd/Core/Delta.hs:341:23-37                   2.1   11.8

Eventlog profiling and heap profiles show no relevant thunks (on master branch I squelched most innocent thunks just to be sure).

Profiling output from branch ghc-report-specialize with GHC 9.2.2 shows a more believable story, but still no explanation for the difference vs 9.0.2 and still no support for the broken vector fusion hypothesis:

COST CENTRE                     MODULE                       SRC                                                 %time %alloc

buildVectors.eval0              HordeAd.Core.Delta           src/HordeAd/Core/Delta.hs:(171,7)-(179,60)           21.3    0.0
var                             HordeAd.Core.PairOfVectors   src/HordeAd/Core/PairOfVectors.hs:50:1-64            17.4   17.9
scale                           HordeAd.Core.DualNumber      src/HordeAd/Core/DualNumber.hs:99:1-42               13.6   10.7
sumConstantData                 HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:(47,1)-(53,33)   13.0   21.4
+                               HordeAd.Core.DualNumber      src/HordeAd/Core/DualNumber.hs:41:3-42               11.3   12.8
sumConstantData.f               HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:(50,7)-(52,26)    5.3    0.0
sumConstantData.f.v             HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:51:13-46          4.7   12.9
*                               HordeAd.Core.DualNumber      src/HordeAd/Core/DualNumber.hs:43:3-64                3.1    8.8

Analogous profile for GHC 9.0.2 is quite similar.

Mikolaj commented 2 years ago

@simonpj: A remark about the code that needs the two SPECIALIZE pragmas (BENCH1): this is a thick code path, where most of the code is not used (gradient and object function result are computed, the gradient is thrown away and only the result is kept). So perhaps the specialization is lost when some parts of the code are simplified away? Or some types are not instantiated, because the gradient is thrown away, and this prevents specialization somehow?

Another hint: a part of the problem goes away (the performance is improve partially) when an associated type (DeltaExpression) is manually resolved and inlined in some function signature.

Mikolaj commented 2 years ago

@simonpj: Might this be related: https://gitlab.haskell.org/ghc/ghc/-/issues/20709 ? If so, should I compile HEAD, try to compile horde-ad and check if any of the errors above is fixed?

Or is there a way to inhibit worker/wrapper or inlining of the wrapper?

simonpj commented 2 years ago

If so, should I compile HEAD, try to compile horde-ad and check if any of the errors above is fixed?

It would be great if you could do that. I still have no idea if it is related thought

Mikolaj commented 2 years ago

First results of BENCH2 with HEAD and head.hackage: the runtime and allocation regression in 9.2.2 is mostly gone [Edit: this is correct; only 14% regression in time and alloc remains]]:

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  34,321,569,296 bytes allocated in the heap
         753,488 bytes copied during GC
     120,328,568 bytes maximum residency (193 sample(s))
       2,872,192 bytes maximum slop
             255 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6592 colls,     0 par   14.090s  14.108s     0.0021s    0.0057s
  Gen  1       193 colls,     0 par    2.845s   2.848s     0.0148s    0.0314s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.335s  (  8.320s elapsed)
  GC      time   16.935s  ( 16.957s elapsed)
  EXIT    time    0.000s  (  0.004s elapsed)
  Total   time   25.270s  ( 25.280s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,117,838,210 bytes per MUT second

  Productivity  33.0% of total user, 32.9% of total elapsed

This is much closer to 9.2.2 results now. One thing that didn't change is copied during GC, but that's probably a positive.

Edit: this is solved most probably by the fix to https://gitlab.haskell.org/ghc/ghc/-/issues/20364 / https://gitlab.haskell.org/ghc/ghc/-/issues/20709 that is in HEAD.

Mikolaj commented 2 years ago

And with GHC HEAD there is no difference in -s output after removing the two SPECIALIZE pragmas. [Edit: I was wrong, because I used a wrong test (BENCH2, the one for the 9.2.2 regression). Below I confirm the problem with BENCH1 and needing the SPECIALIZE pragmas is only partially solved.]

The instructions to reproduce the HEAD experiment [the instructions are for BENCH1 and I should have used that, but I used BENCH2 instead];

git clone git@github.com:Mikolaj/horde-ad.git
git checkout ghc-report-specialize
build HEAD with Hadrian
add cabal.project.local file with (see http://ghc.gitlab.haskell.org/head.hackage)

repository head.hackage.ghc.haskell.org
   url: https://ghc.gitlab.haskell.org/head.hackage/
   secure: True
   key-threshold: 3
   root-keys:
       7541f32a4ccca4f97aea3b22f5e593ba2c0267546016b992dfadcd2fe944e55d
       26021a13b401500c8eb2761ca95c61f2d625bfef951b939a8124ed12ecf07329
       f76d08be13e9a61a377a85e2fb63f4c5435d40f8feb3e12eb05905edb8cdea89

cabal update
cabal build -w /path_to_ghc/_build/stage1/bin/ghc --allow-newer --enable-benchmarks
and then un the BENCH1 benchmark, substituting a correct path of the binary

    /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220610/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s

simonpj commented 2 years ago

And with GHC HEAD there is no difference -s output after removing the two SPECIALIZE pragmas.

Do you mean: with HEAD the problem is solved?

Mikolaj commented 2 years ago

And with GHC HEAD there is no difference -s output after removing the two SPECIALIZE pragmas.

Do you mean: with HEAD the problem is solved?

[Edit: I was wrong, because I used a wrong test (BENCH2, the one for the 9.2.2 regression). The problem is only partially solved.] The problem that SPECIALIZE is needed for BENCH1 despite -fexpose-all-unfoldings -fspecialise-aggressively is completely solved. The problem that there is no warning about the lack of specialization despite -Wall-missed-specialisations or despite -Wmissed-specialisations (which produces no warning at all) is probably not solved, but no way to reproduce it now. The 2-fold runtime and 3-fold alloc regression in 9.2.2 vs 9.0.2 in BENCH2 is solved partially (there is 14% time and alloc regression still).

simonpj commented 2 years ago

The 2-fold time and 3-fold alloc regression in 9.2.2 vs 9.0.2 is solved partially (there is 14% time and alloc regression still).

So HEAD is still 14% slower than 9.0? Can you give specific repro instructions? Thanks

Mikolaj commented 2 years ago

So HEAD is still 14% slower than 9.0? Can you give specific repro instructions? Thanks

Yes, I just repeated the tests. The instructions are as above, but run 500/500 (BENCH2) instead of 500/test 500 (BENCH1). The same with 9.0.2.

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  34,321,569,296 bytes allocated in the heap
         753,496 bytes copied during GC
     120,328,568 bytes maximum residency (193 sample(s))
       2,872,192 bytes maximum slop
             255 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6592 colls,     0 par   14.022s  14.043s     0.0021s    0.0062s
  Gen  1       193 colls,     0 par    2.969s   2.973s     0.0154s    0.0322s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.326s  (  8.314s elapsed)
  GC      time   16.991s  ( 17.016s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time   25.317s  ( 25.340s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,122,453,209 bytes per MUT second

  Productivity  32.9% of total user, 32.8% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  29,993,632,336 bytes allocated in the heap
  36,535,814,368 bytes copied during GC
     119,503,384 bytes maximum residency (253 sample(s))
       2,919,120 bytes maximum slop
             268 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22918 colls,     0 par   11.927s  11.945s     0.0005s    0.0016s
  Gen  1       253 colls,     0 par    1.969s   1.972s     0.0078s    0.0307s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.982s  (  7.983s elapsed)
  GC      time   13.896s  ( 13.917s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   21.878s  ( 21.900s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,757,806,371 bytes per MUT second

  Productivity  36.5% of total user, 36.5% of total elapsed

Mikolaj commented 2 years ago

BTW, in some similar tests earlier on I had 20% speedup in 9.2.2 vs 9.0.2, so this may be a problem that eats up the 20% improvement (e.g., from the low GC copying) and adds 14% degradation on top.

Mikolaj commented 2 years ago

Actually, for some reason I'm running a different test (BENCH2 instead of BENCH1). Let me retry with the one from the description of this ticket (BENCH1).

Mikolaj commented 2 years ago

I apologise. I forgot there are two different tests, one for the 2 SPECIALIZE pragmas problem (BENCH1, present in 8.10.7, 9.0.2, 9.2.2, partially in HEAD), another for the slowdown and allocation blowup problem with the pragmas in place (BENCH2, regression in 9.2.2, fixed in HEAD, except for 14%). I was right HEAD mostly fixes the latter problem (14% slowdown and blowup remains). I was wrong that it fixes the former, because I used the wrong test to verify that.

The 2 SPECIALIZE pragmas problem in BENCH1 is only partially mitigated in HEAD. Also some new problems seem to be present in HEAD that were not in 9.2.2.

Here are the results for BENCH1 from branch ghc-report-specialize (meaning, with the 2 SPECIALIZE pragmas):

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
  39,694,459,432 bytes allocated in the heap
       1,054,624 bytes copied during GC
      75,629,632 bytes maximum residency (11 sample(s))
       2,014,488 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9411 colls,     0 par    0.223s   0.223s     0.0000s    0.0013s
  Gen  1        11 colls,     0 par    0.022s   0.022s     0.0020s    0.0052s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   11.956s  ( 11.961s elapsed)
  GC      time    0.245s  (  0.244s elapsed)
  EXIT    time    0.000s  (  0.004s elapsed)
  Total   time   12.202s  ( 12.210s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,319,972,986 bytes per MUT second

  Productivity  98.0% of total user, 98.0% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
   2,208,754,088 bytes allocated in the heap
          30,112 bytes copied during GC
      71,871,304 bytes maximum residency (9 sample(s))
       1,610,936 bytes maximum slop
             144 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       473 colls,     0 par    0.030s   0.030s     0.0001s    0.0012s
  Gen  1         9 colls,     0 par    0.013s   0.013s     0.0014s    0.0039s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.680s  (  7.713s elapsed)
  GC      time    0.043s  (  0.043s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    7.723s  (  7.756s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    287,584,323 bytes per MUT second

  Productivity  99.4% of total user, 99.4% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
   2,463,534,528 bytes allocated in the heap
      31,052,832 bytes copied during GC
      63,802,840 bytes maximum residency (10 sample(s))
         939,224 bytes maximum slop
             129 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2173 colls,     0 par    0.050s   0.050s     0.0000s    0.0011s
  Gen  1        10 colls,     0 par    0.011s   0.011s     0.0011s    0.0035s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.539s  (  8.554s elapsed)
  GC      time    0.061s  (  0.061s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    8.600s  (  8.615s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    288,502,093 bytes per MUT second

  Productivity  99.3% of total user, 99.3% of total elapsed

And here are results with both SPECIALIZE pragmas removed (not other change):

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
 208,834,659,432 bytes allocated in the heap
      10,947,536 bytes copied during GC
      75,686,056 bytes maximum residency (16 sample(s))
       2,021,696 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     49915 colls,     0 par    1.063s   1.068s     0.0000s    0.0012s
  Gen  1        16 colls,     0 par    0.041s   0.041s     0.0026s    0.0060s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   16.940s  ( 16.943s elapsed)
  GC      time    1.105s  (  1.109s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time   18.045s  ( 18.060s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    12,328,026,045 bytes per MUT second

  Productivity  93.9% of total user, 93.8% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
 641,729,879,528 bytes allocated in the heap
      24,952,744 bytes copied during GC
      75,682,672 bytes maximum residency (16 sample(s))
       2,022,216 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     153353 colls,     0 par    2.523s   2.532s     0.0000s    0.0012s
  Gen  1        16 colls,     0 par    0.041s   0.041s     0.0026s    0.0057s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time  201.875s  (202.046s elapsed)
  GC      time    2.564s  (  2.573s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time  204.440s  (204.619s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,178,841,045 bytes per MUT second

  Productivity  98.7% of total user, 98.7% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
 509,766,256,688 bytes allocated in the heap
  13,188,041,760 bytes copied during GC
      75,670,952 bytes maximum residency (18 sample(s))
       2,020,440 bytes maximum slop
             153 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     486527 colls,     0 par    6.651s   6.719s     0.0000s    0.0011s
  Gen  1        18 colls,     0 par    0.042s   0.042s     0.0023s    0.0051s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time  134.245s  (134.266s elapsed)
  GC      time    6.693s  (  6.761s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time  140.938s  (141.027s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,797,286,411 bytes per MUT second

  Productivity  95.3% of total user, 95.2% of total elapsed

Mikolaj commented 2 years ago

I've repeated the SPECIALIZE test BENCH1 in GHC 9.2.2 with {-# OPTIONS_GHC -fno-worker-wrapper #-} in the two files where SPECIALIZE pragmas were located. I'm getting the same '-s' report with the 2 SPECIALIZE pragmas present and without the pragmas. That means that either -fno-worker-wrapper breaks specialization or that it actually fixes the very two specializations that the two pragmas were forcing. Both runtime and allocation is in-between the extremes from the previous comment, so it's likely -fno-worker-wrapper helps with the specialization, but impedes performance elsewhere. That would confirm the connection to https://gitlab.haskell.org/ghc/ghc/-/issues/20364 and https://gitlab.haskell.org/ghc/ghc/-/issues/20709.

Mikolaj commented 2 years ago

@simonpj: may I help with this one any more?

simonpj commented 2 years ago

Notes after a call with Mikolaj. (@Mikolaj please feel free to edit) [MK: edited; and again]

Current status

           Before recent changes             After recent changes and with a few pragmas for BENCH1
           on branch ghc-report-specialize   on branch extra-pragmas 
                                             (in parens results with a dozen extra pragmas for BENCH2
                                              on branch extra-pragmas-BENCH2)

BENCH1     SPECIALISE  no SPEC prags         SPECIALISE    no SPEC prags
           prags       -fspec-agg etc        prags

8.10.7       Fast         Slow               Fast          Slow
9.0.2        Fast         Slow               Slow+         Slow
9.2.2        Fast         Slow               Fast          Slow
HEAD         Fast-        Slow+              ?             ?

BENCH2     SPECIALISE  no SPEC prags         SPECIALISE    no spec prags
           prags       -fspec-agg etc        prags

8.10.7       Fast                            Slow (Slow+)
9.0.2        Fast                            Slow (Slow+)
9.2.2        Slow                            Slow (Slow)
HEAD         Fast-                           ?

Fast- is less fast and Slow+ is less slow, but there's still a huge difference between the two.

Can't test with HEAD in branch extra-pragmas because it doesn't work with typechecker plugins. Probably just needs updates of the plugins, in head.hackage.
Key SPECIALISE pragma is in src/HordeAd/Tool/MnistFcnnScalar.hs, function {-# SPECIALIZE testMnist2 :: Int -> Int -> [MnistData Double] -> Domain Double -> Double #-} In particular, only this pragma has some effect on branch extra-pragmas.
That function fires when compiling bench/BenchMnistTools.hs
[1] Command line to compile src/HordeAd/Tool/MnistFcnnScalar.hs
[2] Command line to compile bench/BenchMnistTools.hs

[1] /home/simonpj/code/HEAD-5/inplace/bin/ghc-stage2 -c -fbuilding-cabal-package -O -static -dynamic-too -dynosuf dyn_o -dynhisuf dyn_hi -outputdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -odir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -hidir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -stubdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -i -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -isrc -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -optP-include -optP/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen/cabal_macros.h -this-unit-id horde-ad-0.1.0.0-inplace -hide-all-packages -Wmissing-home-modules -no-user-package-db -package-db /home/simonpj/.cabal/store/ghc-9.3.20220316/package.db -package-db /home/simonpj/code/horde-ad/dist-newstyle/packagedb/ghc-9.3.20220316 -package-db /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/package.conf.inplace -package-id assert-failure-0.1.2.5-ea675a91e1e142052b8c09a6b71455a91a11c690fd2d01a36ad7f4cb68430354 -package-id base-4.16.0.0 -package-id bytestring-0.11.2.0 -package-id hmatrix-0.20.2-10eb67f511637ef41e2613b440db02f1222c9526116b87f7594e67978da7419b -package-id mnist-idx-0.1.3.0-93e628f7c64901a8a0e6eca70046a8335a120ecf78872be7a9055068328d629f -package-id pretty-show-1.10-ec297dfb3c4a097ca84c313cd9d74af54aa9c25eebb805b0f274cb7b06664ee1 -package-id random-1.2.1-dff0ac74d3869fa8e6c81ef2173a538040d7f12e33a327a869fd889135fe0d0a -package-id strict-containers-0.1-39349c55f55f295c5b921956ce232b6925f1851889adcb02cbb2cf17a2a8c34a -package-id transformers-0.5.6.2 -package-id vector-0.12.3.1-29f10f436fed96d2e5db20c8b101f6ed402d245d4282e430571bfefe8a438d14 -package-id zlib-0.6.2.3-c4451d988dbea589a8ceb6952bdc1d6e539a0b6208ae97abe04d017e7edb9c13 -XHaskell2010 -XMonoLocalBinds -XScopedTypeVariables -XOverloadedStrings -XBangPatterns -XRecordWildCards -XNamedFieldPuns -XMultiWayIf -XLambdaCase -XDefaultSignatures -XInstanceSigs -XPatternSynonyms -XStrictData -XTypeApplications -XFlexibleContexts src/HordeAd/Tool/MnistFcnnScalar.hs -Wall -Wcompat -Worphans -Wincomplete-uni-patterns -Wincomplete-record-updates -Wimplicit-prelude -Wmissing-home-modules -Widentities -Wredundant-constraints -Wmissing-export-lists -Wpartial-fields -Wunused-packages -fno-ignore-asserts -fexpose-all-unfoldings -fslpecialise-aggressively '-fsimpl-tick-factor=200' -Wall-missed-specialisations -hide-all-packages -fforce-recomp -ddump-simpl >& fcnn-scalar-no-spec.stg

[2] /home/simonpj/code/HEAD-5/inplace/bin/ghc-stage2 -c -no-link -fbuilding-cabal-package -O -static -dynamic-too -dynosuf dyn_o -dynhisuf dyn_hi -outputdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -odir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -hidir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -stubdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -i -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -ibench -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -optP-include -optP/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen/cabal_macros.h -hide-all-packages -Wmissing-home-modules -no-user-package-db -package-db /home/simonpj/.cabal/store/ghc-9.3.20220316/package.db -package-db /home/simonpj/code/horde-ad/dist-newstyle/packagedb/ghc-9.3.20220316 -package-db /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/package.conf.inplace -package-id base-4.16.0.0 -package-id criterion-1.5.13.0-5c0f15f9ba93e89d1ffb286bc5c4858ea41ec14be80fab0303ea7a74d39e171e -package-id deepseq-1.4.7.0 -package-id horde-ad-0.1.0.0-inplace -package-id random-1.2.1-dff0ac74d3869fa8e6c81ef2173a538040d7f12e33a327a869fd889135fe0d0a -package-id vector-0.12.3.1-29f10f436fed96d2e5db20c8b101f6ed402d245d4282e430571bfefe8a438d14 -XHaskell2010 -XMonoLocalBinds -XScopedTypeVariables -XOverloadedStrings -XBangPatterns -XRecordWildCards -XNamedFieldPuns -XMultiWayIf -XLambdaCase -XDefaultSignatures -XInstanceSigs -XPatternSynonyms -XStrictData -XTypeApplications -XFlexibleContexts bench/BenchMnistTools.hs -Wall -Wcompat -Worphans -Wincomplete-uni-patterns -Wincomplete-record-updates -Wimplicit-prelude -Wmissing-home-modules -Widentities -Wredundant-constraints -Wmissing-export-lists -Wpartial-fields -Wunused-packages -fno-ignore-asserts -fexpose-all-unfoldings -fspecialise-aggressively '-fsimpl-tick-factor=200' -Wall-missed-specialisations -rtsopts -hide-all-packages -fforce-recomp -ddump-rule-firings -fforce-recomp >& foo-rule-fired

Mikolaj commented 2 years ago

Notes from today's meeting (more in the google doc): the current idea is "never unbox dictionaries in w/w" and the agreed branch is GHC 9.2 (but branch 9.0 would not be a bad choice, either, because it's only one on which the original workaround fails after recent changes to the codebase, even with extra pragmas and proxies for them (can't compile on HEAD); or perhaps I can backport from 9.2 to 9.0 for my tests).

Mikolaj commented 2 years ago

Filed as https://gitlab.haskell.org/ghc/ghc/-/issues/21286.

Mikolaj commented 2 years ago

If backporting the fix to 9.2 (and 9.0?) is too hard, an alternative is to port get horde-ad buildable on HEAD, which requires porting https://hackage.haskell.org/package/ghc-tcplugins-extra (which may be enough or not to get ghc-typelits-knownnat and ghc-typelits-natnormalise ported) to HEAD and submiting a merge request to https://gitlab.haskell.org/ghc/head.hackage (and to the ghc-tcplugins-extra repo to let them know).

Edit: https://github.com/clash-lang/ghc-tcplugins-extra/issues/21

Mikolaj commented 2 years ago

I wasn't able to port https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 to GHC 9.2 and the type-level plugins we use are not ported to 9.4 nor HEAD yet, so I'm testing with the same old branch ghc-report-specialize as before that did not yet use the plugins. Let me first record all the data I gathered and summarize later on in another comment.

I fully reproduced with GHC 9.2.3 the recorded results with GHC 9.2.2 and BENCH1 and BENCH2 (the recorded data was for 10 runs of BENCH2 on a slightly older commit, but it matches well enough, considering). I also reproduced that removing the two SPECIALIZE pragmas from the program wreaks performance with BENCH1, but does not affect BENCH2, just as before. Here are the full results with the pragmas (BENCH1, then BENCH2):

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
   2,208,754,088 bytes allocated in the heap
          30,120 bytes copied during GC
      71,871,304 bytes maximum residency (9 sample(s))
       1,610,936 bytes maximum slop
             144 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       473 colls,     0 par    0.030s   0.030s     0.0001s    0.0013s
  Gen  1         9 colls,     0 par    0.013s   0.013s     0.0014s    0.0037s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.814s  (  7.816s elapsed)
  GC      time    0.042s  (  0.042s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    7.856s  (  7.858s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    282,680,033 bytes per MUT second

  Productivity  99.5% of total user, 99.5% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  73,220,358,784 bytes allocated in the heap
       1,808,752 bytes copied during GC
     119,593,208 bytes maximum residency (201 sample(s))
       2,928,528 bytes maximum slop
             262 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     16080 colls,     0 par   13.746s  13.759s     0.0009s    0.0030s
  Gen  1       201 colls,     0 par    2.945s   2.946s     0.0147s    0.0279s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   26.141s  ( 26.144s elapsed)
  GC      time   16.691s  ( 16.705s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   42.832s  ( 42.849s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,800,962,649 bytes per MUT second

  Productivity  61.0% of total user, 61.0% of total elapsed

Then I tried GHC 9.5.20220606 that contains https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7985 and https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7997, but not https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847. The results recorded for ghc-9.3.20220316 match very closely (within 1% plus error) both in BENCH1 and BENCH2. I also reproduced that removing the two SPECIALIZE pragmas from the program wreaks performance with BENCH1, but does not affect BENCH2, exactly as before. Here are the full results with the pragmas (BENCH1, then BENCH2):

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
  39,778,850,976 bytes allocated in the heap
       1,055,328 bytes copied during GC
      75,622,104 bytes maximum residency (11 sample(s))
       2,011,136 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9431 colls,     0 par    0.220s   0.221s     0.0000s    0.0020s
  Gen  1        11 colls,     0 par    0.022s   0.022s     0.0020s    0.0062s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   11.872s  ( 11.876s elapsed)
  GC      time    0.243s  (  0.243s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time   12.115s  ( 12.120s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,350,632,757 bytes per MUT second

  Productivity  98.0% of total user, 98.0% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  34,265,500,784 bytes allocated in the heap
         346,384 bytes copied during GC
     120,458,736 bytes maximum residency (190 sample(s))
       2,865,072 bytes maximum slop
             276 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6595 colls,     0 par   14.240s  14.264s     0.0022s    0.0060s
  Gen  1       190 colls,     0 par    3.091s   3.096s     0.0163s    0.0298s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.555s  (  8.549s elapsed)
  GC      time   17.331s  ( 17.360s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time   25.886s  ( 25.910s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,005,410,638 bytes per MUT second

  Productivity  33.0% of total user, 33.0% of total elapsed

Then I tried the same master branch of GHC but with cherry-picked all 3 commits from https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847, called 9.5.20220608. The results are the same as for GHC 9.5.20220606 with both BENCH1 and BENCH2, except that after removing the two SPECIALIZE pragmas from the program, the results do not change at all [edit: unfortunately, I can't reproduce this now, however hard I try; the lack of specialization seems to persist]. Therefore, the main issue from this ticket, the need for the pragmas despite -fexpose-all-unfoldings -fspecialise-aggressively, is completely gone [edit: alas, probably not].

Mikolaj commented 2 years ago

Summarizing, main issue from this ticket, the need for the pragmas despite -fexpose-all-unfoldings -fspecialise-aggressively, is completely gone [edit: unfortunately, I can't reproduce this now, however hard I try; the lack of specialization seems to persist] with the https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 fix. Also, the fix does not introduce any performance regressions in the two benchmarks being tested (as measured with the SPECIALIZE pragma workarounds in place).

Given that 9.5.20220608 does not have better performance than 9.2.3 with the two workaround SPECIALIZE pragmas, it seems the workaround is completely effective and the only effect of the 7847 fix is making the workaround spurious (TODO: remove the workaround once we switch to GHC >= 9.6 and till then run our benchmarks periodically with HEAD to ensure we don't miss any SPECIALIZE pragmas that need adding as the library changes; [edit: and actually add a few dozen pragmas right now that it's clear GHC 9.2 won't get the fix]).

There are minor issues visible in the data above and both are filed together as https://gitlab.haskell.org/ghc/ghc/-/issues/21715.

BENCH1 compiled with HEAD (with the 7847 fix, or with workaround, it's all the same) allocates 18 times more than 9.2.3, copies 35 times more, but has only 5% more maximum residency and uses 9% more total memory. It runs 54% slower. (GHC 9.0.2 is close to 9.2.3, but takes somewhat less total memory still, as seen in older comments.)

In BENCH2, 9.2.3 produces a binary that is is 65% slower (MUT time) than from HEAD but, based on older data above, GHC 9.0.2 produces a binary that is 18% faster than HEAD.

simonpj commented 2 years ago

BENCH1 compiled with HEAD (with the 7847 fix, or with workaround, it's all the same) allocates 18 times more than 9.2.3, copies 35 times more, but has only 5% more maximum residency and uses 9% more total memory. It runs 54% slower. (GHC 9.0.2 is close to 9.2.3, but takes somewhat less total memory still, as seen in older comments.)

In BENCH2, 9.2.3 produces a binary that is is 65% slower (MUT time) than from HEAD but, based on older data above, GHC 9.0.2 produces a binary that is 18% faster than HEAD.

These don't sound like "minor issues"! Are the precisely the issues you raise in GHC ticket 21715? There it says "54% slower whereas here you say "18 times slower". That's a big difference. I'm clearly confused

Mikolaj commented 2 years ago

These don't sound like "minor issues"!

Compared to 18 times slowdown fixed by 7847, 54% and 18% slowdown is minor. Also, I reported it here (e.g., at https://github.com/Mikolaj/horde-ad/issues/14#issuecomment-1069391932) three months ago, so that's old news.

Are the precisely the issues you raise in GHC ticket 21715?

Yes, precisely.

There it says "54% slower whereas here you say "18 times slower". That's a big difference. I'm clearly confused

I'm saying "allocates 18 times more", unless I'm missing something that I wrote.

Mikolaj commented 2 years ago

I've repeated the logging of GHC compilation output from https://github.com/Mikolaj/horde-ad/issues/14#issuecomment-1074191438. The difference is I used branch ghc-report-specialize with the testMnist2 SPECIALIZE pragma commented out and I was using HEAD with 7847.

The very problem spotted there seems to be fixed (and ghc-report-specialize is rather more likely to exhibit the problem than extra-pragmas used in the original test (and not compiling with HEAD)). In particular, I'm getting, from the first invocation with -ddump-simpl:

-- RHS size: {terms: 435, types: 555, coercions: 50, joins: 10/27}
testMnist2
  :: forall r.
     (Ord r, Floating r, IsScalar r) =>
     Int -> Int -> [MnistData r] -> Domain r -> r
[GblId,
 Arity=3,
 Str=<MP(A,1CS(C1(L)),A,A,A,A,A,A)><LP(LP(LP(LCS(C1(L)),LCS(C1(L)),LCS(C1(L)),LCS(L),A,A,LCS(L)),LCS(C1(L)),LCS(L),A),A,LCS(L),A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A)><SP(SP(L,SL),SP(SL),L,L,L)>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=NEVER}]
testMnist2
  = \ (@r_a4Ub)
      ($dOrd_a4Uc :: Ord r_a4Ub)
      ($dFloating_a4Ud :: Floating r_a4Ub)
      ($d(%,,,,%)_a4Ue :: IsScalar r_a4Ub) ->
      case ghc-prim:GHC.Types.eq_sel

that doesn't seem to unbox dictionaries (unless I'm looking at a wrong thing) and from second invocation, the one with -ddump-rule-firings:

Rule fired:
    SPEC/BenchMnistTools testMnist2 @Double (BenchMnistTools)

so the specialization seems to work fine.

When the pragma is enabled (and so performance is fine), I'm getting

Rule fired: SPEC testMnist2 (HordeAd.Tool.MnistFcnnScalar)

instead of the SPEC/BenchMnistTools rule firing and I get additionally something inlined from other modules and something to do with w/w

-- RHS size: {terms: 17, types: 66, coercions: 4, joins: 0/0}
HordeAd.Tool.MnistFcnnScalar.testMnist2_paramsL :: DomainL Double
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 45 0}]
HordeAd.Tool.MnistFcnnScalar.testMnist2_paramsL

...

HordeAd.Tool.MnistFcnnScalar.testMnist2_paramsV :: DomainV Double
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 45 0}]
HordeAd.Tool.MnistFcnnScalar.testMnist2_paramsV

...

-- RHS size: {terms: 353, types: 365, coercions: 40, joins: 11/13}
HordeAd.Tool.MnistFcnnScalar.$w$stestMnist2 [InlPrag=[2]]
  :: Int
     -> Int
     -> [MnistData Double]
     -> Domain Double
     -> ghc-prim:GHC.Prim.Double#
[GblId[StrictWorker([~, ~, !, ~])],
 Arity=4,
 Str=<L><L><SL><L>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=NEVER}]
HordeAd.Tool.MnistFcnnScalar.$w$stestMnist2

...

-- RHS size: {terms: 13, types: 8, coercions: 0, joins: 0/0}
HordeAd.Tool.MnistFcnnScalar.testMnist2_$stestMnist2 [InlPrag=[2]]
  :: Int -> Int -> [MnistData Double] -> Domain Double -> Double
[GblId,
 Arity=4,
 Str=<L><L><SL><L>,
 Cpr=1,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=4,unsat_ok=True,boring_ok=False)
         Tmpl= \ (widthHidden_sak0 [Occ=Once1] :: Int)
                 (widthHidden2_sak1 [Occ=Once1] :: Int)
                 (inputs_sak2 [Occ=Once1] :: [MnistData Double])
                 (params_sak3 [Occ=Once1] :: Domain Double) ->
                 case HordeAd.Tool.MnistFcnnScalar.$w$stestMnist2
                        widthHidden_sak0 widthHidden2_sak1 inputs_sak2 params_sak3
                 of ww_sanU [Occ=Once1]
                 { __DEFAULT ->
                 ghc-prim:GHC.Types.D# ww_sanU
                 }}]
HordeAd.Tool.MnistFcnnScalar.testMnist2_$stestMnist2

...

"SPEC testMnist2"
    forall ($d(%,,,,%)7_a4Yk :: IsScalar Double)
           ($dFloating_a4Yj :: Floating Double)
           ($dOrd_a4Yi :: Ord Double).
      testMnist2 @Double $dOrd_a4Yi $dFloating_a4Yj $d(%,,,,%)7_a4Yk
      = HordeAd.Tool.MnistFcnnScalar.testMnist2_$stestMnist2

Diffs of the whole logs are unreadable due to too much randomness.

Edit: fcnn-scalar-no-spec.stg.txt fcnn-scalar-spec.stg.txt

Mikolaj commented 2 years ago

After some more investigation, it seems !7847 really fixes the specialization, after all. See https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847#note_436933.

The corollary is that the workarounds we use are probably sufficient and so we can work with older GHCs by adding workarounds and verifying with HEAD (comparing performance with and without !7847) that the workarounds suffice.

After #11 is implemented, the outrageous BENCH1 performance problems without SPECIALIZE pragmas should vanish and I will try INLINEABLE pragmas as an extra or alternative workaround to try to make benchmarking reliable even for such fragile code as BENCH1, to make sure the #11 optimization results in a real performance improvement and not a measurement error due to GHC quirks.

Another corollary is that other recorded performance oddities (here and in #20), especially those with BENCH2, are most probably real and not caused/obscured by missed specialization, so they can be investigated separately (though perhaps using !7847 to save on workaround pragmas and/or verify none are missing). It is possible some of them are caused by the worker/wrapper clash with -fspecialise-aggressively, but it's possible that -fspecialise-aggressively is generally untested and has other problems as well. But I'd hate to move to annotating every function with INLINABLE (which doesn't suffice in my experience) and/or a collection of SPECIALIZE for several types and/or, as some do, INLINE. IMHO the compiler should be doing that if the developer is fine suffering larger binary sizes and longer compilation times without -O0.

Mikolaj commented 2 years ago

[Edit: I've swapped the order of test result snippets and thaks to that noticed that INLINABLE gives the same results as SPECIALIZE]

I've just performed an experiment suggested by @simonpj: I've added

{-# INLINABLE sumTrainableInputs #-}
{-# INLINABLE sumConstantData #-}

to src/HordeAd/Tool/MnistFcnnScalar.hs and benchmarked various setups with branch ghc-report-specialize.

In short, in BENCH1 this pair of pragmas is exactly exchangeable with the SPECIALIZE testMnist2 pragma and having them both doesn't improve the performance further. Surprisingly, it also significantly affects BENCH2, which is not fragile (it doesn't do useless work, hoping GHC spots it and optimizes away and not affected by SPECIALIZE). Detailed results follow, where ghc-9.5.20220606 is HEAD and ghc-9.5.20220608 is HEAD with !7847 added,

BENCH1

sad baseline: SPECIALIZE testMnist2 pragma is excluded (an no extra pragmas added)

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
 208,919,050,976 bytes allocated in the heap
      10,939,920 bytes copied during GC
      75,684,616 bytes maximum residency (17 sample(s))
       2,016,504 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     49924 colls,     0 par    1.087s   1.094s     0.0000s    0.0012s
  Gen  1        17 colls,     0 par    0.043s   0.043s     0.0025s    0.0044s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   20.689s  ( 20.684s elapsed)
  GC      time    1.130s  (  1.137s elapsed)
  EXIT    time    0.000s  (  0.009s elapsed)
  Total   time   21.819s  ( 21.830s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    10,098,225,306 bytes per MUT second

  Productivity  94.8% of total user, 94.7% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
 207,965,511,848 bytes allocated in the heap
      10,905,056 bytes copied during GC
      75,652,384 bytes maximum residency (12 sample(s))
       2,025,880 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     49678 colls,     0 par    0.897s   0.906s     0.0000s    0.0011s
  Gen  1        12 colls,     0 par    0.026s   0.026s     0.0022s    0.0055s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   16.884s  ( 16.874s elapsed)
  GC      time    0.924s  (  0.932s elapsed)
  EXIT    time    0.000s  (  0.004s elapsed)
  Total   time   17.809s  ( 17.810s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    12,317,082,244 bytes per MUT second

  Productivity  94.8% of total user, 94.7% of total elapsed

happy baseline: SPECIALIZE testMnist2 pragma is included

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
  39,778,850,976 bytes allocated in the heap
       1,055,320 bytes copied during GC
      75,622,104 bytes maximum residency (11 sample(s))
       2,011,136 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9431 colls,     0 par    0.240s   0.239s     0.0000s    0.0014s
  Gen  1        11 colls,     0 par    0.021s   0.021s     0.0019s    0.0045s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   11.961s  ( 11.963s elapsed)
  GC      time    0.260s  (  0.260s elapsed)
  EXIT    time    0.000s  (  0.007s elapsed)
  Total   time   12.221s  ( 12.230s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,325,722,800 bytes per MUT second

  Productivity  97.9% of total user, 97.8% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
  39,685,511,848 bytes allocated in the heap
       1,051,768 bytes copied during GC
      75,622,088 bytes maximum residency (11 sample(s))
       2,010,320 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9409 colls,     0 par    0.221s   0.221s     0.0000s    0.0017s
  Gen  1        11 colls,     0 par    0.023s   0.023s     0.0021s    0.0058s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   11.918s  ( 11.918s elapsed)
  GC      time    0.244s  (  0.244s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time   12.163s  ( 12.170s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,329,799,683 bytes per MUT second

  Productivity  98.0% of total user, 97.9% of total elapsed

SPECIALIZE out, INLINABLE in; the performance is probably the same as with SPECIALIZE, the differences being noise

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
  39,778,850,976 bytes allocated in the heap
       1,055,328 bytes copied during GC
      75,622,104 bytes maximum residency (11 sample(s))
       2,011,136 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9431 colls,     0 par    0.236s   0.236s     0.0000s    0.0011s
  Gen  1        11 colls,     0 par    0.022s   0.023s     0.0021s    0.0061s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   11.960s  ( 11.963s elapsed)
  GC      time    0.258s  (  0.258s elapsed)
  EXIT    time    0.000s  (  0.009s elapsed)
  Total   time   12.218s  ( 12.230s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,326,022,439 bytes per MUT second

  Productivity  97.9% of total user, 97.8% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist  -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
  39,685,511,848 bytes allocated in the heap
       1,051,768 bytes copied during GC
      75,622,088 bytes maximum residency (11 sample(s))
       2,010,320 bytes maximum slop
             157 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9409 colls,     0 par    0.223s   0.222s     0.0000s    0.0013s
  Gen  1        11 colls,     0 par    0.021s   0.022s     0.0020s    0.0047s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   11.933s  ( 11.935s elapsed)
  GC      time    0.244s  (  0.244s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time   12.177s  ( 12.180s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,325,817,377 bytes per MUT second

  Productivity  98.0% of total user, 98.0% of total elapsed

BENCH2

baseline (SPECIALIZE testMnist2 pragma is irrelevant for this benchmark)

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  34,265,500,784 bytes allocated in the heap
         346,392 bytes copied during GC
     120,458,736 bytes maximum residency (190 sample(s))
       2,865,072 bytes maximum slop
             276 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6595 colls,     0 par   14.347s  14.361s     0.0022s    0.0068s
  Gen  1       190 colls,     0 par    3.138s   3.140s     0.0165s    0.0307s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.678s  (  8.661s elapsed)
  GC      time   17.485s  ( 17.501s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time   26.163s  ( 26.170s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,948,643,953 bytes per MUT second

  Productivity  33.2% of total user, 33.1% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  34,036,245,656 bytes allocated in the heap
         354,312 bytes copied during GC
     123,780,712 bytes maximum residency (185 sample(s))
       2,890,416 bytes maximum slop
             269 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6601 colls,     0 par   13.943s  13.957s     0.0021s    0.0060s
  Gen  1       185 colls,     0 par    3.091s   3.093s     0.0167s    0.0304s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.285s  (  8.268s elapsed)
  GC      time   17.034s  ( 17.050s elapsed)
  EXIT    time    0.000s  (  0.002s elapsed)
  Total   time   25.319s  ( 25.320s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,108,024,872 bytes per MUT second

  Productivity  32.7% of total user, 32.7% of total elapsed

INLINABLE pragmas in

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  30,305,288,784 bytes allocated in the heap
         296,776 bytes copied during GC
     119,938,344 bytes maximum residency (176 sample(s))
       2,859,712 bytes maximum slop
             267 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5609 colls,     0 par   13.246s  13.259s     0.0024s    0.0064s
  Gen  1       176 colls,     0 par    3.305s   3.307s     0.0188s    0.0310s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.121s  (  8.109s elapsed)
  GC      time   16.551s  ( 16.565s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time   24.672s  ( 24.680s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,731,873,708 bytes per MUT second

  Productivity  32.9% of total user, 32.9% of total elapsed

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  30,288,245,656 bytes allocated in the heap
         296,800 bytes copied during GC
     119,971,584 bytes maximum residency (182 sample(s))
       2,897,656 bytes maximum slop
             271 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      5604 colls,     0 par   13.456s  13.469s     0.0024s    0.0071s
  Gen  1       182 colls,     0 par    3.159s   3.162s     0.0174s    0.0307s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.168s  (  8.155s elapsed)
  GC      time   16.616s  ( 16.631s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time   24.784s  ( 24.790s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,708,312,904 bytes per MUT second

  Productivity  33.0% of total user, 32.9% of total elapsed

Mikolaj commented 2 years ago

To help with https://gitlab.haskell.org/ghc/ghc/-/issues/21715, where GHC 9.0.2 is reported to be 18% faster than HEAD in BENCH2, I tested the INLINABLE pragmas with GHC 9.0.2.

baseline (SPECIALIZE testMnist2 pragma is irrelevant for this benchmark); the speedup and lower allocation over HEAD is reproduced:

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  29,993,632,336 bytes allocated in the heap
  36,535,814,368 bytes copied during GC
     119,503,384 bytes maximum residency (253 sample(s))
       2,919,120 bytes maximum slop
             268 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22918 colls,     0 par   11.925s  11.938s     0.0005s    0.0016s
  Gen  1       253 colls,     0 par    1.955s   1.956s     0.0077s    0.0318s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    8.028s  (  8.026s elapsed)
  GC      time   13.880s  ( 13.895s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   21.908s  ( 21.921s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,736,110,581 bytes per MUT second

  Productivity  36.6% of total user, 36.6% of total elapsed

INLINABLE pragmas in, leading to a disaster

~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
  59,994,964,336 bytes allocated in the heap
  36,165,580,184 bytes copied during GC
      87,233,960 bytes maximum residency (260 sample(s))
       2,557,304 bytes maximum slop
             186 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     51411 colls,     0 par   12.601s  12.623s     0.0002s    0.0021s
  Gen  1       260 colls,     0 par    1.596s   1.598s     0.0061s    0.0086s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   21.195s  ( 21.193s elapsed)
  GC      time   14.198s  ( 14.220s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   35.392s  ( 35.413s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,830,684,351 bytes per MUT second

  Productivity  59.9% of total user, 59.8% of total elapsed

Edit: For completeness, just as with other GHC versions, in the presence of SPECIALIZE, the INLINABLE pragmas don't affect GHC 9.0.2 in the BENCH1 test. However, unlike with HEAD, INLINABLE doesn't repair performance in the absence of SPECIALIZE and has a minimal impact in that case.

Mikolaj commented 2 years ago

Since GHC 9.0.2 reacts badly to INLINABLE (see the previous comment), there goes my idea to supplement -fexpose-all-unfoldings -fspecialise-aggressively with INLINABLE for each function in the codebase and never look back (and remove all SPECIALIZE and proxies needed by them and manual inlines needed by them, if any). I think I will stick with -fexpose-all-unfoldings -fspecialise-aggressively (removing it breaks performance) and with a few SPECIALIZE and test with !7847 whether we have enough of the latter and do A/B tests about whether any extra INLINABLE are needed and also compare results between GHC HEAD, 9.0.2 and GHC 9.2.3, with and without various sets of INLINABLE, and eventually benchmark using the GHC version that produces the fastest code for a test in hand.

Mikolaj commented 2 years ago

This is no longer a blocker, because we understand what's going and we've removed the code that was particularly susceptible to the performance swings.

What's going on is

missed specializations, but that is fixed in HEAD plus !7847 and we are confident that enough SPECIALIZE (perhaps with proxies or manual inlining) works around it
demand analysis woes 21736 and/or worker-wrapper not commuting with specialization and/or whatever other component of the fiasco is crucial is going on, but this catastrophically affects only naive and fragile code that offloads too much of its optimization to GHC

We've removed the code that was particularly susceptible to the performance swings in #11. The code was recursively accumulating delta expressions in the second component of dual numbers and then ignoring them, because it was only computing the value of the objective function, not a gradient. The new code consistently and repeatedly puts () into the second component of dual numbers. This is still wasteful and GHC apparently sometimes optimizes it away and sometimes refuses to, but that's a 20% runtime swing (with HEAD+7847) and no allocation difference, as opposed to many-fold differences.

However, note that this code is no longer naive and exotic, so it's likely a similar potential 20% speedup comes up in the wild often enough and, regardless of speed gains, performance consistency is very valuable, because 20% may easily mask a good algorithmic decision elsewhere and lead to performance-harming decisions due to the noise in benchmarks. BTW, 9.2.3 is a further 20% faster in exactly the same benchmark, see also 21715, which reports similar results with the fragile version of the code.

Anyway, let's keep this ticket open to remember to extend the workarounds to other benchmarks and maintain them as the code changes and avoid performance-wise naive code in order to prevent benchmarking noise.

Mikolaj commented 1 year ago

I can no longer reproduce these problems with GHC 9.6.1, though 9.4.4 is still heavily affected (most recently, these SPECIALIZE pragmas were needed https://github.com/Mikolaj/horde-ad/blob/a361d2694df4f181ee6256ac8d0748e66d5896d4/simplified/HordeAd/Core/AstInterpret.hs#L321-L426 to avoid 50% slowdown vs 9.6.1, but some must have been missed, because tests are still 20% slower on 9.4.4). The only remaining related performance discrepancies (reported as GHC tickets) are all around 10% runtime overhead, so surely not stemming from lack of specialization of any function on the hot path (since it tends to automatically disable specialization of all other functions that depend on the unspecialized code).

Success. Closing.