Closed Mikolaj closed 1 year ago
Update re GHC 9.2.2:
OTOH 9.2.1 spams so much that it's hard to tell; we should try 9.2.2 once it's out).
GHC 9.2.2 seems to spam just as much, which makes -Wall-missed-specialisations
unusable. I think the fix was not backported to 9.2.2, so we'd need to check master branch. The offending functions are probably mentioned in the -Wall-missed-specialisations
output, but it's hard to pick up from the noise --- definitely there is a difference in size between the output with and without specialization, but the diffs are big.
I've reproduced exactly the 2 scenarios with 9.2.2, with the same results as with 9.2.1.
The fix for too verbose -Wall-missed-specialisations
was back-ported, after all:
https://gitlab.haskell.org/ghc/ghc/-/issues/20744#note_412503
and when I looked more closely, indeed the spam is much less. However, I could not find the problematic functions, those for which SPECIALIZE was necessary, in the log. OTOH, -Wmissed-specialisations
produces no warnings at all.
I got stuck on teh cabal build
step with
Failed to build zlib-0.6.2.3. The failure occurred during the configure step.
Build log (
/home/simonpj/.cabal/logs/ghc-8.10.7/zlib-0.6.2.3-def443548e978facc1ac2e0de278eca4693b87b4f14d6e73f4b00178d6270d82.log
):
Configuring library for zlib-0.6.2.3..
cabal-3.6.2.0: Missing dependency on a foreign library:
* Missing (or bad) header file: zlib.h
* Missing (or bad) C library: z
This problem can usually be solved by installing the system package that
provides this library (you may need the "-dev" version). If the library is
already installed but in a non-standard location then you can use the flags
--extra-include-dirs= and --extra-lib-dirs= to specify where it is.If the
library file does exist, it may contain errors that are caught by the C
compiler at the preprocessing stage. In this case you can re-run configure
with the verbosity flag -v3 to see the error messages.
If the header file does exist, it may contain errors that are caught by the C
compiler at the preprocessing stage. In this case you can re-run configure
with the verbosity flag -v3 to see the error messages.
Not sure what to do next. I guess I have to install something -- but what?
If you're on an Debian-based system (e.g. Ubuntu) then
sudo apt install zlib1g-dev
Later on there are also harder deps to fill described here: https://github.com/Mikolaj/horde-ad/blob/master/README.md#compilation-from-source
BTW, the second commit (with the two SPECIALIZE) produces such crazy results (best to just do git checkout ghc-report-specialize
; master branch has similar results) in BENCH2:
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
720,813,256,744 bytes allocated in the heap
17,976,640 bytes copied during GC
119,619,152 bytes maximum residency (1882 sample(s))
2,890,176 bytes maximum slop
278 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 158435 colls, 0 par 158.464s 158.741s 0.0010s 0.0071s
Gen 1 1882 colls, 0 par 31.475s 31.519s 0.0167s 0.0331s
INIT time 0.000s ( 0.000s elapsed)
MUT time 274.205s (274.402s elapsed)
GC time 189.940s (190.261s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 464.145s (464.663s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,628,741,040 bytes per MUT second
Productivity 59.1% of total user, 59.1% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
287,734,398,208 bytes allocated in the heap
382,310,058,648 bytes copied during GC
124,421,656 bytes maximum residency (2157 sample(s))
2,946,616 bytes maximum slop
275 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 219047 colls, 0 par 122.137s 122.312s 0.0006s 0.0026s
Gen 1 2157 colls, 0 par 32.398s 32.433s 0.0150s 0.0311s
INIT time 0.000s ( 0.000s elapsed)
MUT time 81.031s ( 81.032s elapsed)
GC time 154.535s (154.746s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 235.566s (235.778s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,550,936,491 bytes per MUT second
Productivity 34.4% of total user, 34.4% of total elapsed
I'm just trying to unravel that. It may be related to laziness and the dirty tricks criterion
does to make sure it measures only the benchmarked code, excluding the setup code. However, just adding full deepseq forcing to the benchmarked function doesn't change anything [edit: pushed to branch ghc-report-specialize and verified]. The wild differences in allocation and GC may also be related to whether hmatrix allocates stuff on the heap or somewhere in C space, or to pinned/unpinned, but I don't know of any recent changes in GHC or libraries about that.
Edit: GHC 8.10.7 gives almost the same results as GHC 9.0.2.
Edit2: and the fact that I can only use GHC 9.2 in the recent code (due to type-level pattern matching that is only in 9.2. see #13), makes it so much harder to compare perf and discover particular GHC version flukes vs. consistent perf changes.
Edit3: perhaps what's going is that GHC 9.2.2 (or some newer library versions enabled by GHC 9.2.2) fail to fuse, e.g., vectors. That can cause much more allocation (allocation done not by GC, to be clear) and so much higher mutation time (worse cache locality), despite less time spent in GC. However, why failed fusion would cause so much less time spent in GC eludes me. Unless this has an independent cause.
Edit4: this ticket is important also because our benchmarks vs other Haskell libraries are worth nothing if we have such huge unexplained anomalies. Benchmarks against Python/C libraries may also give a wrong impression if the Haskell side is handicapped for no fundamental reason.
Edit5: the -h
heap profile shows a big difference in allocating the HordeAd.Core.Delta.DeltaScalar
values, which are gathered in the benchmarked function, but never unused. Perhaps they are evaluated in GHC 9.2.2, because the HordeAd.Core.DualNumber.DualNumber
datatype is strict (StrictData
), but not in GHC 9.0.2, because they are unused? Perhaps GHC 9.2.2 has a more aggressive (more correct?) strictness analysis? However, making the relevant filed lazy
data DualNumber a = D a ~(DeltaExpression a)
makes the -s
report for GHC 9.2.2 significantly worse rather than better (which is counterintuitive). So perhaps GHC 9.2.2 is worse at detecting and ignoring dead code?
The result of mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s -hi -l-aug
(BENCH2) from branch ghc-report-specialize with GHC 9.2.2 doesn't show much:
Profiling output (on master branch, commit 1f7926f6bf039163e67238a6bfcb3bfdbd6d9fe9, performance is similar, but profile looks quite different to branch ghc-report-specialize, after both diverged a bit lately, but that's probably just a difference in how profiling skews results) for the 9.2.2 run above (the one that is twice slower and allocates 3 times more than 9.0.2, but does almost no GC) does not confirm the failed fusion hypothesis. It assigns most of allocations to this code that should not permanently allocate anything (so perhaps there are some transient allocations? or is ST
broken with GHC 9.2.2?):
COST CENTRE MODULE SRC %time %alloc
buildVectors.eval0 HordeAd.Core.Delta src/HordeAd/Core/Delta.hs:(339,7)-(350,41) 22.0 7.4
var HordeAd.Core.PairOfVectors src/HordeAd/Core/PairOfVectors.hs:57:1-70 15.0 8.9
scale HordeAd.Core.DualNumber src/HordeAd/Core/DualNumber.hs:108:1-42 12.1 5.3
sumConstantData HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:(47,1)-(53,33) 11.1 10.7
+ HordeAd.Core.DualNumber src/HordeAd/Core/DualNumber.hs:50:3-42 9.7 6.3
buildVectors.eval0.\ HordeAd.Core.Delta src/HordeAd/Core/Delta.hs:342:21-42 7.4 23.6
sumConstantData.f HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:(50,7)-(52,26) 5.4 0.0
sumConstantData.f.v HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:51:13-46 4.3 6.4
* HordeAd.Core.DualNumber src/HordeAd/Core/DualNumber.hs:52:3-64 2.7 4.4
buildVectors.eval0.\ HordeAd.Core.Delta src/HordeAd/Core/Delta.hs:343:29-52 2.6 7.4
buildVectors.eval0.\ HordeAd.Core.Delta src/HordeAd/Core/Delta.hs:341:23-37 2.1 11.8
Eventlog profiling and heap profiles show no relevant thunks (on master branch I squelched most innocent thunks just to be sure).
Profiling output from branch ghc-report-specialize with GHC 9.2.2 shows a more believable story, but still no explanation for the difference vs 9.0.2 and still no support for the broken vector fusion hypothesis:
COST CENTRE MODULE SRC %time %alloc
buildVectors.eval0 HordeAd.Core.Delta src/HordeAd/Core/Delta.hs:(171,7)-(179,60) 21.3 0.0
var HordeAd.Core.PairOfVectors src/HordeAd/Core/PairOfVectors.hs:50:1-64 17.4 17.9
scale HordeAd.Core.DualNumber src/HordeAd/Core/DualNumber.hs:99:1-42 13.6 10.7
sumConstantData HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:(47,1)-(53,33) 13.0 21.4
+ HordeAd.Core.DualNumber src/HordeAd/Core/DualNumber.hs:41:3-42 11.3 12.8
sumConstantData.f HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:(50,7)-(52,26) 5.3 0.0
sumConstantData.f.v HordeAd.Tool.MnistFcnnScalar src/HordeAd/Tool/MnistFcnnScalar.hs:51:13-46 4.7 12.9
* HordeAd.Core.DualNumber src/HordeAd/Core/DualNumber.hs:43:3-64 3.1 8.8
Analogous profile for GHC 9.0.2 is quite similar.
@simonpj: A remark about the code that needs the two SPECIALIZE pragmas (BENCH1): this is a thick code path, where most of the code is not used (gradient and object function result are computed, the gradient is thrown away and only the result is kept). So perhaps the specialization is lost when some parts of the code are simplified away? Or some types are not instantiated, because the gradient is thrown away, and this prevents specialization somehow?
Another hint: a part of the problem goes away (the performance is improve partially) when an associated type (DeltaExpression
) is manually resolved and inlined in some function signature.
@simonpj: Might this be related: https://gitlab.haskell.org/ghc/ghc/-/issues/20709 ? If so, should I compile HEAD, try to compile horde-ad and check if any of the errors above is fixed?
Or is there a way to inhibit worker/wrapper or inlining of the wrapper?
If so, should I compile HEAD, try to compile horde-ad and check if any of the errors above is fixed?
It would be great if you could do that. I still have no idea if it is related thought
First results of BENCH2 with HEAD and head.hackage: the runtime and allocation regression in 9.2.2 is mostly gone [Edit: this is correct; only 14% regression in time and alloc remains]]:
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
34,321,569,296 bytes allocated in the heap
753,488 bytes copied during GC
120,328,568 bytes maximum residency (193 sample(s))
2,872,192 bytes maximum slop
255 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6592 colls, 0 par 14.090s 14.108s 0.0021s 0.0057s
Gen 1 193 colls, 0 par 2.845s 2.848s 0.0148s 0.0314s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.335s ( 8.320s elapsed)
GC time 16.935s ( 16.957s elapsed)
EXIT time 0.000s ( 0.004s elapsed)
Total time 25.270s ( 25.280s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,117,838,210 bytes per MUT second
Productivity 33.0% of total user, 32.9% of total elapsed
This is much closer to 9.2.2 results now. One thing that didn't change is copied during GC
, but that's probably a positive.
Edit: this is solved most probably by the fix to https://gitlab.haskell.org/ghc/ghc/-/issues/20364 / https://gitlab.haskell.org/ghc/ghc/-/issues/20709 that is in HEAD.
And with GHC HEAD there is no difference in -s
output after removing the two SPECIALIZE
pragmas. [Edit: I was wrong, because I used a wrong test (BENCH2, the one for the 9.2.2 regression). Below I confirm the problem with BENCH1 and needing the SPECIALIZE pragmas is only partially solved.]
The instructions to reproduce the HEAD experiment [the instructions are for BENCH1 and I should have used that, but I used BENCH2 instead];
git clone git@github.com:Mikolaj/horde-ad.git
git checkout ghc-report-specialize
build HEAD with Hadrian
add cabal.project.local
file with (see http://ghc.gitlab.haskell.org/head.hackage)
repository head.hackage.ghc.haskell.org
url: https://ghc.gitlab.haskell.org/head.hackage/
secure: True
key-threshold: 3
root-keys:
7541f32a4ccca4f97aea3b22f5e593ba2c0267546016b992dfadcd2fe944e55d
26021a13b401500c8eb2761ca95c61f2d625bfef951b939a8124ed12ecf07329
f76d08be13e9a61a377a85e2fb63f4c5435d40f8feb3e12eb05905edb8cdea89
cabal update
cabal build -w /path_to_ghc/_build/stage1/bin/ghc --allow-newer --enable-benchmarks
and then un the BENCH1 benchmark, substituting a correct path of the binary
/home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220610/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
And with GHC HEAD there is no difference -s output after removing the two SPECIALIZE pragmas.
Do you mean: with HEAD the problem is solved?
And with GHC HEAD there is no difference -s output after removing the two SPECIALIZE pragmas.
Do you mean: with HEAD the problem is solved?
[Edit: I was wrong, because I used a wrong test (BENCH2, the one for the 9.2.2 regression). The problem is only partially solved.] The problem that SPECIALIZE
is needed for BENCH1 despite -fexpose-all-unfoldings -fspecialise-aggressively
is completely solved. The problem that there is no warning about the lack of specialization despite -Wall-missed-specialisations
or despite -Wmissed-specialisations
(which produces no warning at all) is probably not solved, but no way to reproduce it now. The 2-fold runtime and 3-fold alloc regression in 9.2.2 vs 9.0.2 in BENCH2 is solved partially (there is 14% time and alloc regression still).
The 2-fold time and 3-fold alloc regression in 9.2.2 vs 9.0.2 is solved partially (there is 14% time and alloc regression still).
So HEAD is still 14% slower than 9.0? Can you give specific repro instructions? Thanks
So HEAD is still 14% slower than 9.0? Can you give specific repro instructions? Thanks
Yes, I just repeated the tests. The instructions are as above, but run 500/500
(BENCH2) instead of 500/test 500
(BENCH1). The same with 9.0.2.
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
34,321,569,296 bytes allocated in the heap
753,496 bytes copied during GC
120,328,568 bytes maximum residency (193 sample(s))
2,872,192 bytes maximum slop
255 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6592 colls, 0 par 14.022s 14.043s 0.0021s 0.0062s
Gen 1 193 colls, 0 par 2.969s 2.973s 0.0154s 0.0322s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.326s ( 8.314s elapsed)
GC time 16.991s ( 17.016s elapsed)
EXIT time 0.000s ( 0.010s elapsed)
Total time 25.317s ( 25.340s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,122,453,209 bytes per MUT second
Productivity 32.9% of total user, 32.8% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
29,993,632,336 bytes allocated in the heap
36,535,814,368 bytes copied during GC
119,503,384 bytes maximum residency (253 sample(s))
2,919,120 bytes maximum slop
268 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 22918 colls, 0 par 11.927s 11.945s 0.0005s 0.0016s
Gen 1 253 colls, 0 par 1.969s 1.972s 0.0078s 0.0307s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.982s ( 7.983s elapsed)
GC time 13.896s ( 13.917s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 21.878s ( 21.900s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,757,806,371 bytes per MUT second
Productivity 36.5% of total user, 36.5% of total elapsed
BTW, in some similar tests earlier on I had 20% speedup in 9.2.2 vs 9.0.2, so this may be a problem that eats up the 20% improvement (e.g., from the low GC copying) and adds 14% degradation on top.
Actually, for some reason I'm running a different test (BENCH2 instead of BENCH1). Let me retry with the one from the description of this ticket (BENCH1).
I apologise. I forgot there are two different tests, one for the 2 SPECIALIZE pragmas problem (BENCH1, present in 8.10.7, 9.0.2, 9.2.2, partially in HEAD), another for the slowdown and allocation blowup problem with the pragmas in place (BENCH2, regression in 9.2.2, fixed in HEAD, except for 14%). I was right HEAD mostly fixes the latter problem (14% slowdown and blowup remains). I was wrong that it fixes the former, because I used the wrong test to verify that.
The 2 SPECIALIZE pragmas problem in BENCH1 is only partially mitigated in HEAD. Also some new problems seem to be present in HEAD that were not in 9.2.2.
Here are the results for BENCH1 from branch ghc-report-specialize
(meaning, with the 2 SPECIALIZE pragmas):
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
39,694,459,432 bytes allocated in the heap
1,054,624 bytes copied during GC
75,629,632 bytes maximum residency (11 sample(s))
2,014,488 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9411 colls, 0 par 0.223s 0.223s 0.0000s 0.0013s
Gen 1 11 colls, 0 par 0.022s 0.022s 0.0020s 0.0052s
INIT time 0.000s ( 0.000s elapsed)
MUT time 11.956s ( 11.961s elapsed)
GC time 0.245s ( 0.244s elapsed)
EXIT time 0.000s ( 0.004s elapsed)
Total time 12.202s ( 12.210s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,319,972,986 bytes per MUT second
Productivity 98.0% of total user, 98.0% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
2,208,754,088 bytes allocated in the heap
30,112 bytes copied during GC
71,871,304 bytes maximum residency (9 sample(s))
1,610,936 bytes maximum slop
144 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 473 colls, 0 par 0.030s 0.030s 0.0001s 0.0012s
Gen 1 9 colls, 0 par 0.013s 0.013s 0.0014s 0.0039s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.680s ( 7.713s elapsed)
GC time 0.043s ( 0.043s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 7.723s ( 7.756s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 287,584,323 bytes per MUT second
Productivity 99.4% of total user, 99.4% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
2,463,534,528 bytes allocated in the heap
31,052,832 bytes copied during GC
63,802,840 bytes maximum residency (10 sample(s))
939,224 bytes maximum slop
129 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2173 colls, 0 par 0.050s 0.050s 0.0000s 0.0011s
Gen 1 10 colls, 0 par 0.011s 0.011s 0.0011s 0.0035s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.539s ( 8.554s elapsed)
GC time 0.061s ( 0.061s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 8.600s ( 8.615s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 288,502,093 bytes per MUT second
Productivity 99.3% of total user, 99.3% of total elapsed
And here are results with both SPECIALIZE pragmas removed (not other change):
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
208,834,659,432 bytes allocated in the heap
10,947,536 bytes copied during GC
75,686,056 bytes maximum residency (16 sample(s))
2,021,696 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 49915 colls, 0 par 1.063s 1.068s 0.0000s 0.0012s
Gen 1 16 colls, 0 par 0.041s 0.041s 0.0026s 0.0060s
INIT time 0.000s ( 0.000s elapsed)
MUT time 16.940s ( 16.943s elapsed)
GC time 1.105s ( 1.109s elapsed)
EXIT time 0.000s ( 0.008s elapsed)
Total time 18.045s ( 18.060s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 12,328,026,045 bytes per MUT second
Productivity 93.9% of total user, 93.8% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
641,729,879,528 bytes allocated in the heap
24,952,744 bytes copied during GC
75,682,672 bytes maximum residency (16 sample(s))
2,022,216 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 153353 colls, 0 par 2.523s 2.532s 0.0000s 0.0012s
Gen 1 16 colls, 0 par 0.041s 0.041s 0.0026s 0.0057s
INIT time 0.000s ( 0.000s elapsed)
MUT time 201.875s (202.046s elapsed)
GC time 2.564s ( 2.573s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 204.440s (204.619s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,178,841,045 bytes per MUT second
Productivity 98.7% of total user, 98.7% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
509,766,256,688 bytes allocated in the heap
13,188,041,760 bytes copied during GC
75,670,952 bytes maximum residency (18 sample(s))
2,020,440 bytes maximum slop
153 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 486527 colls, 0 par 6.651s 6.719s 0.0000s 0.0011s
Gen 1 18 colls, 0 par 0.042s 0.042s 0.0023s 0.0051s
INIT time 0.000s ( 0.000s elapsed)
MUT time 134.245s (134.266s elapsed)
GC time 6.693s ( 6.761s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 140.938s (141.027s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,797,286,411 bytes per MUT second
Productivity 95.3% of total user, 95.2% of total elapsed
I've repeated the SPECIALIZE test BENCH1 in GHC 9.2.2 with {-# OPTIONS_GHC -fno-worker-wrapper #-}
in the two files where SPECIALIZE pragmas were located. I'm getting the same '-s' report with the 2 SPECIALIZE pragmas present and without the pragmas. That means that either -fno-worker-wrapper
breaks specialization or that it actually fixes the very two specializations that the two pragmas were forcing. Both runtime and allocation is in-between the extremes from the previous comment, so it's likely -fno-worker-wrapper
helps with the specialization, but impedes performance elsewhere. That would confirm the connection to https://gitlab.haskell.org/ghc/ghc/-/issues/20364 and https://gitlab.haskell.org/ghc/ghc/-/issues/20709.
@simonpj: may I help with this one any more?
Notes after a call with Mikolaj. (@Mikolaj please feel free to edit) [MK: edited; and again]
Current status
Before recent changes After recent changes and with a few pragmas for BENCH1
on branch ghc-report-specialize on branch extra-pragmas
(in parens results with a dozen extra pragmas for BENCH2
on branch extra-pragmas-BENCH2)
BENCH1 SPECIALISE no SPEC prags SPECIALISE no SPEC prags
prags -fspec-agg etc prags
8.10.7 Fast Slow Fast Slow
9.0.2 Fast Slow Slow+ Slow
9.2.2 Fast Slow Fast Slow
HEAD Fast- Slow+ ? ?
BENCH2 SPECIALISE no SPEC prags SPECIALISE no spec prags
prags -fspec-agg etc prags
8.10.7 Fast Slow (Slow+)
9.0.2 Fast Slow (Slow+)
9.2.2 Slow Slow (Slow)
HEAD Fast- ?
Fast- is less fast and Slow+ is less slow, but there's still a huge difference between the two.
Can't test with HEAD in branch extra-pragmas
because it doesn't work
with typechecker plugins. Probably just needs updates of the plugins,
in head.hackage.
Key SPECIALISE pragma is in src/HordeAd/Tool/MnistFcnnScalar.hs, function
{-# SPECIALIZE testMnist2 :: Int -> Int -> [MnistData Double] -> Domain Double -> Double #-}
In particular, only this pragma has some effect on branch extra-pragmas
.
That function fires when compiling bench/BenchMnistTools.hs
[1] Command line to compile src/HordeAd/Tool/MnistFcnnScalar.hs
[2] Command line to compile bench/BenchMnistTools.hs
[1] /home/simonpj/code/HEAD-5/inplace/bin/ghc-stage2 -c -fbuilding-cabal-package -O -static -dynamic-too -dynosuf dyn_o -dynhisuf dyn_hi -outputdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -odir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -hidir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -stubdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -i -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -isrc -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -optP-include -optP/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen/cabal_macros.h -this-unit-id horde-ad-0.1.0.0-inplace -hide-all-packages -Wmissing-home-modules -no-user-package-db -package-db /home/simonpj/.cabal/store/ghc-9.3.20220316/package.db -package-db /home/simonpj/code/horde-ad/dist-newstyle/packagedb/ghc-9.3.20220316 -package-db /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/package.conf.inplace -package-id assert-failure-0.1.2.5-ea675a91e1e142052b8c09a6b71455a91a11c690fd2d01a36ad7f4cb68430354 -package-id base-4.16.0.0 -package-id bytestring-0.11.2.0 -package-id hmatrix-0.20.2-10eb67f511637ef41e2613b440db02f1222c9526116b87f7594e67978da7419b -package-id mnist-idx-0.1.3.0-93e628f7c64901a8a0e6eca70046a8335a120ecf78872be7a9055068328d629f -package-id pretty-show-1.10-ec297dfb3c4a097ca84c313cd9d74af54aa9c25eebb805b0f274cb7b06664ee1 -package-id random-1.2.1-dff0ac74d3869fa8e6c81ef2173a538040d7f12e33a327a869fd889135fe0d0a -package-id strict-containers-0.1-39349c55f55f295c5b921956ce232b6925f1851889adcb02cbb2cf17a2a8c34a -package-id transformers-0.5.6.2 -package-id vector-0.12.3.1-29f10f436fed96d2e5db20c8b101f6ed402d245d4282e430571bfefe8a438d14 -package-id zlib-0.6.2.3-c4451d988dbea589a8ceb6952bdc1d6e539a0b6208ae97abe04d017e7edb9c13 -XHaskell2010 -XMonoLocalBinds -XScopedTypeVariables -XOverloadedStrings -XBangPatterns -XRecordWildCards -XNamedFieldPuns -XMultiWayIf -XLambdaCase -XDefaultSignatures -XInstanceSigs -XPatternSynonyms -XStrictData -XTypeApplications -XFlexibleContexts src/HordeAd/Tool/MnistFcnnScalar.hs -Wall -Wcompat -Worphans -Wincomplete-uni-patterns -Wincomplete-record-updates -Wimplicit-prelude -Wmissing-home-modules -Widentities -Wredundant-constraints -Wmissing-export-lists -Wpartial-fields -Wunused-packages -fno-ignore-asserts -fexpose-all-unfoldings -fslpecialise-aggressively '-fsimpl-tick-factor=200' -Wall-missed-specialisations -hide-all-packages -fforce-recomp -ddump-simpl >& fcnn-scalar-no-spec.stg
[2] /home/simonpj/code/HEAD-5/inplace/bin/ghc-stage2 -c -no-link -fbuilding-cabal-package -O -static -dynamic-too -dynosuf dyn_o -dynhisuf dyn_hi -outputdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -odir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -hidir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -stubdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -i -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -ibench -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -optP-include -optP/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen/cabal_macros.h -hide-all-packages -Wmissing-home-modules -no-user-package-db -package-db /home/simonpj/.cabal/store/ghc-9.3.20220316/package.db -package-db /home/simonpj/code/horde-ad/dist-newstyle/packagedb/ghc-9.3.20220316 -package-db /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/package.conf.inplace -package-id base-4.16.0.0 -package-id criterion-1.5.13.0-5c0f15f9ba93e89d1ffb286bc5c4858ea41ec14be80fab0303ea7a74d39e171e -package-id deepseq-1.4.7.0 -package-id horde-ad-0.1.0.0-inplace -package-id random-1.2.1-dff0ac74d3869fa8e6c81ef2173a538040d7f12e33a327a869fd889135fe0d0a -package-id vector-0.12.3.1-29f10f436fed96d2e5db20c8b101f6ed402d245d4282e430571bfefe8a438d14 -XHaskell2010 -XMonoLocalBinds -XScopedTypeVariables -XOverloadedStrings -XBangPatterns -XRecordWildCards -XNamedFieldPuns -XMultiWayIf -XLambdaCase -XDefaultSignatures -XInstanceSigs -XPatternSynonyms -XStrictData -XTypeApplications -XFlexibleContexts bench/BenchMnistTools.hs -Wall -Wcompat -Worphans -Wincomplete-uni-patterns -Wincomplete-record-updates -Wimplicit-prelude -Wmissing-home-modules -Widentities -Wredundant-constraints -Wmissing-export-lists -Wpartial-fields -Wunused-packages -fno-ignore-asserts -fexpose-all-unfoldings -fspecialise-aggressively '-fsimpl-tick-factor=200' -Wall-missed-specialisations -rtsopts -hide-all-packages -fforce-recomp -ddump-rule-firings -fforce-recomp >& foo-rule-fired
Notes from today's meeting (more in the google doc): the current idea is "never unbox dictionaries in w/w" and the agreed branch is GHC 9.2 (but branch 9.0 would not be a bad choice, either, because it's only one on which the original workaround fails after recent changes to the codebase, even with extra pragmas and proxies for them (can't compile on HEAD); or perhaps I can backport from 9.2 to 9.0 for my tests).
If backporting the fix to 9.2 (and 9.0?) is too hard, an alternative is to port get horde-ad buildable on HEAD, which requires porting https://hackage.haskell.org/package/ghc-tcplugins-extra (which may be enough or not to get ghc-typelits-knownnat and ghc-typelits-natnormalise ported) to HEAD and submiting a merge request to https://gitlab.haskell.org/ghc/head.hackage (and to the ghc-tcplugins-extra repo to let them know).
Edit: https://github.com/clash-lang/ghc-tcplugins-extra/issues/21
I wasn't able to port https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 to GHC 9.2 and the type-level plugins we use are not ported to 9.4 nor HEAD yet, so I'm testing with the same old branch ghc-report-specialize
as before that did not yet use the plugins. Let me first record all the data I gathered and summarize later on in another comment.
I fully reproduced with GHC 9.2.3 the recorded results with GHC 9.2.2 and BENCH1 and BENCH2 (the recorded data was for 10 runs of BENCH2 on a slightly older commit, but it matches well enough, considering). I also reproduced that removing the two SPECIALIZE pragmas from the program wreaks performance with BENCH1, but does not affect BENCH2, just as before. Here are the full results with the pragmas (BENCH1, then BENCH2):
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
2,208,754,088 bytes allocated in the heap
30,120 bytes copied during GC
71,871,304 bytes maximum residency (9 sample(s))
1,610,936 bytes maximum slop
144 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 473 colls, 0 par 0.030s 0.030s 0.0001s 0.0013s
Gen 1 9 colls, 0 par 0.013s 0.013s 0.0014s 0.0037s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.814s ( 7.816s elapsed)
GC time 0.042s ( 0.042s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 7.856s ( 7.858s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 282,680,033 bytes per MUT second
Productivity 99.5% of total user, 99.5% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
73,220,358,784 bytes allocated in the heap
1,808,752 bytes copied during GC
119,593,208 bytes maximum residency (201 sample(s))
2,928,528 bytes maximum slop
262 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 16080 colls, 0 par 13.746s 13.759s 0.0009s 0.0030s
Gen 1 201 colls, 0 par 2.945s 2.946s 0.0147s 0.0279s
INIT time 0.000s ( 0.000s elapsed)
MUT time 26.141s ( 26.144s elapsed)
GC time 16.691s ( 16.705s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 42.832s ( 42.849s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,800,962,649 bytes per MUT second
Productivity 61.0% of total user, 61.0% of total elapsed
Then I tried GHC 9.5.20220606 that contains https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7985 and https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7997, but not https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847. The results recorded for ghc-9.3.20220316 match very closely (within 1% plus error) both in BENCH1 and BENCH2. I also reproduced that removing the two SPECIALIZE pragmas from the program wreaks performance with BENCH1, but does not affect BENCH2, exactly as before. Here are the full results with the pragmas (BENCH1, then BENCH2):
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
39,778,850,976 bytes allocated in the heap
1,055,328 bytes copied during GC
75,622,104 bytes maximum residency (11 sample(s))
2,011,136 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9431 colls, 0 par 0.220s 0.221s 0.0000s 0.0020s
Gen 1 11 colls, 0 par 0.022s 0.022s 0.0020s 0.0062s
INIT time 0.000s ( 0.000s elapsed)
MUT time 11.872s ( 11.876s elapsed)
GC time 0.243s ( 0.243s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 12.115s ( 12.120s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,350,632,757 bytes per MUT second
Productivity 98.0% of total user, 98.0% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
34,265,500,784 bytes allocated in the heap
346,384 bytes copied during GC
120,458,736 bytes maximum residency (190 sample(s))
2,865,072 bytes maximum slop
276 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6595 colls, 0 par 14.240s 14.264s 0.0022s 0.0060s
Gen 1 190 colls, 0 par 3.091s 3.096s 0.0163s 0.0298s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.555s ( 8.549s elapsed)
GC time 17.331s ( 17.360s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 25.886s ( 25.910s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,005,410,638 bytes per MUT second
Productivity 33.0% of total user, 33.0% of total elapsed
Then I tried the same master branch of GHC but with cherry-picked all 3 commits from https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847, called 9.5.20220608. The results are the same as for GHC 9.5.20220606 with both BENCH1 and BENCH2, except that after removing the two SPECIALIZE pragmas from the program, the results do not change at all [edit: unfortunately, I can't reproduce this now, however hard I try; the lack of specialization seems to persist]. Therefore, the main issue from this ticket, the need for the pragmas despite -fexpose-all-unfoldings -fspecialise-aggressively
, is completely gone [edit: alas, probably not].
Summarizing, main issue from this ticket, the need for the pragmas despite -fexpose-all-unfoldings -fspecialise-aggressively
, is completely gone [edit: unfortunately, I can't reproduce this now, however hard I try; the lack of specialization seems to persist] with the https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 fix. Also, the fix does not introduce any performance regressions in the two benchmarks being tested (as measured with the SPECIALIZE pragma workarounds in place).
Given that 9.5.20220608 does not have better performance than 9.2.3 with the two workaround SPECIALIZE pragmas, it seems the workaround is completely effective and the only effect of the 7847 fix is making the workaround spurious (TODO: remove the workaround once we switch to GHC >= 9.6 and till then run our benchmarks periodically with HEAD to ensure we don't miss any SPECIALIZE pragmas that need adding as the library changes; [edit: and actually add a few dozen pragmas right now that it's clear GHC 9.2 won't get the fix]).
There are minor issues visible in the data above and both are filed together as https://gitlab.haskell.org/ghc/ghc/-/issues/21715.
BENCH1 compiled with HEAD (with the 7847 fix, or with workaround, it's all the same) allocates 18 times more than 9.2.3, copies 35 times more, but has only 5% more maximum residency and uses 9% more total memory. It runs 54% slower. (GHC 9.0.2 is close to 9.2.3, but takes somewhat less total memory still, as seen in older comments.)
In BENCH2, 9.2.3 produces a binary that is is 65% slower (MUT time) than from HEAD but, based on older data above, GHC 9.0.2 produces a binary that is 18% faster than HEAD.
BENCH1 compiled with HEAD (with the 7847 fix, or with workaround, it's all the same) allocates 18 times more than 9.2.3, copies 35 times more, but has only 5% more maximum residency and uses 9% more total memory. It runs 54% slower. (GHC 9.0.2 is close to 9.2.3, but takes somewhat less total memory still, as seen in older comments.)
In BENCH2, 9.2.3 produces a binary that is is 65% slower (MUT time) than from HEAD but, based on older data above, GHC 9.0.2 produces a binary that is 18% faster than HEAD.
These don't sound like "minor issues"! Are the precisely the issues you raise in GHC ticket 21715? There it says "54% slower whereas here you say "18 times slower". That's a big difference. I'm clearly confused
These don't sound like "minor issues"!
Compared to 18 times slowdown fixed by 7847, 54% and 18% slowdown is minor. Also, I reported it here (e.g., at https://github.com/Mikolaj/horde-ad/issues/14#issuecomment-1069391932) three months ago, so that's old news.
Are the precisely the issues you raise in GHC ticket 21715?
Yes, precisely.
There it says "54% slower whereas here you say "18 times slower". That's a big difference. I'm clearly confused
I'm saying "allocates 18 times more", unless I'm missing something that I wrote.
I've repeated the logging of GHC compilation output from https://github.com/Mikolaj/horde-ad/issues/14#issuecomment-1074191438. The difference is I used branch ghc-report-specialize
with the testMnist2
SPECIALIZE pragma commented out and I was using HEAD with 7847.
The very problem spotted there seems to be fixed (and ghc-report-specialize
is rather more likely to exhibit the problem than extra-pragmas
used in the original test (and not compiling with HEAD)). In particular, I'm getting, from the first invocation with -ddump-simpl
:
-- RHS size: {terms: 435, types: 555, coercions: 50, joins: 10/27}
testMnist2
:: forall r.
(Ord r, Floating r, IsScalar r) =>
Int -> Int -> [MnistData r] -> Domain r -> r
[GblId,
Arity=3,
Str=<MP(A,1CS(C1(L)),A,A,A,A,A,A)><LP(LP(LP(LCS(C1(L)),LCS(C1(L)),LCS(C1(L)),LCS(L),A,A,LCS(L)),LCS(C1(L)),LCS(L),A),A,LCS(L),A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A)><SP(SP(L,SL),SP(SL),L,L,L)>,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=NEVER}]
testMnist2
= \ (@r_a4Ub)
($dOrd_a4Uc :: Ord r_a4Ub)
($dFloating_a4Ud :: Floating r_a4Ub)
($d(%,,,,%)_a4Ue :: IsScalar r_a4Ub) ->
case ghc-prim:GHC.Types.eq_sel
that doesn't seem to unbox dictionaries (unless I'm looking at a wrong thing) and from second invocation, the one with -ddump-rule-firings
:
Rule fired:
SPEC/BenchMnistTools testMnist2 @Double (BenchMnistTools)
so the specialization seems to work fine.
When the pragma is enabled (and so performance is fine), I'm getting
Rule fired: SPEC testMnist2 (HordeAd.Tool.MnistFcnnScalar)
instead of the SPEC/BenchMnistTools
rule firing and I get additionally something inlined from other modules and something to do with w/w
-- RHS size: {terms: 17, types: 66, coercions: 4, joins: 0/0}
HordeAd.Tool.MnistFcnnScalar.testMnist2_paramsL :: DomainL Double
[GblId,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 45 0}]
HordeAd.Tool.MnistFcnnScalar.testMnist2_paramsL
...
HordeAd.Tool.MnistFcnnScalar.testMnist2_paramsV :: DomainV Double
[GblId,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 45 0}]
HordeAd.Tool.MnistFcnnScalar.testMnist2_paramsV
...
-- RHS size: {terms: 353, types: 365, coercions: 40, joins: 11/13}
HordeAd.Tool.MnistFcnnScalar.$w$stestMnist2 [InlPrag=[2]]
:: Int
-> Int
-> [MnistData Double]
-> Domain Double
-> ghc-prim:GHC.Prim.Double#
[GblId[StrictWorker([~, ~, !, ~])],
Arity=4,
Str=<L><L><SL><L>,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True, Guidance=NEVER}]
HordeAd.Tool.MnistFcnnScalar.$w$stestMnist2
...
-- RHS size: {terms: 13, types: 8, coercions: 0, joins: 0/0}
HordeAd.Tool.MnistFcnnScalar.testMnist2_$stestMnist2 [InlPrag=[2]]
:: Int -> Int -> [MnistData Double] -> Domain Double -> Double
[GblId,
Arity=4,
Str=<L><L><SL><L>,
Cpr=1,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=4,unsat_ok=True,boring_ok=False)
Tmpl= \ (widthHidden_sak0 [Occ=Once1] :: Int)
(widthHidden2_sak1 [Occ=Once1] :: Int)
(inputs_sak2 [Occ=Once1] :: [MnistData Double])
(params_sak3 [Occ=Once1] :: Domain Double) ->
case HordeAd.Tool.MnistFcnnScalar.$w$stestMnist2
widthHidden_sak0 widthHidden2_sak1 inputs_sak2 params_sak3
of ww_sanU [Occ=Once1]
{ __DEFAULT ->
ghc-prim:GHC.Types.D# ww_sanU
}}]
HordeAd.Tool.MnistFcnnScalar.testMnist2_$stestMnist2
...
"SPEC testMnist2"
forall ($d(%,,,,%)7_a4Yk :: IsScalar Double)
($dFloating_a4Yj :: Floating Double)
($dOrd_a4Yi :: Ord Double).
testMnist2 @Double $dOrd_a4Yi $dFloating_a4Yj $d(%,,,,%)7_a4Yk
= HordeAd.Tool.MnistFcnnScalar.testMnist2_$stestMnist2
Diffs of the whole logs are unreadable due to too much randomness.
After some more investigation, it seems !7847 really fixes the specialization, after all. See https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847#note_436933.
The corollary is that the workarounds we use are probably sufficient and so we can work with older GHCs by adding workarounds and verifying with HEAD (comparing performance with and without !7847) that the workarounds suffice.
After #11 is implemented, the outrageous BENCH1 performance problems without SPECIALIZE pragmas should vanish and I will try INLINEABLE pragmas as an extra or alternative workaround to try to make benchmarking reliable even for such fragile code as BENCH1, to make sure the #11 optimization results in a real performance improvement and not a measurement error due to GHC quirks.
Another corollary is that other recorded performance oddities (here and in #20), especially those with BENCH2, are most probably real and not caused/obscured by missed specialization, so they can be investigated separately (though perhaps using !7847 to save on workaround pragmas and/or verify none are missing). It is possible some of them are caused by the worker/wrapper clash with -fspecialise-aggressively
, but it's possible that -fspecialise-aggressively
is generally untested and has other problems as well. But I'd hate to move to annotating every function with INLINABLE
(which doesn't suffice in my experience) and/or a collection of SPECIALIZE
for several types and/or, as some do, INLINE
. IMHO the compiler should be doing that if the developer is fine suffering larger binary sizes and longer compilation times without -O0
.
[Edit: I've swapped the order of test result snippets and thaks to that noticed that INLINABLE
gives the same results as SPECIALIZE
]
I've just performed an experiment suggested by @simonpj: I've added
{-# INLINABLE sumTrainableInputs #-}
{-# INLINABLE sumConstantData #-}
to src/HordeAd/Tool/MnistFcnnScalar.hs
and benchmarked various setups with branch ghc-report-specialize
.
In short, in BENCH1 this pair of pragmas is exactly exchangeable with the SPECIALIZE testMnist2
pragma and having them both doesn't improve the performance further. Surprisingly, it also significantly affects BENCH2, which is not fragile (it doesn't do useless work, hoping GHC spots it and optimizes away and not affected by SPECIALIZE
). Detailed results follow, where ghc-9.5.20220606 is HEAD and ghc-9.5.20220608 is HEAD with !7847 added,
sad baseline: SPECIALIZE testMnist2
pragma is excluded (an no extra pragmas added)
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
208,919,050,976 bytes allocated in the heap
10,939,920 bytes copied during GC
75,684,616 bytes maximum residency (17 sample(s))
2,016,504 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 49924 colls, 0 par 1.087s 1.094s 0.0000s 0.0012s
Gen 1 17 colls, 0 par 0.043s 0.043s 0.0025s 0.0044s
INIT time 0.000s ( 0.000s elapsed)
MUT time 20.689s ( 20.684s elapsed)
GC time 1.130s ( 1.137s elapsed)
EXIT time 0.000s ( 0.009s elapsed)
Total time 21.819s ( 21.830s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 10,098,225,306 bytes per MUT second
Productivity 94.8% of total user, 94.7% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
207,965,511,848 bytes allocated in the heap
10,905,056 bytes copied during GC
75,652,384 bytes maximum residency (12 sample(s))
2,025,880 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 49678 colls, 0 par 0.897s 0.906s 0.0000s 0.0011s
Gen 1 12 colls, 0 par 0.026s 0.026s 0.0022s 0.0055s
INIT time 0.000s ( 0.000s elapsed)
MUT time 16.884s ( 16.874s elapsed)
GC time 0.924s ( 0.932s elapsed)
EXIT time 0.000s ( 0.004s elapsed)
Total time 17.809s ( 17.810s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 12,317,082,244 bytes per MUT second
Productivity 94.8% of total user, 94.7% of total elapsed
happy baseline: SPECIALIZE testMnist2
pragma is included
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
39,778,850,976 bytes allocated in the heap
1,055,320 bytes copied during GC
75,622,104 bytes maximum residency (11 sample(s))
2,011,136 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9431 colls, 0 par 0.240s 0.239s 0.0000s 0.0014s
Gen 1 11 colls, 0 par 0.021s 0.021s 0.0019s 0.0045s
INIT time 0.000s ( 0.000s elapsed)
MUT time 11.961s ( 11.963s elapsed)
GC time 0.260s ( 0.260s elapsed)
EXIT time 0.000s ( 0.007s elapsed)
Total time 12.221s ( 12.230s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,325,722,800 bytes per MUT second
Productivity 97.9% of total user, 97.8% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
39,685,511,848 bytes allocated in the heap
1,051,768 bytes copied during GC
75,622,088 bytes maximum residency (11 sample(s))
2,010,320 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9409 colls, 0 par 0.221s 0.221s 0.0000s 0.0017s
Gen 1 11 colls, 0 par 0.023s 0.023s 0.0021s 0.0058s
INIT time 0.000s ( 0.000s elapsed)
MUT time 11.918s ( 11.918s elapsed)
GC time 0.244s ( 0.244s elapsed)
EXIT time 0.000s ( 0.008s elapsed)
Total time 12.163s ( 12.170s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,329,799,683 bytes per MUT second
Productivity 98.0% of total user, 97.9% of total elapsed
SPECIALIZE out, INLINABLE in; the performance is probably the same as with SPECIALIZE, the differences being noise
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
39,778,850,976 bytes allocated in the heap
1,055,328 bytes copied during GC
75,622,104 bytes maximum residency (11 sample(s))
2,011,136 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9431 colls, 0 par 0.236s 0.236s 0.0000s 0.0011s
Gen 1 11 colls, 0 par 0.022s 0.023s 0.0021s 0.0061s
INIT time 0.000s ( 0.000s elapsed)
MUT time 11.960s ( 11.963s elapsed)
GC time 0.258s ( 0.258s elapsed)
EXIT time 0.000s ( 0.009s elapsed)
Total time 12.218s ( 12.230s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,326,022,439 bytes per MUT second
Productivity 97.9% of total user, 97.8% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 10 -m prefix "2-hidden-layer MNIST nn with samples: 500/test 500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/test 500|150 s469160 v0 m0=469160
39,685,511,848 bytes allocated in the heap
1,051,768 bytes copied during GC
75,622,088 bytes maximum residency (11 sample(s))
2,010,320 bytes maximum slop
157 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 9409 colls, 0 par 0.223s 0.222s 0.0000s 0.0013s
Gen 1 11 colls, 0 par 0.021s 0.022s 0.0020s 0.0047s
INIT time 0.000s ( 0.000s elapsed)
MUT time 11.933s ( 11.935s elapsed)
GC time 0.244s ( 0.244s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 12.177s ( 12.180s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,325,817,377 bytes per MUT second
Productivity 98.0% of total user, 98.0% of total elapsed
baseline (SPECIALIZE testMnist2
pragma is irrelevant for this benchmark)
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
34,265,500,784 bytes allocated in the heap
346,392 bytes copied during GC
120,458,736 bytes maximum residency (190 sample(s))
2,865,072 bytes maximum slop
276 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6595 colls, 0 par 14.347s 14.361s 0.0022s 0.0068s
Gen 1 190 colls, 0 par 3.138s 3.140s 0.0165s 0.0307s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.678s ( 8.661s elapsed)
GC time 17.485s ( 17.501s elapsed)
EXIT time 0.000s ( 0.008s elapsed)
Total time 26.163s ( 26.170s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,948,643,953 bytes per MUT second
Productivity 33.2% of total user, 33.1% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
34,036,245,656 bytes allocated in the heap
354,312 bytes copied during GC
123,780,712 bytes maximum residency (185 sample(s))
2,890,416 bytes maximum slop
269 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6601 colls, 0 par 13.943s 13.957s 0.0021s 0.0060s
Gen 1 185 colls, 0 par 3.091s 3.093s 0.0167s 0.0304s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.285s ( 8.268s elapsed)
GC time 17.034s ( 17.050s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 25.319s ( 25.320s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,108,024,872 bytes per MUT second
Productivity 32.7% of total user, 32.7% of total elapsed
INLINABLE pragmas in
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
30,305,288,784 bytes allocated in the heap
296,776 bytes copied during GC
119,938,344 bytes maximum residency (176 sample(s))
2,859,712 bytes maximum slop
267 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5609 colls, 0 par 13.246s 13.259s 0.0024s 0.0064s
Gen 1 176 colls, 0 par 3.305s 3.307s 0.0188s 0.0310s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.121s ( 8.109s elapsed)
GC time 16.551s ( 16.565s elapsed)
EXIT time 0.000s ( 0.006s elapsed)
Total time 24.672s ( 24.680s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,731,873,708 bytes per MUT second
Productivity 32.9% of total user, 32.9% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
30,288,245,656 bytes allocated in the heap
296,800 bytes copied during GC
119,971,584 bytes maximum residency (182 sample(s))
2,897,656 bytes maximum slop
271 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5604 colls, 0 par 13.456s 13.469s 0.0024s 0.0071s
Gen 1 182 colls, 0 par 3.159s 3.162s 0.0174s 0.0307s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.168s ( 8.155s elapsed)
GC time 16.616s ( 16.631s elapsed)
EXIT time 0.000s ( 0.005s elapsed)
Total time 24.784s ( 24.790s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,708,312,904 bytes per MUT second
Productivity 33.0% of total user, 32.9% of total elapsed
To help with https://gitlab.haskell.org/ghc/ghc/-/issues/21715, where GHC 9.0.2 is reported to be 18% faster than HEAD in BENCH2, I tested the INLINABLE
pragmas with GHC 9.0.2.
baseline (SPECIALIZE testMnist2 pragma is irrelevant for this benchmark); the speedup and lower allocation over HEAD is reproduced:
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
29,993,632,336 bytes allocated in the heap
36,535,814,368 bytes copied during GC
119,503,384 bytes maximum residency (253 sample(s))
2,919,120 bytes maximum slop
268 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 22918 colls, 0 par 11.925s 11.938s 0.0005s 0.0016s
Gen 1 253 colls, 0 par 1.955s 1.956s 0.0077s 0.0318s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.028s ( 8.026s elapsed)
GC time 13.880s ( 13.895s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 21.908s ( 21.921s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,736,110,581 bytes per MUT second
Productivity 36.6% of total user, 36.6% of total elapsed
INLINABLE pragmas in, leading to a disaster
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
59,994,964,336 bytes allocated in the heap
36,165,580,184 bytes copied during GC
87,233,960 bytes maximum residency (260 sample(s))
2,557,304 bytes maximum slop
186 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 51411 colls, 0 par 12.601s 12.623s 0.0002s 0.0021s
Gen 1 260 colls, 0 par 1.596s 1.598s 0.0061s 0.0086s
INIT time 0.000s ( 0.000s elapsed)
MUT time 21.195s ( 21.193s elapsed)
GC time 14.198s ( 14.220s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 35.392s ( 35.413s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,830,684,351 bytes per MUT second
Productivity 59.9% of total user, 59.8% of total elapsed
Edit: For completeness, just as with other GHC versions, in the presence of SPECIALIZE
, the INLINABLE
pragmas don't affect GHC 9.0.2 in the BENCH1 test. However, unlike with HEAD, INLINABLE
doesn't repair performance in the absence of SPECIALIZE
and has a minimal impact in that case.
Since GHC 9.0.2 reacts badly to INLINABLE
(see the previous comment), there goes my idea to supplement -fexpose-all-unfoldings -fspecialise-aggressively
with INLINABLE
for each function in the codebase and never look back (and remove all SPECIALIZE
and proxies needed by them and manual inlines needed by them, if any). I think I will stick with -fexpose-all-unfoldings -fspecialise-aggressively
(removing it breaks performance) and with a few SPECIALIZE
and test with !7847 whether we have enough of the latter and do A/B tests about whether any extra INLINABLE
are needed and also compare results between GHC HEAD, 9.0.2 and GHC 9.2.3, with and without various sets of INLINABLE
, and eventually benchmark using the GHC version that produces the fastest code for a test in hand.
This is no longer a blocker, because we understand what's going and we've removed the code that was particularly susceptible to the performance swings.
What's going on is
We've removed the code that was particularly susceptible to the performance swings in #11. The code was recursively accumulating delta expressions in the second component of dual numbers and then ignoring them, because it was only computing the value of the objective function, not a gradient. The new code consistently and repeatedly puts ()
into the second component of dual numbers. This is still wasteful and GHC apparently sometimes optimizes it away and sometimes refuses to, but that's a 20% runtime swing (with HEAD+7847) and no allocation difference, as opposed to many-fold differences.
However, note that this code is no longer naive and exotic, so it's likely a similar potential 20% speedup comes up in the wild often enough and, regardless of speed gains, performance consistency is very valuable, because 20% may easily mask a good algorithmic decision elsewhere and lead to performance-harming decisions due to the noise in benchmarks. BTW, 9.2.3 is a further 20% faster in exactly the same benchmark, see also 21715, which reports similar results with the fragile version of the code.
Anyway, let's keep this ticket open to remember to extend the workarounds to other benchmarks and maintain them as the code changes and avoid performance-wise naive code in order to prevent benchmarking noise.
I can no longer reproduce these problems with GHC 9.6.1, though 9.4.4 is still heavily affected (most recently, these SPECIALIZE pragmas were needed https://github.com/Mikolaj/horde-ad/blob/a361d2694df4f181ee6256ac8d0748e66d5896d4/simplified/HordeAd/Core/AstInterpret.hs#L321-L426 to avoid 50% slowdown vs 9.6.1, but some must have been missed, because tests are still 20% slower on 9.4.4). The only remaining related performance discrepancies (reported as GHC tickets) are all around 10% runtime overhead, so surely not stemming from lack of specialization of any function on the hot path (since it tends to automatically disable specialization of all other functions that depend on the unspecialized code).
Success. Closing.
In short: despite
-fexpose-all-unfoldings -fspecialise-aggressively
, twoSPECIALIZE
pragmas are required to significantly speed up and lower allocation for the code. Confirmed with GHC 8.10.7, 9.0.2 and 9.2.1.That would imply that there is a specialization that is not performed by GHC without the pragmas. However, despite
-Wall-missed-specialisations
, no missed specialization is reported in the code withoutSPECIALIZE
(with 8.10.7 and 9.0.2; ~OTOH 9.2.1 spams so much that it's hard to tell; we should try 9.2.2 once it's out~ [edit: 9.2.2 repro done, see below; in short: less spam, but no relevant warning]).It would be great to understand if these are GHC bugs that needs to be reported and whether we may have other not applied specilizations that
-Wall-missed-specialisations
is not telling us about. Is there a cheap way to find them? Can we fix them or rule them out somehow even without finding them individually? Any there any ideas emerging from this case about how to speed up this particular part of code (which is computing the value of the objective function, ignoring derivatives; #11)?Reproducing BENCH1 with ghc-9.0.2:
This provides the baseline, in which things are apparently not specialized. Note that there is no warning about missing specialization. For me, the results are
The following only adds a commit that introduces the two SPECIALIZE pragmas. [Edit: and then two others that I added only to eliminate thunks that would obscure what's going on, No significant change of results due to their elimination.]
My results are
The real story, the one recorded on master branch, is that I originally applied a workaround that apparently specialized some things without needing SPECIALIZE. It sits halfway between the two extremes above performance-wise. Probably not important, but it may indicate where the lack of specialization is located. Commit https://github.com/Mikolaj/horde-ad/commit/229a9b834c80a1a98ece769ee49a634951f25405 from master branch, while adding the two SPECIALIZE, removes the workaround (and the commit message fails to explain what's going on).