Closed Mikolaj closed 1 year ago
@simonpj and @goldfirere this may need your input?
According to what we planned during our meeting today, I've started a two-fold investigation of the performance drop:
-fspecialise-aggressively
] for all exported functions along the code path of test BENCH2, on branch extra-pragmas-BENCH2
, which only partially helped with GHC 8.10.7 and 9.0.2 and made no difference with GHC 9.2.2. This indicates the slowdown in at most partially from the SPECIALIZE bug reproducible with BENCH1 that we are about to fix (https://github.com/Mikolaj/horde-ad/issues/14#issuecomment-1076443693). So we should not wait for the fix and investigate independently.ghc-report-specialize
and after branch is extra-pragmas-BENCH2
. A reminder: after is twice slower and allocates three times more than before. I attach the two profiles: mnist.9.0.2.before.BENCH2.prof.txt mnist.9.0.2.after.BENCH2.prof.txtSome observations: the allocation is unchanged except for four functions that allocate all the extra heap space, the biggest three of which are easy to pinpoint, because they are the right hand sides of pattern matching in the case expression in the three last lines of this snippet:
Regarding the slowdown, it seems to just follow the blowup, because the same functions seem to be called the same number of times, just taking proportionally more time if they are entangled in the spurious allocations. E.g., here's the function corresponding to pattern Add0
in profiles before and after:
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc (inherited)
buildVectors.eval0.\ HordeAd.Core.Delta src/HordeAd/Core/Delta.hs:174:26-47 1751 272510500 3.8 0.0 3.8 0.0
buildFinMaps.eval0.\ HordeAd.Internal.Delta src/HordeAd/Internal/Delta.hs:385:21-42 1837 272510500 12.3 22.6 27.7 38.1
Profiling remarks from 2 weeks ago in the middle section of https://github.com/Mikolaj/horde-ad/issues/14#issuecomment-1062250358 show the same pattern, so this is not caused by the recent overhaul of Dual class typing, but has been introduced at some point between four and two weeks ago. If that's closer to four weeks ago, it may be possible to bisect and pinpoint the offending commit (I think I only checked performance with 9.2.2 at that time, so I missed it), if closer to two weeks ago, bisecting wouldn't be possible, because we couldn't use 8.10.7 and 9.0.2 during this period and 9.2.2 was slow both before and after.
I've looked at Core code obtained via {-# OPTIONS_GHC -ddump-stg #-}
from file Delta.hs. Perhaps it's not the right thing to look at, because it's not yet specialized. Given that the core for the Add0
case looks the same, the slowdown probably comes from specialization or simplification when this function is called elsewhere and so there's no important difference in the core below. We'd need to obtain core code from where the function is used. Anyway, here's the first step.
After the code overhaul (when it became slow) it's (note ww85_s2eHQ)
...
case ww6_s2eGr of {
Internal.Numeric.C:Product ww75_s2eHG [Occ=Once1!]
_ [Occ=Dead]
_ [Occ=Dead]
_ [Occ=Dead]
_ [Occ=Dead]
_ [Occ=Dead]
_ [Occ=Dead] ->
case ww75_s2eHG of {
GHC.Num.C:Num ww83_s2eHO
_ [Occ=Dead]
ww85_s2eHQ
_ [Occ=Dead]
_ [Occ=Dead]
_ [Occ=Dead]
ww89_s2eHU ->
...
eval0_s2eIa [Occ=LoopBreaker]
:: r_aWpP
-> HordeAd.Internal.Delta.Delta0 r_aWpP
-> GHC.ST.STRep GHC.Prim.RealWorld ()
[LclId, Arity=3, Str=<S,1*U><S,1*U><L,U>, Unf=OtherCon []] =
\r [r1_s2gk5 eta6_s2gk6 eta7_s2gk7]
case r1_s2gk5 of r2_s2gk8 {
__DEFAULT ->
case eta6_s2gk6 of {
HordeAd.Internal.Delta.Zero0 -> (#,#) [eta7_s2gk7 GHC.Tuple.()];
HordeAd.Internal.Delta.Scale0 k_s2gka [Occ=Once1]
d_s2gkb [Occ=Once1] ->
case ww85_s2eHQ k_s2gka r2_s2gk8 of sat_s2gkc [Occ=Once1] {
__DEFAULT -> eval0_s2eIa sat_s2gkc d_s2gkb eta7_s2gk7;
};
HordeAd.Internal.Delta.Add0 d_s2gkd [Occ=Once1]
e_s2gke [Occ=Once1] ->
case eval0_s2eIa r2_s2gk8 d_s2gkd eta7_s2gk7 of {
(#,#) ipv2_s2gkg [Occ=Once1] _ [Occ=Dead] ->
eval0_s2eIa r2_s2gk8 e_s2gke ipv2_s2gkg;
};
while before the code overhaul (when it was still fast) it's (note GHC.Num.* instead of ww85_s2eHQ; other than that it seems the same)
eval0_spgW [Occ=LoopBreaker]
:: r_agec
-> HordeAd.Core.Delta.DeltaScalar
r_agec
-> GHC.ST.STRep
GHC.Prim.RealWorld
()
[LclId,
Arity=3,
Str=<S,1*U><S,1*U><L,U>,
Unf=OtherCon []] =
\r [r1_sptN
eta5_sptO
eta6_sptP]
case
r1_sptN
of
r2_sptQ
{
__DEFAULT ->
case
eta5_sptO
of
{ HordeAd.Core.Delta.ZeroScalar ->
(#,#) [eta6_sptP
GHC.Tuple.()];
HordeAd.Core.Delta.ScaleScalar k_sptS [Occ=Once1]
d_sptT [Occ=Once1] ->
case
GHC.Num.*
$dNum1_spgf
k_sptS
r2_sptQ
of
sat_sptU [Occ=Once1]
{
__DEFAULT ->
eval0_spgW
sat_sptU
d_sptT
eta6_sptP;
};
HordeAd.Core.Delta.AddScalar d_sptV [Occ=Once1]
e_sptW [Occ=Once1] ->
case
eval0_spgW
r2_sptQ
d_sptV
eta6_sptP
of
{
(#,#) ipv8_sptY [Occ=Once1]
_ [Occ=Dead] ->
eval0_spgW
r2_sptQ
e_sptW
ipv8_sptY;
};
For completeness, below are all occurences of eval0
.
Before (fast) it's
while after (slow, with lots of SPECIALIZE to Double
in order to eliminate the possibility that the BENCH1 bug is the culprit; perhaps we can ignore the last two of four eval0
versions that come from that) it's
I'm now pausing the investigation until I get any feedback or the fix to #14 is available for testing. I'm sooo perplexed. The code that specializes is slow, the one that (erroneously) doesn't, is fast. Perhaps when #14 is fixed, both will be specialized (without much changed performance) and so it won't look so weird.
I've looked at Core code output of {-# OPTIONS_GHC -ddump-stg #-}
in HordeAd.Core.Optimizer
where eval0
is used and the code is specialized to Double
in the after (slow) variant (probably because I've added the SPECIALIZE pragmas). It looks legit, with Double
unboxing, etc. But it's slow.
$weval0_sLUO [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
:: GHC.Prim.Double#
-> HordeAd.Internal.Delta.Delta0
(HordeAd.Core.DualClass.Primal
(HordeAd.Internal.Delta.Delta0 GHC.Types.Double))
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
[LclId, Arity=3, Str=<L,U><S,1*U><L,U>, Unf=OtherCon []] =
\r [ww5_sLUU w5_sLUV w6_sLUW]
case w5_sLUV of {
HordeAd.Internal.Delta.Zero0 -> (#,#) [w6_sLUW GHC.Tuple.()];
HordeAd.Internal.Delta.Scale0 k_sLUY [Occ=Once1!]
d_sLUZ [Occ=Once1] ->
case k_sLUY of {
GHC.Types.D# x_sLV1 [Occ=Once1] ->
case *## [x_sLV1 ww5_sLUU] of sat_sLV2 [Occ=Once1] {
__DEFAULT -> $weval0_sLUO sat_sLV2 d_sLUZ w6_sLUW;
};
};
HordeAd.Internal.Delta.Add0 d_sLV3 [Occ=Once1]
e_sLV4 [Occ=Once1] ->
case $weval0_sLUO ww5_sLUU d_sLV3 w6_sLUW of {
(#,#) ipv2_sLV6 [Occ=Once1] _ [Occ=Dead] ->
$weval0_sLUO ww5_sLUU e_sLV4 ipv2_sLV6;
};
The Core code output in the before (fast) version has no specialization (which is a bug), even though the required unfolding seems present in the .hi file. But it's fast.
case
HordeAd.Core.Delta.evalBindings
$dEq_s1n89
$dNumeric_s1n8k
$dNum_s1n8i
sat_s1n9b
sat_s1n9g
sat_s1n9l
st_s1n93
d_s1n96
of
Given that, the actual core used for the before (fast) variant is probably the same as in the comment above, which seems very similar, except for the Double
specialization and so unboxing. Two things stand out, but I have no idea if they matter: in the code from previous comment, there is Str=<S,1*U><S,1*U><L,U>
, while in the specialized code we have Str=<L,U><S,1*U><L,U>
and there is a silly type in the specialized code (HordeAd.Core.DualClass.Primal (HordeAd.Internal.Delta.Delta0 GHC.Types.Double))
, which should just be Double
(but compiler knows that and unboxes as expected).
Edit: but why is the Double
boxed inside the Scale0
constructor? It should be unpacked! Isn't the datatype specialized to Double
? Is parameterized datatype not a zero-cost abstraction? Would UNPACK
pragma help? [edit: I now remember this is an old and known problem and, e.g., https://hackage.haskell.org/package/unpacked-containers works around it; I've opened #23 for that.]
Edit2: adding Strict
to Delta.hs didn't change anything, so it's not strictness (StrictData
is from day 1 in the .cabal file).
I'm desperate, because we are over 2 months without reliable benchmarks and again today I refrained from a refactoring, because I have no way of reliably measuring if it's acceptable performance-wise (it introduces some GADTs). Could somebody please have a look at what I've found so far and offer any feedback? #14 is still not fixed, but I can't wait any longer to see if #14 is the culprit or obscures the investigation. I'm going to spend some more time on this ticket, but all this time will be wasted if I'm missing something obvious. Please help.
@simonpj and @goldfirere this may need your input?
I agree with @awf here. I'm afraid my GHC knowledge is insufficient to help with this.
This kind of thing is definitely more in @simonpj's wheelhouse than mine. But I'm happy to discuss on a call including Simon -- I can learn new things. Is that the best way forward?
Results from a new series of tests. Some of these have plugins ripped off and so can include results with HEAD and with https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 applied to HEAD --- ghc-9.5.20220606
is HEAD from 20220608 and ghc-9.5.20220608
additionally has https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 applied.
Branch ghc-report-specialize
, the baseline. This also exactly reproduces the measurements recorded in #14 (but note that ghc-9.3.20220316 and ghc-9.5.20220606 differ a lot):
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-8.10.7/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
29,969,255,120 bytes allocated in the heap
38,317,259,072 bytes copied during GC
124,406,080 bytes maximum residency (227 sample(s))
2,922,560 bytes maximum slop
274 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 22944 colls, 0 par 11.853s 11.909s 0.0005s 0.0016s
Gen 1 227 colls, 0 par 3.197s 3.210s 0.0141s 0.0296s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.819s ( 7.844s elapsed)
GC time 15.050s ( 15.119s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 22.869s ( 22.963s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,832,835,823 bytes per MUT second
Productivity 34.2% of total user, 34.2% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
29,993,632,336 bytes allocated in the heap
36,535,814,368 bytes copied during GC
119,503,384 bytes maximum residency (253 sample(s))
2,919,120 bytes maximum slop
268 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 22918 colls, 0 par 11.871s 11.882s 0.0005s 0.0020s
Gen 1 253 colls, 0 par 1.945s 1.946s 0.0077s 0.0311s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.887s ( 7.885s elapsed)
GC time 13.816s ( 13.828s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 21.703s ( 21.714s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,802,719,039 bytes per MUT second
Productivity 36.3% of total user, 36.3% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
73,220,358,784 bytes allocated in the heap
1,808,744 bytes copied during GC
119,593,208 bytes maximum residency (201 sample(s))
2,928,528 bytes maximum slop
262 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 16080 colls, 0 par 13.622s 13.636s 0.0008s 0.0027s
Gen 1 201 colls, 0 par 2.961s 2.962s 0.0147s 0.0284s
INIT time 0.000s ( 0.000s elapsed)
MUT time 26.252s ( 26.256s elapsed)
GC time 16.583s ( 16.598s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 42.835s ( 42.854s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,789,111,099 bytes per MUT second
Productivity 61.3% of total user, 61.3% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
34,265,500,784 bytes allocated in the heap
346,392 bytes copied during GC
120,458,736 bytes maximum residency (190 sample(s))
2,865,072 bytes maximum slop
276 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6595 colls, 0 par 14.210s 14.228s 0.0022s 0.0065s
Gen 1 190 colls, 0 par 3.122s 3.125s 0.0164s 0.0308s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.590s ( 8.585s elapsed)
GC time 17.332s ( 17.354s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 25.922s ( 25.940s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,988,804,129 bytes per MUT second
Productivity 33.1% of total user, 33.1% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
34,036,245,656 bytes allocated in the heap
354,304 bytes copied during GC
123,780,712 bytes maximum residency (185 sample(s))
2,890,416 bytes maximum slop
269 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6601 colls, 0 par 13.873s 13.889s 0.0021s 0.0058s
Gen 1 185 colls, 0 par 3.077s 3.080s 0.0166s 0.0308s
INIT time 0.000s ( 0.000s elapsed)
MUT time 8.171s ( 8.164s elapsed)
GC time 16.950s ( 16.969s elapsed)
EXIT time 0.000s ( 0.007s elapsed)
Total time 25.121s ( 25.140s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,165,636,535 bytes per MUT second
Productivity 32.5% of total user, 32.5% of total elapsed
The original branch issue-20
at which the problem has been spotted:
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-8.10.7/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
92,986,374,864 bytes allocated in the heap
40,901,420,272 bytes copied during GC
123,422,176 bytes maximum residency (225 sample(s))
2,875,704 bytes maximum slop
272 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 82454 colls, 0 par 12.759s 12.787s 0.0002s 0.0015s
Gen 1 225 colls, 0 par 4.744s 4.749s 0.0211s 0.0323s
INIT time 0.000s ( 0.000s elapsed)
MUT time 28.328s ( 28.341s elapsed)
GC time 17.503s ( 17.536s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 45.830s ( 45.877s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,282,546,926 bytes per MUT second
Productivity 61.8% of total user, 61.8% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
93,000,436,368 bytes allocated in the heap
40,616,029,992 bytes copied during GC
123,421,896 bytes maximum residency (229 sample(s))
2,879,376 bytes maximum slop
272 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 82450 colls, 0 par 12.613s 12.648s 0.0002s 0.0014s
Gen 1 229 colls, 0 par 4.547s 4.555s 0.0199s 0.0332s
INIT time 0.000s ( 0.000s elapsed)
MUT time 30.486s ( 30.526s elapsed)
GC time 17.160s ( 17.203s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 47.645s ( 47.729s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,050,631,792 bytes per MUT second
Productivity 64.0% of total user, 64.0% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
106,033,830,776 bytes allocated in the heap
2,666,720 bytes copied during GC
123,990,232 bytes maximum residency (224 sample(s))
2,902,184 bytes maximum slop
285 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 23557 colls, 0 par 14.373s 14.389s 0.0006s 0.0026s
Gen 1 224 colls, 0 par 3.833s 3.836s 0.0171s 0.0315s
INIT time 0.000s ( 0.000s elapsed)
MUT time 35.800s ( 35.812s elapsed)
GC time 18.207s ( 18.225s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 54.007s ( 54.037s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,961,877,948 bytes per MUT second
Productivity 66.3% of total user, 66.3% of total elapsed
and branch issue-20-without-plugins
tweaked only just enough to run HEAD on it (!7847 leads to 20% slowdown here)
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
88,523,940,792 bytes allocated in the heap
808,552 bytes copied during GC
118,684,856 bytes maximum residency (213 sample(s))
2,864,776 bytes maximum slop
282 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 19574 colls, 0 par 14.921s 14.963s 0.0008s 0.0034s
Gen 1 213 colls, 0 par 3.645s 3.654s 0.0172s 0.0331s
INIT time 0.000s ( 0.000s elapsed)
MUT time 30.808s ( 30.826s elapsed)
GC time 18.566s ( 18.616s elapsed)
EXIT time 0.000s ( 0.008s elapsed)
Total time 49.374s ( 49.450s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,873,446,619 bytes per MUT second
Productivity 62.4% of total user, 62.3% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
88,571,121,880 bytes allocated in the heap
808,576 bytes copied during GC
123,240,224 bytes maximum residency (209 sample(s))
2,900,480 bytes maximum slop
288 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 19579 colls, 0 par 15.005s 15.028s 0.0008s 0.0043s
Gen 1 209 colls, 0 par 3.916s 3.919s 0.0188s 0.0334s
INIT time 0.000s ( 0.000s elapsed)
MUT time 39.076s ( 39.060s elapsed)
GC time 18.921s ( 18.947s elapsed)
EXIT time 0.000s ( 0.003s elapsed)
Total time 57.997s ( 58.010s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,266,648,087 bytes per MUT second
Productivity 67.4% of total user, 67.3% of total elapsed
The original branch extra-pragmas-BENCH2
mentioned early in his ticket that adds some likely and unlikely extra SPECIALIZE pragmas and proxies compared to branch issue-20
:
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-8.10.7/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
62,690,403,808 bytes allocated in the heap
39,888,985,560 bytes copied during GC
123,908,360 bytes maximum residency (233 sample(s))
2,916,184 bytes maximum slop
278 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 53938 colls, 0 par 11.944s 11.956s 0.0002s 0.0017s
Gen 1 233 colls, 0 par 4.230s 4.232s 0.0182s 0.0323s
INIT time 0.000s ( 0.000s elapsed)
MUT time 16.421s ( 16.418s elapsed)
GC time 16.174s ( 16.188s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 32.595s ( 32.607s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,817,733,452 bytes per MUT second
Productivity 50.4% of total user, 50.4% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
73,309,205,216 bytes allocated in the heap
40,590,756,360 bytes copied during GC
124,398,448 bytes maximum residency (230 sample(s))
2,881,904 bytes maximum slop
275 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 63941 colls, 0 par 12.315s 12.332s 0.0002s 0.0016s
Gen 1 230 colls, 0 par 4.667s 4.670s 0.0203s 0.0329s
INIT time 0.000s ( 0.000s elapsed)
MUT time 18.501s ( 18.499s elapsed)
GC time 16.982s ( 17.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 35.483s ( 35.500s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,962,535,916 bytes per MUT second
Productivity 52.1% of total user, 52.1% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
105,856,182,136 bytes allocated in the heap
2,633,728 bytes copied during GC
124,410,816 bytes maximum residency (220 sample(s))
2,920,704 bytes maximum slop
293 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 23559 colls, 0 par 14.657s 14.673s 0.0006s 0.0035s
Gen 1 220 colls, 0 par 4.069s 4.072s 0.0185s 0.0306s
INIT time 0.000s ( 0.000s elapsed)
MUT time 36.184s ( 36.197s elapsed)
GC time 18.727s ( 18.745s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 54.911s ( 54.942s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,925,461,743 bytes per MUT second
Productivity 65.9% of total user, 65.9% of total elapsed
Branch extra-pragmas-BENCH2-HEAD
that is similar to extra-pragmas-BENCH2
, but has even more random proxies backing up SPECIALIZE, just in case, and the plugins gutted out to be compilable with HEAD:
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-8.10.7/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
58,827,983,808 bytes allocated in the heap
38,493,956,792 bytes copied during GC
124,405,240 bytes maximum residency (227 sample(s))
2,894,304 bytes maximum slop
285 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 49944 colls, 0 par 12.086s 12.104s 0.0002s 0.0017s
Gen 1 227 colls, 0 par 3.613s 3.617s 0.0159s 0.0310s
INIT time 0.000s ( 0.000s elapsed)
MUT time 15.403s ( 15.411s elapsed)
GC time 15.700s ( 15.722s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 31.103s ( 31.132s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,819,164,921 bytes per MUT second
Productivity 49.5% of total user, 49.5% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.0.2/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
69,478,029,216 bytes allocated in the heap
40,785,590,224 bytes copied during GC
124,407,880 bytes maximum residency (228 sample(s))
2,881,736 bytes maximum slop
274 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 60443 colls, 0 par 12.557s 12.573s 0.0002s 0.0016s
Gen 1 228 colls, 0 par 4.952s 4.955s 0.0217s 0.0329s
INIT time 0.000s ( 0.000s elapsed)
MUT time 17.392s ( 17.393s elapsed)
GC time 17.510s ( 17.528s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 34.902s ( 34.921s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,994,815,350 bytes per MUT second
Productivity 49.8% of total user, 49.8% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.2.3/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
58,764,870,096 bytes allocated in the heap
574,432 bytes copied during GC
124,410,472 bytes maximum residency (176 sample(s))
2,879,632 bytes maximum slop
295 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 12104 colls, 0 par 14.253s 14.264s 0.0012s 0.0053s
Gen 1 176 colls, 0 par 4.956s 4.959s 0.0282s 0.0337s
INIT time 0.000s ( 0.000s elapsed)
MUT time 18.668s ( 18.674s elapsed)
GC time 19.209s ( 19.223s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 37.877s ( 37.897s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,147,871,898 bytes per MUT second
Productivity 49.3% of total user, 49.3% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220606/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
54,361,525,736 bytes allocated in the heap
516,920 bytes copied during GC
119,937,176 bytes maximum residency (176 sample(s))
2,799,864 bytes maximum slop
280 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 11110 colls, 0 par 13.856s 13.878s 0.0012s 0.0081s
Gen 1 176 colls, 0 par 4.657s 4.663s 0.0265s 0.0333s
INIT time 0.000s ( 0.000s elapsed)
MUT time 16.279s ( 16.278s elapsed)
GC time 18.513s ( 18.541s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 34.792s ( 34.820s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,339,439,449 bytes per MUT second
Productivity 46.8% of total user, 46.7% of total elapsed
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
54,344,406,608 bytes allocated in the heap
516,928 bytes copied during GC
119,945,280 bytes maximum residency (176 sample(s))
2,770,872 bytes maximum slop
280 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 11111 colls, 0 par 13.697s 13.721s 0.0012s 0.0055s
Gen 1 176 colls, 0 par 4.648s 4.655s 0.0264s 0.0332s
INIT time 0.000s ( 0.000s elapsed)
MUT time 19.119s ( 19.127s elapsed)
GC time 18.344s ( 18.375s elapsed)
EXIT time 0.000s ( 0.007s elapsed)
Total time 37.463s ( 37.510s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,842,471,674 bytes per MUT second
Productivity 51.0% of total user, 51.0% of total elapsed
The results are inconclusive. On one hand, it seem the workaround from ghc-report-specialize
was not enough for the newer code that has more complex types. The performance improved in extra-pragmas-BENCH2
thanks to adding all likely and unlikely SPECIALIZE pragmas and improved even more (especially in GHC 9.2.3) in extra-pragmas-BENCH2-HEAD
by adding some extra proxies that back the pragmas. But still the performance does not match between the code versions that essentially differ only in typing and also between GHC versions.
I conjecture #14 is responsible for all the performance regained via workarounds and possibly for all or part of the remaining performance gap. I'm just not diligent enough to track down all pragmas that are needed as a workaround, so I might have missed some. Also GHC accepts pragmas that are void, because the types given do not suffice to resolve all type variables (which is why adding proxies that carry the valuation of the remaining type variables improves performance). Also, proxies are not possible in some operations, e.g., Num
operations, so manual inlining and then specialization (and proxies) is needed and I did some of that, but patience failed me.
I think we should wait until #14 is resolved.
I've retested BENCH2 on branch extra-pragmas-BENCH2-HEAD with
{-# INLINABLE sumTrainableInputs #-}
{-# INLINABLE sumConstantData #-}
instead of SPECIALIZE
for the same functions. Memory behaviour is exactly the same, but runtime seems a bit faster.
~/r/horde-ad$ /home/mikolaj/r/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.5.20220608/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist -n 1 -m prefix "2-hidden-layer MNIST nn with samples: 500/500" +RTS -s
benchmarking 2-hidden-layer MNIST nn with samples: 500/500|150 s469160 v0 m0=469160
54,344,406,608 bytes allocated in the heap
516,936 bytes copied during GC
119,945,280 bytes maximum residency (176 sample(s))
2,770,872 bytes maximum slop
280 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 11111 colls, 0 par 13.856s 13.868s 0.0012s 0.0065s
Gen 1 176 colls, 0 par 4.657s 4.660s 0.0265s 0.0332s
INIT time 0.000s ( 0.000s elapsed)
MUT time 17.554s ( 17.536s elapsed)
GC time 18.513s ( 18.528s elapsed)
EXIT time 0.000s ( 0.005s elapsed)
Total time 36.067s ( 36.070s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,095,861,693 bytes per MUT second
Productivity 48.7% of total user, 48.6% of total elapsed
So this doesn't seem the magic bullet to align the performance of the two code snapshots (that probably differ only in typing), though I haven't tried all combinations. It may be worthwhile to at least try changing every SPECIALIZE
to INLINABLE
and test this straightforward variant, Additionally doing that for testMnist0
doesn't change anything, but there's a lot more SPECIALIZE
on that branch and they matter for performance. [Edit: testMnist0
is not called in this benchmark, but doing that for nnMnistLoss0
in a different branch doesn't seem to affect performance either.]
I've recorded three ticky profiles of the benchmark (BENCH2) with HEAD from 20220606 and https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 applied. They are with branches https://github.com/Mikolaj/horde-ad/tree/ghc-report-specialize, https://github.com/Mikolaj/horde-ad/tree/issue-20-without-plugins and https://github.com/Mikolaj/horde-ad/tree/extra-pragmas-BENCH2-HEAD as described above. The files are ghc-report-specialize.ticky.txt, issue-20-without-plugins.ticky.txt and extra-pragmas-BENCH2-HEAD.ticky.txt. The extra flag when compiling was --ghc-option=-ticky
and when running was +RTS -rbranch_name.ticky.txt
.
The results are quite striking: some numbers repeat exactly, because it's the same benchmark and the code (probably) differs only in types and in some extra components not used in the benchmark. Other numbers are completely different, but I don't know what they imply. E.g., of the tails calls, 100%, 88% and 60% have been to known functions, respectively, with very different absolute numbers. Returns "from entering a new constructor" are 95%, 50%, 0%, again with different absolute numbers. There are 500 times more updates in the worst case (the middle one).
In general, Delta.eval0
and and the functions it calls seems to be the culprit. The lines for that function and it's top-level host evalBindings
are, respectively
818525000 4365472000 0 2 {"type":"entCntr","subTy":"fun","fvs_c":4,"fvs":">MPi","args $weval0{v skDS} (BenchMnistTools) (fun) in rigR
500 10704000 0 8 {"type":"entCntr","subTy":"fun","fvs_c":0,"fvs":"","args":"I BenchMnistTools.$w$s$wevalBindings1{v rigR} (fun)
818525000 15297604000 0 2 {"type":"entCntr","subTy":"fun","fvs_c":17,"fvs":"+>IIIII.+S eval0{v sOOK} (HordeAd.Internal.Delta) (fun) in rOgs
500 11204000 0 9 {"type":"entCntr","subTy":"fun","fvs_c":0,"fvs":"","args":"+ HordeAd.Internal.Delta.$wevalBindings{v rOgs} (fun)
818525000 15297604000 0 2 {"type":"entCntr","subTy":"fun","fvs_c":17,"fvs":"+>IIIII.+S eval0{v sUin} (HordeAd.Internal.Delta) (fun) in rRSA
500 11204000 0 9 {"type":"entCntr","subTy":"fun","fvs_c":0,"fvs":"","args":"+ HordeAd.Internal.Delta.$wevalBindings{v rRSA} (fun)
If I'm guessing correctly, in the fast case, the eval0
function was transformed with a worker/wrapper and somehow inlined in BenchMnistTools
, unlike in the bad case and the desperately repaired bad case. The top level function evalBindings
looks differently, as well, though performs similarly. Is it just that the function got bigger (it now also handles the shaped tensors, though it's dead code in the benchmark) and doesn't inline any more? Or is worker/wrapper transformation not triggered at all for some reason? I think we'd need to look at Core of BenchMnistTools
, but it would help to understand the generated function names and other info we are likely to see.
This is fixed on GHC 9.6.1, except perhaps for <10% discrepancies that have their GHC tickets open. Closing.
In short, while we couldn't monitor performance, because the code only worked with 9.2.2 (#13) and 9.2.2 (and HEAD) had (has) degraded and unpredictable performance (#14), our code got much slower. Neither of the three kinds of heap profiling (-h, -p, -evenlog) show much and bisecting git history to pinpoint the offending changes is impossible, because the code didn't work with good GHC versions.
From branch
ghc-report-specialize
to today's master (snapshot on branchissue-20
) our BENCH2 test from #14 got twice slower and with three times more allocation on GHC 8.10.7 and 9.0.2, surpassing the previous slowdown in GHC 9.2.2, which got somewhat slower and more bloated still. Test BENCH1 slowdown, meanwhile, which was previously worked around with spurious SPECIALIZE pragmas, is not worked around on GHC 9.0.2 (only) any more, though in a less pronounced way, with only 4 times slowdown and 50 times higher allocation.This time. HEAD can't be used for master branch, because the type-level helper plugins we use don't compile even with head.hackage.
I do hope all this gets fixed once #14 is fixed and, ideally, the fix gets backported to 9.2.3. But I wonder how to proceed till then. Any work on optimizations we discussed on meetings during last two weeks is pointless when they can't be measured. Any other work lengthens the trail of commits that need to be benchmarked once we have the situation under control, to verify their performance impact and undo or modify some design decisions (if the impact is significant) and then rebase any further work on that.
I tried some extra proxies and SPECIALIZE pragmas in https://github.com/Mikolaj/horde-ad/tree/extra-pragmas-BENCH2 but to no avail for BENCH1 and with limited success for BENCH2 (and no success at all in 9.2.2). How else one can work around lack of specialization (most likely due to worker-wrapper, but disabling worker-wrapper kills performance almost as badly, see #14)?
Edit: TODO after all unreliable performance issues are fixed: track down and remove any Proxy that was introduced to force specialization via pragma or, if any, via just supplying a concrete type.