I tried a few variations of compiling the code from your kumite. Looks like we can speed up compilation time a lot! I don't think the increase in memory usage will be a problem.
Here are some profiling outputs like in the blog. I haven't compared the times between `-O2` and `-O0`; the numbers below are all for `-O2`.
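(These are the `+RTS -s` summaries from the compiler's own RTS; roughly this kind of invocation produces them. The file name below is a placeholder, not the actual runner command.)

```sh
# Compile with -O2 and ask GHC's own runtime to print GC statistics.
ghc -O2 Solution.hs +RTS -s -RTS
```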
**-N1 (currently deployed; no parallel)**

**`+RTS -s -RTS` (no additional flags)**

```
   5,776,027,720 bytes allocated in the heap
   2,038,075,496 bytes copied during GC
     206,386,808 bytes maximum residency (14 sample(s))
       1,730,952 bytes maximum slop
             512 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       375 colls,     0 par    3.370s   3.579s     0.0095s    0.2659s
  Gen  1        14 colls,     0 par    0.008s   0.009s     0.0006s    0.0013s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.252s  (  5.462s elapsed)
  GC      time    3.378s  (  3.587s elapsed)
  EXIT    time    0.019s  (  0.051s elapsed)
  Total   time    7.650s  (  9.101s elapsed)

  Alloc rate    1,358,275,309 bytes per MUT second

  Productivity  55.8% of total user, 60.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
```

**`+RTS -A128m -n2m -s -RTS` (garbage collection configured)**

```
   5,770,371,552 bytes allocated in the heap
   1,543,358,640 bytes copied during GC
     207,217,936 bytes maximum residency (6 sample(s))
       1,538,800 bytes maximum slop
             628 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        37 colls,     0 par    2.292s   2.424s     0.0655s    0.2683s
  Gen  1         6 colls,     0 par    0.217s   0.227s     0.0378s    0.0790s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    4.176s  (  5.333s elapsed)
  GC      time    2.508s  (  2.651s elapsed)
  EXIT    time   -0.013s  ( -0.004s elapsed)
  Total   time    6.671s  (  7.981s elapsed)

  Alloc rate    1,381,940,884 bytes per MUT second

  Productivity  62.4% of total user, 66.8% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
```

**-N2**

Using `-j2` instead of `-j` because `-j` uses the number of physical CPU cores, and this can be incorrect inside the container.
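As a quick sanity check of what the runtime actually detects inside the container (and hence roughly what a bare `-j` would pick up), a minimal Haskell sketch like the following could be run there; nothing in it is taken from the kata itself.

```haskell
import GHC.Conc (getNumCapabilities, getNumProcessors)

-- Print the processor count the RTS detects versus the number of
-- capabilities (-N) the program was started with.
main :: IO ()
main = do
  procs <- getNumProcessors
  caps  <- getNumCapabilities
  putStrLn $ "processors detected: " ++ show procs
  putStrLn $ "capabilities (-N):   " ++ show caps
```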
**`-j2 +RTS -s -RTS`**

```
   5,737,790,560 bytes allocated in the heap
   2,420,328,840 bytes copied during GC
     228,033,936 bytes maximum residency (15 sample(s))
       1,901,000 bytes maximum slop
             626 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       535 colls,   507 par    4.679s   3.424s     0.0064s    0.1504s
  Gen  1        15 colls,    10 par    0.014s   0.009s     0.0006s    0.0014s

  Parallel GC work balance: 25.89% (serial 0%, perfect 100%)

  TASKS: 7 (1 bound, 6 peak workers (6 total), using -N2)

  SPARKS: 13 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 13 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.646s  (  5.572s elapsed)
  GC      time    4.693s  (  3.433s elapsed)
  EXIT    time    0.016s  (  0.044s elapsed)
  Total   time    9.355s  (  9.050s elapsed)

  Alloc rate    1,235,117,572 bytes per MUT second

  Productivity  49.8% of total user, 62.1% of total elapsed

gc_alloc_block_sync: 14973
whitehole_spin: 0
gen[0].sync: 7
gen[1].sync: 203
```

**`-j2 +RTS -A128m -n2m -s -RTS`**

```
   6,196,760,744 bytes allocated in the heap
     919,157,400 bytes copied during GC
     188,179,584 bytes maximum residency (4 sample(s))
       1,534,848 bytes maximum slop
             683 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        20 colls,    20 par    1.334s   0.716s     0.0358s    0.1423s
  Gen  1         4 colls,     3 par    0.283s   0.152s     0.0380s    0.0556s

  Parallel GC work balance: 85.01% (serial 0%, perfect 100%)

  TASKS: 7 (1 bound, 6 peak workers (6 total), using -N2)

  SPARKS: 13 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 13 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    5.027s  (  5.932s elapsed)
  GC      time    1.617s  (  0.868s elapsed)
  EXIT    time    0.008s  (  0.020s elapsed)
  Total   time    6.653s  (  6.822s elapsed)

  Alloc rate    1,232,573,487 bytes per MUT second

  Productivity  75.7% of total user, 87.3% of total elapsed

gc_alloc_block_sync: 5471
whitehole_spin: 0
gen[0].sync: 9136
gen[1].sync: 1496
```

Cool. Can you check if the original kata (limited to 25 x 25) is passable with `-j2 +RTS -A128m -n2m -RTS` combined with one of `-O0`, `-O1` or `-O2`? If we can pass those tests, I believe it'll be enough for now. (Actually the foolproof way would be to go through every single kata that uses `GADTs` or `TemplateHaskell` out there, but I couldn't solve most of them yet.)

If it turns out otherwise, we can always return here and throw in `-A256m` or whatever ;-)
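For a local comparison outside the runner, something along these lines could work; the single file name and the use of plain `ghc` are assumptions, and `-fforce-recomp` just keeps GHC from skipping recompilation between runs.

```sh
# Hypothetical local check: each optimization level with the proposed GC flags.
for opt in -O0 -O1 -O2; do
  time ghc $opt -j2 -fforce-recomp Solution.hs +RTS -A128m -n2m -RTS
done
```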
25x25 passes using `-j2 +RTS -A128m -n2m -RTS` with `-O2` or `-O1`.
Deployed this. 25x25 passes, but it sometimes times out. So I guess it's now equivalent to the original.
I'm trying to migrate A + B = B + A to Haskell 8. An attempt is here in a kumite.
Turns out that the kata is a monster in both compilation time and running time. Here are several things I tried:
- Without optimization (no `-O2`), the base code barely passes up to 19.
- `-O2` on Preloaded improves it to 22 or 23 (see the sketch after this list).
- `-O2` on Code or Test either has no significant effect or makes the whole process slower.
- `-Rghc-timing` on Preloaded reveals that a lot of time is spent on garbage collection at compile time.
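For concreteness, a minimal sketch of applying `-O2` to just the Preloaded module via a pragma; this is one way to do it, and the module body is omitted here.

```haskell
{-# OPTIONS_GHC -O2 #-}
-- Per-module optimization: GHC accepts -O/-O2 in this pragma, but the
-- compiler's RTS flags (e.g. -A128m) cannot be set from here.
module Preloaded where
```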
The numbers (19, 22, 23) refer to the two lines at the bottom of Preloaded that set `i` and `j`. The original kata uses 25 for both `i` and `j`, and it is expected to barely pass within 12s.

In particular, garbage collection during compilation can be reduced via specific flags, but those flags aren't accepted in the `OPTIONS_GHC` pragma. Since CW uses stack, I guess we can pass them via stack's `ghc-options` option.
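If that route works, here's a sketch of what it might look like; whether the CW image actually reads a `stack.yaml` like this, and which target key applies, are assumptions on my part rather than something taken from the current setup.

```yaml
# stack.yaml (hypothetical): pass compiler RTS flags to every GHC invocation.
# -A128m enlarges the allocation area and -n2m splits it into 2 MB chunks;
# both tune garbage collection inside GHC itself during compilation.
ghc-options:
  "$everything": -j2 +RTS -A128m -n2m -RTS
```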