Haskell: Ideas to improve compilation time

Bubbler-4 commented 5 years ago

I'm trying to migrate A + B = B + A to Haskell 8. An attempt is here in a kumite.

Turns out that the kata is a monster in both compilation time and running time. Here are several things I tried:

Running as-is (without -O2), the base code barely passes up to 19.
Running with -O2 on Preloaded improves it to 22 or 23.
Running with -O2 on Code or Test either has no significant effect or makes the whole process slower.
Running with -Rghc-timing on Preloaded reveals that lots of time is spent on garbage collection in compile time.

The numbers (19, 22, 23) refer to the two lines at the bottom of Preloaded:

forM [0..23 :: Int] $ \i ->
  forM [0..23 :: Int] $ \j ->

The original kata uses 25 for both i and j, and it is expected to barely pass within 12s.

In particular, garbage collection during compilation can be reduced via specific flags, but the flags aren't accepted in OPTIONS_GHC pragma. Since CW uses stack, I guess we can pass them via stack's ghc-options option.

kazk commented 5 years ago

I tried few variations of compiling the code from your kumite. Looks like we can speed up compilation time a lot! I don't think the increase in memory usage will be a problem.

Here's some profiling outputs like in the blog. I haven't compared with the time between -O2 and -O0. The numbers below are for -O2.

`-N1` (currently deployed; no parallel)

+RTS -s -RTS (no additional flags)

   5,776,027,720 bytes allocated in the heap
   2,038,075,496 bytes copied during GC
     206,386,808 bytes maximum residency (14 sample(s))
       1,730,952 bytes maximum slop
             512 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       375 colls,     0 par    3.370s   3.579s     0.0095s    0.2659s
  Gen  1        14 colls,     0 par    0.008s   0.009s     0.0006s    0.0013s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.252s  (  5.462s elapsed)
  GC      time    3.378s  (  3.587s elapsed)
  EXIT    time    0.019s  (  0.051s elapsed)
  Total   time    7.650s  (  9.101s elapsed)

  Alloc rate    1,358,275,309 bytes per MUT second

  Productivity  55.8% of total user, 60.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

+RTS -A128m -n2m -s -RTS (garbage collection configured)

   5,770,371,552 bytes allocated in the heap
   1,543,358,640 bytes copied during GC
     207,217,936 bytes maximum residency (6 sample(s))
       1,538,800 bytes maximum slop
             628 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        37 colls,     0 par    2.292s   2.424s     0.0655s    0.2683s
  Gen  1         6 colls,     0 par    0.217s   0.227s     0.0378s    0.0790s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    4.176s  (  5.333s elapsed)
  GC      time    2.508s  (  2.651s elapsed)
  EXIT    time   -0.013s  ( -0.004s elapsed)
  Total   time    6.671s  (  7.981s elapsed)

  Alloc rate    1,381,940,884 bytes per MUT second

  Productivity  62.4% of total user, 66.8% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

`-N2`

Using -j2 instead of -j because -j uses the number of physical cpu cores and this can be incorrect inside the container.

-j2 +RTS -s -RTS

   5,737,790,560 bytes allocated in the heap
   2,420,328,840 bytes copied during GC
     228,033,936 bytes maximum residency (15 sample(s))
       1,901,000 bytes maximum slop
             626 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       535 colls,   507 par    4.679s   3.424s     0.0064s    0.1504s
  Gen  1        15 colls,    10 par    0.014s   0.009s     0.0006s    0.0014s

  Parallel GC work balance: 25.89% (serial 0%, perfect 100%)

  TASKS: 7 (1 bound, 6 peak workers (6 total), using -N2)

  SPARKS: 13 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 13 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.646s  (  5.572s elapsed)
  GC      time    4.693s  (  3.433s elapsed)
  EXIT    time    0.016s  (  0.044s elapsed)
  Total   time    9.355s  (  9.050s elapsed)

  Alloc rate    1,235,117,572 bytes per MUT second

  Productivity  49.8% of total user, 62.1% of total elapsed

gc_alloc_block_sync: 14973
whitehole_spin: 0
gen[0].sync: 7
gen[1].sync: 203

-j2 +RTS -A128m -n2m -s -RTS

   6,196,760,744 bytes allocated in the heap
     919,157,400 bytes copied during GC
     188,179,584 bytes maximum residency (4 sample(s))
       1,534,848 bytes maximum slop
             683 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        20 colls,    20 par    1.334s   0.716s     0.0358s    0.1423s
  Gen  1         4 colls,     3 par    0.283s   0.152s     0.0380s    0.0556s

  Parallel GC work balance: 85.01% (serial 0%, perfect 100%)

  TASKS: 7 (1 bound, 6 peak workers (6 total), using -N2)

  SPARKS: 13 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 13 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    5.027s  (  5.932s elapsed)
  GC      time    1.617s  (  0.868s elapsed)
  EXIT    time    0.008s  (  0.020s elapsed)
  Total   time    6.653s  (  6.822s elapsed)

  Alloc rate    1,232,573,487 bytes per MUT second

  Productivity  75.7% of total user, 87.3% of total elapsed

gc_alloc_block_sync: 5471
whitehole_spin: 0
gen[0].sync: 9136
gen[1].sync: 1496

Bubbler-4 commented 5 years ago

Cool. Can you check if the original kata (limit to 25 x 25) is passable with -j2 +RTS -A128m -n2m -RTS combined with one of -O0, -O1 or -O2? If we can pass that tests, I believe it'll be enough for now. (Actually the foolproof way would be to go through every single kata that uses GADTs or TemplateHaskell out there, but I couldn't solve most of them yet.)

If it turns out otherwise, we can always return here and throw in -A256m or whatever ;-)

kazk commented 5 years ago

25x25 passes using -j2 +RTS -A128m -n2m -RTS with -O2 or -O1

kazk commented 5 years ago

Deployed this. 25x25 passes, but sometimes it timeouts. So I guess it's now equivalent to the original.

codewars / codewars-runner-cli