:prove command takes a long time and generates too much garbage

This issue was originally reported as a secondary issue in #189. The base cause seems to be LeventErkok/sbv#216. This ticket is intended to track the issue on the Cryptol side.

Consider the following .icry

:s warnDefaulting=off
:l docs/ProgrammingCryptol/aes/AES.cry
:s smtfile=aes.smt
:s prover=offline
:prove AESCorrect

This takes about 4 minutes to run on my machine, with ~95% of that time being spent in the garbage collector.

  17,622,211,064 bytes allocated in the heap
   5,294,006,176 bytes copied during GC
     834,371,400 bytes maximum residency (19 sample(s))
       8,675,560 bytes maximum slop
            1639 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     33428 colls,     0 par   222.156s  229.522s     0.0069s    0.0560s
  Gen  1        19 colls,     0 par    3.197s   3.952s     0.2080s    1.0058s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time   13.250s  ( 13.998s elapsed)
  GC      time  225.353s  (233.474s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.018s  (  0.144s elapsed)
  Total   time  238.623s  (247.617s elapsed)

  Alloc rate    1,329,990,840 bytes per MUT second

  Productivity   5.6% of total user, 5.4% of total elapsed

The time profile implicates the SBV SMTLib pretty-printing code.

    Fri Jul 15 13:46 2016 Time and Allocation Profiling Report  (Final)

       cryptol +RTS -N1 -p -s -RTS -b aes.icry

    total time  =       13.18 secs   (13182 ticks @ 1000 us, 1 processor)
    total alloc = 10,807,675,152 bytes  (excludes profiling overheads)

COST CENTRE             MODULE                         %time %alloc

uncacheGen              Data.SBV.BitVectors.Symbolic    13.1    5.7
compare                 Data.SBV.BitVectors.Symbolic     8.7    0.0
show                    Data.SBV.BitVectors.Symbolic     7.5   12.9
svSymbolicMerge.c       Data.SBV.BitVectors.Operations   7.4    4.9
finiteSeqMap.\          Cryptol.Eval.Value               5.1    7.7
cvt.mkLet               Data.SBV.SMT.SMTLib2             4.7   13.0
>>=.\                   Cryptol.REPL.Monad               4.7    0.1
compare                 Data.SBV.BitVectors.Kind         4.0    0.5
compileToSMTLib         Data.SBV.Provers.Prover          3.7   12.0
compare                 Data.SBV.BitVectors.Symbolic     3.6    0.0
mkSymOpSC               Data.SBV.BitVectors.Operations   2.9    2.1
cvtExp.sh               Data.SBV.SMT.SMTLib2             2.5    5.5
compare                 Data.SBV.BitVectors.Symbolic     2.4    0.0
rnf                     Data.SBV.BitVectors.Symbolic     2.1    0.0
cvtExp                  Data.SBV.SMT.SMTLib2             2.0    7.1
show                    Data.SBV.BitVectors.Symbolic     1.8    4.5
svToSW                  Data.SBV.BitVectors.Symbolic     1.8    0.1
mkSymOp1SC              Data.SBV.BitVectors.Operations   1.6    1.3
kindRank                Data.SBV.BitVectors.Kind         1.1    1.3
svSymbolicMerge         Data.SBV.BitVectors.Operations   0.8    2.1
cvtExp.lift2            Data.SBV.SMT.SMTLib2             0.7    1.8
cvt.pre                 Data.SBV.SMT.SMTLib2             0.7    1.8
cvt.letAlign            Data.SBV.SMT.SMTLib2             0.4    1.1
extendSValPathCondition Data.SBV.BitVectors.Symbolic     0.4    2.4

Heap profiles corroborate the diagnosis that the String-based pretty printer generates a lot of garbage.

@robdockins Based on this, should we be adding some INLINE directives for uncacheGen, uncache and uncacheAI in Data/SBV/Symbolic.hs?

It'd be great if you can do a profiling run after inlining those calls.

Also, what function does the compare call-center refer to in that list? (The second item in the list.) Can we do something about that?

The plot thickens.

Adding inline declarations on the uncache functions make no perceptible difference.
It is very hard to figure out which compare is being called, as those cost centers are auto-generated via deriving Ord, and there doesn't seem to be any way to give them different names.
Changing the SBV pretty printer to use text builders instead of strings made considerably less impact than I'd hoped. The heap profile is now dominated by various uncache cost centers during the simulation phase, instead of being dominated by printing centers in the printing phase. However, the allocation rate is only a little improved, the mutator productivity remains stubbornly between 5-6%, and runtimes are almost unchanged.

GC stats and time profile with these changes:

  16,095,584,296 bytes allocated in the heap
  53,485,495,392 bytes copied during GC
     618,641,872 bytes maximum residency (172 sample(s))
      10,697,568 bytes maximum slop
            1213 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     30030 colls,     0 par   173.037s  179.486s     0.0060s    0.0565s
  Gen  1       172 colls,     0 par   112.809s  117.104s     0.6808s    1.9822s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.003s elapsed)
  MUT     time   15.275s  ( 44.343s elapsed)
  GC      time  258.639s  (268.538s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time   27.207s  ( 28.052s elapsed)
  EXIT    time    0.007s  (  0.015s elapsed)
  Total   time  301.130s  (312.899s elapsed)

  Alloc rate    1,053,735,088 bytes per MUT second

  Productivity   5.1% of total user, 4.9% of total elapsed

        Mon Jul 18 18:28 2016 Time and Allocation Profiling Report  (Final)

           cryptol +RTS -N1 -p -s -hc -RTS -b aes.icry

        total time  =       14.06 secs   (14058 ticks @ 1000 us, 1 processor)
        total alloc = 9,881,047,872 bytes  (excludes profiling overheads)

COST CENTRE             MODULE                         %time %alloc

uncacheGen              Data.SBV.BitVectors.Symbolic    15.0    6.2
compare                 Data.SBV.BitVectors.Symbolic     9.5    0.0
svSymbolicMerge.c       Data.SBV.BitVectors.Operations   8.5    5.4
writeTerm               Data.SBV.Utils.Term              6.6   15.8
mappend                 Data.SBV.Utils.Term              6.1    5.8
finiteSeqMap.\          Cryptol.Eval.Value               4.8    8.4
compare                 Data.SBV.BitVectors.Symbolic     3.9    8.3
mkSymOpSC               Data.SBV.BitVectors.Operations   3.8    2.4
compare                 Data.SBV.BitVectors.Kind         3.8    0.6
cvtExp.sh               Data.SBV.SMT.SMTLib2             2.9    3.0
compare                 Data.SBV.BitVectors.Symbolic     2.7    0.0
cvt.mkLet               Data.SBV.SMT.SMTLib2             2.7    4.8
svToSW                  Data.SBV.BitVectors.Symbolic     2.1    0.1
cvtExp                  Data.SBV.SMT.SMTLib2             2.0    7.7
mkSymOp1SC              Data.SBV.BitVectors.Operations   2.0    1.4
show                    Data.SBV.BitVectors.Symbolic     1.7    4.9
kindRank                Data.SBV.BitVectors.Kind         1.5    1.5
cvtExp.lift2            Data.SBV.SMT.SMTLib2             0.9    1.0
==                      Data.SBV.BitVectors.Symbolic     0.8    1.1
svSymbolicMerge         Data.SBV.BitVectors.Operations   0.7    2.3
extendSValPathCondition Data.SBV.BitVectors.Symbolic     0.5    2.6
cvt.pre                 Data.SBV.SMT.SMTLib2             0.4    1.5

It's not clear how to proceed, but I'm no longer convinced that SBV's pretty-printer is the true source of the problem here.

I'll rerun some tests after building without profiling enabled to see if profiling is skewing things, but I'm not hopeful.

@robdockins Your observation is inline with my experience with replacing the pretty printer.

I actually had a branch for playing with the non-string pretty printer, sorry forgot to mention earlier: https://github.com/LeventErkok/sbv/tree/no-string-backend

But, as you mentioned, I saw no improvement moving from String to anything more structured; I'm guessing the creation/printing of the intermediate data-structure (no matter how efficient) still dominates there.

OK, final data point here for now, as I need to move on to other things.

When compiled without profiling, changing to use text builder from strings produces significant, but not mind-blowing, reductions in runtime (~35%) and allocations (~25%).

With strings:

  11,503,920,136 bytes allocated in the heap
   3,245,316,680 bytes copied during GC
     552,183,072 bytes maximum residency (17 sample(s))
       5,901,664 bytes maximum slop
            1084 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22356 colls,     0 par   144.236s  146.950s     0.0066s    0.0314s
  Gen  1        17 colls,     0 par    2.653s   3.099s     0.1823s    0.9022s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    6.533s  (  7.052s elapsed)
  GC      time  146.889s  (150.050s elapsed)
  EXIT    time    0.011s  (  0.114s elapsed)
  Total   time  153.434s  (157.216s elapsed)

  Alloc rate    1,760,966,452 bytes per MUT second

  Productivity   4.3% of total user, 4.2% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    2m37.235s
user    2m33.606s
sys 0m2.087s

With text builder:

   8,738,788,488 bytes allocated in the heap
   3,104,975,312 bytes copied during GC
     365,954,984 bytes maximum residency (19 sample(s))
      25,390,560 bytes maximum slop
             873 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     16035 colls,     0 par   90.790s  91.943s     0.0057s    0.0202s
  Gen  1        19 colls,     0 par    3.638s   4.077s     0.2146s    1.1450s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.008s elapsed)
  MUT     time    5.362s  (  5.922s elapsed)
  GC      time   94.428s  ( 96.020s elapsed)
  EXIT    time    0.007s  (  0.041s elapsed)
  Total   time   99.799s  (101.990s elapsed)

  Alloc rate    1,629,785,966 bytes per MUT second

  Productivity   5.4% of total user, 5.3% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    1m42.022s
user    1m39.972s
sys 0m1.449s

The fix for this turns out to be surprisingly simple (2e15d4f443b8f950ef003920f9bedd124de6b72d). Increasing the size of the Gen0 nursery to 64m (up from the 512k default) brings mutator productivity up to around 50%, and reduces the runtime on this example to less than 12 seconds. The various space-improving changes to SBV have a negligible impact once this is done.

Continuing to increase the nursery size provides additional, but diminishing, improvements up to about 512m, where it seems to level off.

Here are GC stats and runtime after modifying the nursery size to 64m:

  11,472,840,664 bytes allocated in the heap
   2,609,647,512 bytes copied during GC
     447,766,712 bytes maximum residency (9 sample(s))
       3,595,512 bytes maximum slop
             934 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       161 colls,     0 par    3.735s   3.845s     0.0239s    0.1209s
  Gen  1         9 colls,     0 par    1.805s   2.101s     0.2334s    0.5507s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.002s elapsed)
  MUT     time    5.479s  (  5.858s elapsed)
  GC      time    5.540s  (  5.946s elapsed)
  EXIT    time    0.012s  (  0.066s elapsed)
  Total   time   11.033s  ( 11.871s elapsed)

  Alloc rate    2,093,878,657 bytes per MUT second

  Productivity  49.8% of total user, 46.3% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    0m11.908s
user    0m11.202s
sys 0m0.707s

I think this is sufficient improvement to close this ticket.

GaloisInc / cryptol

:prove command takes a long time and generates too much garbage #354