robdockins closed this issue 8 years ago
@robdockins Based on this, should we be adding some INLINE directives for uncacheGen, uncache and uncacheAI in Data/SBV/Symbolic.hs?
It'd be great if you can do a profiling run after inlining those calls.
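For reference, this is roughly what attaching an INLINE directive looks like. The cache type and function body below are hypothetical stand-ins, not SBV's actual uncacheGen; only the pragma placement is the point.

```haskell
module Main (main) where

import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for a memoising "uncache" helper; the real
-- definitions in Data/SBV/Symbolic.hs differ.
type Cache a = IORef (Map.Map Int a)

-- The pragma asks GHC to inline the function at its call sites.
{-# INLINE uncacheSketch #-}
uncacheSketch :: Cache a -> Int -> IO a -> IO a
uncacheSketch cache key compute = do
  known <- readIORef cache
  case Map.lookup key known of
    Just v -> pure v -- cache hit: reuse the stored value
    Nothing -> do
      v <- compute -- cache miss: run the computation once
      modifyIORef' cache (Map.insert key v) -- remember it for later calls
      pure v

main :: IO ()
main = do
  cache <- newIORef Map.empty
  a <- uncacheSketch cache 1 (pure "computed")
  b <- uncacheSketch cache 1 (pure "recomputed")
  print (a, b) -- the second call is served from the cache
```

Whether inlining helps at all depends on what the optimiser was already doing, so a before/after profiling run is the only real test.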
Also, which function does the compare cost center refer to in that list? (The second item in the list.) Can we do something about that?
The plot thickens.
INLINE directives on the uncache functions make no perceptible difference.
I can't tell which compare is being called, as those cost centers are auto-generated via deriving Ord, and there doesn't seem to be any way to give them different names.
GC stats and time profile with these changes:
16,095,584,296 bytes allocated in the heap
53,485,495,392 bytes copied during GC
618,641,872 bytes maximum residency (172 sample(s))
10,697,568 bytes maximum slop
1213 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 30030 colls, 0 par 173.037s 179.486s 0.0060s 0.0565s
Gen 1 172 colls, 0 par 112.809s 117.104s 0.6808s 1.9822s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.003s elapsed)
MUT time 15.275s ( 44.343s elapsed)
GC time 258.639s (268.538s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 27.207s ( 28.052s elapsed)
EXIT time 0.007s ( 0.015s elapsed)
Total time 301.130s (312.899s elapsed)
Alloc rate 1,053,735,088 bytes per MUT second
Productivity 5.1% of total user, 4.9% of total elapsed
Mon Jul 18 18:28 2016 Time and Allocation Profiling Report (Final)
cryptol +RTS -N1 -p -s -hc -RTS -b aes.icry
total time = 14.06 secs (14058 ticks @ 1000 us, 1 processor)
total alloc = 9,881,047,872 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
uncacheGen Data.SBV.BitVectors.Symbolic 15.0 6.2
compare Data.SBV.BitVectors.Symbolic 9.5 0.0
svSymbolicMerge.c Data.SBV.BitVectors.Operations 8.5 5.4
writeTerm Data.SBV.Utils.Term 6.6 15.8
mappend Data.SBV.Utils.Term 6.1 5.8
finiteSeqMap.\ Cryptol.Eval.Value 4.8 8.4
compare Data.SBV.BitVectors.Symbolic 3.9 8.3
mkSymOpSC Data.SBV.BitVectors.Operations 3.8 2.4
compare Data.SBV.BitVectors.Kind 3.8 0.6
cvtExp.sh Data.SBV.SMT.SMTLib2 2.9 3.0
compare Data.SBV.BitVectors.Symbolic 2.7 0.0
cvt.mkLet Data.SBV.SMT.SMTLib2 2.7 4.8
svToSW Data.SBV.BitVectors.Symbolic 2.1 0.1
cvtExp Data.SBV.SMT.SMTLib2 2.0 7.7
mkSymOp1SC Data.SBV.BitVectors.Operations 2.0 1.4
show Data.SBV.BitVectors.Symbolic 1.7 4.9
kindRank Data.SBV.BitVectors.Kind 1.5 1.5
cvtExp.lift2 Data.SBV.SMT.SMTLib2 0.9 1.0
== Data.SBV.BitVectors.Symbolic 0.8 1.1
svSymbolicMerge Data.SBV.BitVectors.Operations 0.7 2.3
extendSValPathCondition Data.SBV.BitVectors.Symbolic 0.5 2.6
cvt.pre Data.SBV.SMT.SMTLib2 0.4 1.5
It's not clear how to proceed, but I'm no longer convinced that SBV's pretty-printer is the true source of the problem here.
I'll rerun some tests after building without profiling enabled to see if profiling is skewing things, but I'm not hopeful.
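One possible way to tell the compare cost centers apart, sketched here under assumptions, is to replace a derived Ord instance with a hand-written one that carries an explicit SCC annotation. The Kind type below is a toy with made-up constructors, not SBV's; the cost center name compare_Kind is likewise just an illustration.

```haskell
module Main (main) where

import Data.List (sort)

-- Toy type with hypothetical constructors (not SBV's Kind).
data Kind = KBool | KBounded Bool Int
  deriving (Eq, Show)

-- Hand-written instance so that a profiled build reports a distinct,
-- named cost center instead of a generic "compare".
instance Ord Kind where
  compare a b = {-# SCC "compare_Kind" #-} go a b
    where
      go KBool KBool = EQ
      go KBool _ = LT
      go _ KBool = GT
      -- Compare field by field, left to right, as deriving Ord would.
      go (KBounded s1 w1) (KBounded s2 w2) = compare s1 s2 <> compare w1 w2

main :: IO ()
main = print (sort [KBounded True 8, KBool, KBounded False 32])
```

The SCC pragma is ignored in non-profiled builds, so the only cost is maintaining the instance by hand.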
@robdockins Your observation is in line with my experience with replacing the pretty printer.
I actually had a branch for playing with the non-string pretty printer; sorry, I forgot to mention it earlier: https://github.com/LeventErkok/sbv/tree/no-string-backend
But, as you mentioned, I saw no improvement moving from String to anything more structured; I'm guessing the creation/printing of the intermediate data structure (no matter how efficient) still dominates there.
OK, final data point here for now, as I need to move on to other things.
When compiled without profiling, changing to use text builder from strings produces significant, but not mind-blowing, reductions in runtime (~35%) and allocations (~25%).
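To make the String versus text builder comparison concrete, here is a minimal sketch of the two styles. The declaration format and the function names are illustrative assumptions, not SBV's actual SMT-Lib output code.

```haskell
module Main (main) where

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B
import qualified Data.Text.Lazy.Builder.Int as B (decimal)

-- String version: built with (++), one list cell per character.
declString :: Int -> String
declString n = "(declare-fun s" ++ show n ++ " () Bool)"

-- Builder version: pieces are appended with (<>) and only turned
-- into text once, when the builder is finally run.
declBuilder :: Int -> B.Builder
declBuilder n =
  B.fromString "(declare-fun s" <> B.decimal n <> B.fromString " () Bool)"

-- Render many declarations, one per line, in a single pass.
renderAll :: [Int] -> TL.Text
renderAll = B.toLazyText . foldMap (\n -> declBuilder n <> B.singleton '\n')

main :: IO ()
main = do
  putStrLn (declString 0)
  putStr (TL.unpack (renderAll [0, 1]))
```

Both versions produce the same characters; the difference is only in how much intermediate structure is allocated along the way.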
With strings:
11,503,920,136 bytes allocated in the heap
3,245,316,680 bytes copied during GC
552,183,072 bytes maximum residency (17 sample(s))
5,901,664 bytes maximum slop
1084 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 22356 colls, 0 par 144.236s 146.950s 0.0066s 0.0314s
Gen 1 17 colls, 0 par 2.653s 3.099s 0.1823s 0.9022s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 6.533s ( 7.052s elapsed)
GC time 146.889s (150.050s elapsed)
EXIT time 0.011s ( 0.114s elapsed)
Total time 153.434s (157.216s elapsed)
Alloc rate 1,760,966,452 bytes per MUT second
Productivity 4.3% of total user, 4.2% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 2m37.235s
user 2m33.606s
sys 0m2.087s
With text builder:
8,738,788,488 bytes allocated in the heap
3,104,975,312 bytes copied during GC
365,954,984 bytes maximum residency (19 sample(s))
25,390,560 bytes maximum slop
873 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 16035 colls, 0 par 90.790s 91.943s 0.0057s 0.0202s
Gen 1 19 colls, 0 par 3.638s 4.077s 0.2146s 1.1450s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.008s elapsed)
MUT time 5.362s ( 5.922s elapsed)
GC time 94.428s ( 96.020s elapsed)
EXIT time 0.007s ( 0.041s elapsed)
Total time 99.799s (101.990s elapsed)
Alloc rate 1,629,785,966 bytes per MUT second
Productivity 5.4% of total user, 5.3% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 1m42.022s
user 1m39.972s
sys 0m1.449s
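The percentages quoted above can be checked directly from the two sets of stats. The small calculation below uses only figures taken from those runs and gives roughly 35% for runtime, 24% for allocations, and about 96% of the string-based run spent in GC.

```haskell
module Main (main) where

-- Percentage by which a measurement dropped between two runs.
percentReduction :: Double -> Double -> Double
percentReduction before after = 100 * (before - after) / before

-- One quantity as a percentage of another.
percentOf :: Double -> Double -> Double
percentOf part total = 100 * part / total

main :: IO ()
main = do
  -- Total time: 153.434s with strings, 99.799s with text builder.
  putStrLn ("runtime reduction %:    " ++ show (percentReduction 153.434 99.799))
  -- Bytes allocated in the heap for the same two runs.
  putStrLn ("allocation reduction %: " ++ show (percentReduction 11503920136 8738788488))
  -- GC time as a share of total time in the string-based run.
  putStrLn ("GC share of runtime %:  " ++ show (percentOf 146.889 153.434))
```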
The fix for this turns out to be surprisingly simple (2e15d4f443b8f950ef003920f9bedd124de6b72d). Increasing the size of the Gen0 nursery to 64m (up from the 512k default) brings mutator productivity up to around 50%, and reduces the runtime on this example to less than 12 seconds. The various space-improving changes to SBV have a negligible impact once this is done.
Continuing to increase the nursery size provides additional, but diminishing, improvements up to about 512m, where it seems to level off.
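For anyone wanting to confirm which nursery size a binary is actually running with, the setting can be read back through GHC.RTS.Flags. This is a sketch: the 4 KiB block size used for the conversion is an assumption about the RTS, and the program is a standalone check, not part of the referenced commit.

```haskell
module Main (main) where

import GHC.RTS.Flags (GCFlags (..), getGCFlags)

-- Assumption: the RTS reports the allocation area in blocks of 4 KiB.
blockBytes :: Integer
blockBytes = 4096

-- Convert the reported allocation area (the -A setting) to bytes.
nurseryBytes :: GCFlags -> Integer
nurseryBytes flags = fromIntegral (minAllocAreaSize flags) * blockBytes

main :: IO ()
main = do
  flags <- getGCFlags
  putStrLn ("nursery size in bytes: " ++ show (nurseryBytes flags))
```

The size itself is set with the RTS -A flag, either per run (+RTS -A64m -RTS, with the program linked using -rtsopts) or baked in at link time with -with-rtsopts=-A64m.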
Here are GC stats and runtime after modifying the nursery size to 64m:
11,472,840,664 bytes allocated in the heap
2,609,647,512 bytes copied during GC
447,766,712 bytes maximum residency (9 sample(s))
3,595,512 bytes maximum slop
934 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 3.735s 3.845s 0.0239s 0.1209s
Gen 1 9 colls, 0 par 1.805s 2.101s 0.2334s 0.5507s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.002s elapsed)
MUT time 5.479s ( 5.858s elapsed)
GC time 5.540s ( 5.946s elapsed)
EXIT time 0.012s ( 0.066s elapsed)
Total time 11.033s ( 11.871s elapsed)
Alloc rate 2,093,878,657 bytes per MUT second
Productivity 49.8% of total user, 46.3% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 0m11.908s
user 0m11.202s
sys 0m0.707s
I think this is sufficient improvement to close this ticket.
This issue was originally reported as a secondary issue in #189. The base cause seems to be LeventErkok/sbv#216. This ticket is intended to track the issue on the Cryptol side.
Consider the following .icry
This takes about 4 minutes to run on my machine, with ~95% of that time being spent in the garbage collector.
The time profile implicates the SBV SMTLib pretty-printing code.
Heap profiles corroborate the diagnosis that the String-based pretty printer generates a lot of garbage.