LeventErkok / sbv

SMT Based Verification in Haskell. Express properties about Haskell programs and automatically prove them using SMT solvers.
https://github.com/LeventErkok/sbv

Interaction performance overhead #727

Open johnbender opened 5 days ago

johnbender commented 5 days ago

Hello!

We're working with terms that produce a lot of top-level SMTLib definitions, and we've noticed that the interaction overhead with the solver appears to quickly dominate the solver's own run time.

As an example, the program below generates many terms at the top level, but when passed quant it ensures that the asserted term is wrapped in a universal quantifier, thereby forcing the individual conjuncts into a single interaction with the solver. The result is a significant performance improvement.

We spent some time looking through the source and the documentation but didn't spot an obvious way to tune the interactions, and we expect there are good reasons for the current approach. With that in mind, we were wondering whether an option to disable the call to lines in runSolver is in the offing (or even makes sense), or alternatively whether some other, likely much more intelligent, approach is possible.

Thanks very much!

(SBV 8.15/GHC 8.10.7)

import Data.SBV                                                                           
import System.Environment (getArgs)                                                       

generateTerm :: Bool -> Symbolic SBool                                                    
generateTerm quant = do                                                                   
  let conj = \x -> foldr (\i acc -> acc .&& x .== (i :: SWord16)) sTrue (take 10000 [1..])
  inner <- forAll ["x"] conj                                                              
  if quant                                                                                
  then forSome ["y"] (\y -> (y :: SBool) .== y .&& inner)                                 
  else return inner                                                                       

config :: Bool -> SMTConfig
config trns =
  z3 { transcript = if trns then Just "/tmp/sbv.minimal.out" else Nothing,                
       allowQuantifiedQueries = True }                                                    

main :: IO ()                                                                             
main = do                                                                                 
  args <- getArgs                                                                         
  let quant = elem "quant" args                                                           
  let trans = elem "transcript" args                                                      
  proveWith (config trans) (generateTerm quant) >>= print                                 

Timing result:

$ time example
Falsifiable. Counter-example:
  x  = 32066 :: Word16
  s1 = 32066 :: Word16

real    0m8.379s
user    0m5.149s
sys     0m2.464s
$ time example quant 
Falsifiable. Counter-example:
  x  = 0 :: Word16
  s1 = 0 :: Word16

real    0m2.549s
user    0m2.271s
sys     0m0.592s
LeventErkok commented 5 days ago

Thanks for the report.

Unfortunately I cannot really replicate this to see if we can improve it in any way, because it uses a pretty old version of SBV. In particular, the handling of quantifiers has changed significantly, and your program is no longer valid. (I can't really go back and fix things in older releases.)

If you could send a test case against the latest SBV release, I can take a closer look to see if things can be improved. Do not worry about making it go fast/slow; just create a problem that takes too long to send to z3, and I'll see what the bottleneck might be.

johnbender commented 5 days ago

Updated for testing with v10.12. The case with the quantifier takes about half the time of the one with the conjuncts. Our suspicions about the cause remain the same, but we are grateful for your time and attention either way!

import Data.SBV
import System.Environment (getArgs)

termCount = 20000

generateTerm quant = do
  -- (\(Forall x) -> foldr (\i acc -> acc .&& x .== (i :: SWord16)) sTrue (take termCount [1..]))
  (\x -> foldr (\i acc -> acc .&& x .== (i :: SWord16)) sTrue (take termCount [1..]))

config trns =
  z3 { transcript = if trns then Just "/tmp/sbv.minimal.out" else Nothing }

main :: IO ()
main = do
  args <- getArgs
  let quant = elem "quant" args
  let trans = elem "transcript" args
  proveWith (config trans) (generateTerm quant) >>= print
LeventErkok commented 4 days ago

Is this the correct program?

Note that your generateTerm does not use the quant parameter as it stands.

johnbender commented 4 days ago

Yep, this is correct; the unused parameter is just a leftover from modifying the original. The commented-out term is there for comparison.

LeventErkok commented 4 days ago

It'd be nice if there were a way to compare fast/slow versions, if you can craft something like that. But this is OK too. My understanding is that this benchmark is way too slow for your use case. Correct? Is there any way you can create slow/fast variants with the latest SBV?

johnbender commented 4 days ago

Sorry, I took "Do not worry about making it go fast/slow" to mean that I shouldn't worry about including both in the program. The following will work with the quant argument again.

import Data.SBV
import System.Environment (getArgs)

termCount = 20000

config trns =
  z3 { transcript = if trns then Just "/tmp/sbv.minimal.out" else Nothing }

main :: IO ()
main = do
  args <- getArgs
  let quant = elem "quant" args
  let trans = elem "transcript" args
  (if quant
   then proveWith (config trans) (\(Forall x) -> foldr (\i acc -> acc .&& x .== (i :: SWord16)) sTrue (take termCount [1..]))
   else proveWith (config trans) (\x -> foldr (\i acc -> acc .&& x .== (i :: SWord16)) sTrue (take termCount [1..])))
     >>= print
johnbender commented 4 days ago

Ah, and to answer your question about our use case: we have many queries that involve a large top-level conjunction, and the overhead demonstrated here is roughly 3x. The total time in the best case for these queries is already quite long (many minutes), so the increase is expensive in practice.

LeventErkok commented 4 days ago

Cool. Let me take a look and see if I can spot any possible improvements/work-arounds.

LeventErkok commented 4 days ago

What I'm seeing is that the quant version runs much slower:

$ time ./a quant
Falsifiable
./a quant  16.58s user 1.13s system 97% cpu 18.146 total
$ time ./a
Falsifiable. Counter-example:
  s0 = 52768 :: Word16
./a  5.57s user 0.96s system 102% cpu 6.345 total

I thought you said the quantified version ran faster. Am I not running this correctly?

johnbender commented 4 days ago

That's the correct way to run it, but I'm getting a significantly different timing result. What version of z3 do you have? I'm running 4.12.2.

LeventErkok commented 4 days ago
$ z3 --version
Z3 version 4.13.4 - 64 bit

I actually compile z3 from source; this one was compiled about a week ago.

johnbender commented 4 days ago

I tried with 4.13.3 and got the same result. If I do a profiled build, would it be helpful to attach that here?

LeventErkok commented 4 days ago

Before we get to profiling...

How did you compile the sources? I did a straight ghc -o a a.hs. There have been cases in the past where ghc -O2 produced worse-performing code. I wonder if this is one of those cases.

johnbender commented 4 days ago

I'm compiling within an existing stack project, which raised the question of what flags we're using; we have the following in our stack.yaml:

ghc-options:                                    
- -threaded                                     
- -rtsopts                                      
- -with-rtsopts=-N1 # default to a single thread

When I remove the -threaded flag (and also the -with-rtsopts=-N1, since it's only valid when -threaded is present), the times are now the same (that is, the quant variant is no longer faster), though they do not match the times you are seeing.

LeventErkok commented 4 days ago

Interesting. Honestly, I've no idea why those options would matter.

Of course, we can also treat this as simply "this should run faster." In that case, I assume you'd want to focus on making the non-quantifier version run faster?

johnbender commented 4 days ago

I suppose, but separately I can't understand why the non-quantified version would run so much faster on your machine. I'm working on a 3 GHz Xeon with 100 GB of RAM, and the best I can get is ~5s with a termCount of 20000. Do your results change when you add -threaded?

Correction: I was initially confused by the single-line output, but it looks comparable. I am still interested in how -threaded affects your timing results. Could it be that the machine over here handles the quantified case fast enough that the solver time is negligible in our comparison?

LeventErkok commented 4 days ago

Even with -threaded, quant runs a lot slower for me.

I'm on a 5-year-old laptop with a mere 16 GB of memory. Not that it should matter.

johnbender commented 4 days ago

Could you try printing the transcript for both examples (potentially with a much smaller term count) and running Z3 on it directly, to see how long the solving there takes? Our observation was that, when we produced a transcript and handed it directly to Z3, the result was nearly instant for both. I'm guessing that solving the quantified example takes roughly 9 seconds on your end (the difference between the two), while on our end the solving might be fast enough that we don't see the quantified example as being much slower.

Note that the transcript file name doesn't change with the command-line arg, so you'll have to move the transcript file after one of the runs if you want to keep both.
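
If it helps, a hypothetical tweak to the config helper (not part of the repro above; the file names here are made up) would let the transcript name follow the flag so both files can be kept:

config :: Bool -> Bool -> SMTConfig
config trns quant =
  z3 { transcript = if trns
                      then Just (if quant then "/tmp/sbv.quant.out" else "/tmp/sbv.noquant.out")
                      else Nothing }

(main would then call config trans quant instead of config trans.)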

LeventErkok commented 4 days ago

z3 seems to be spending a bit more time on the quantified version, but definitely not enough to account for the difference:

$ time z3 quant.smt2 > /dev/null
z3 quant.smt2 > /dev/null  1.50s user 0.12s system 97% cpu 1.659 total
$ time z3 noquant.smt2 > /dev/null
z3 noquant.smt2 > /dev/null  0.84s user 0.17s system 97% cpu 1.030 total
johnbender commented 4 days ago

I'm not sure what to make of the 14 seconds of overhead for the quant example and the 4 seconds of overhead for the conjunct example, beyond our original hypothesis that the communication/interaction with the solver is the bottleneck.

My experiments with batching the SMTLib strings into fewer communications didn't make much progress (in the limited time I applied to them) because of expectations in runSolver about responses.

LeventErkok commented 4 days ago

It's hard to tell where the bottleneck really is. I suppose it's time to profile. Can you do a profile run? Let's see what gets called a lot.

johnbender commented 4 days ago

Attached are strace profiles and basic GHC profiling output for both the quant and no-quant runs.

A few notes:

  1. Profiling with GHC profiling enabled does not include system-call overhead, so I included a basic strace profile.
  2. In the strace profile without the quantifier, the system calls take 1.2 of 4 seconds on my system; in the profile with the quantifier, they take 0.2 seconds. To rerun it yourself: strace -c ./a quant (I believe).

example-quant.strace.txt example-quant.prof.txt example-no-quant.strace.txt example-no-quant.prof.txt

LeventErkok commented 4 days ago

Thanks..

I need to look at the profile data in more detail. But at first glance, it looks like uniplate calls are getting really expensive. SBV performs multiple traversals over the final program, and this seems to add up when you have huge programs.

Some of these traversals just ensure that we haven't screwed anything up (i.e., making sure internal SBV invariants hold over the program). Unfortunately there's no way to turn these off.

Some of them are used to compute certain data. For instance, to determine which logic to use, we do traversals to figure out what features have been used. Not all of this can be avoided, but perhaps an explicit call to setLogic can help here: if you call setLogic yourself, then SBV will skip some of these expensive checks and will just take the user-given logic.

I don't think this'll save much; perhaps nothing at all. But can you try calling setLogic explicitly yourself to see if it makes any difference?
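
Something along these lines is what I have in mind; an untested sketch, reusing the conjunction from your repro and picking QF_BV for the non-quantified case:

import Data.SBV

termCount :: Int
termCount = 20000

main :: IO ()
main = do
  r <- proveWith z3 $ do
         setLogic QF_BV   -- declare the logic up front; SBV then takes it as given
         x <- free "x"
         return $ foldr (\i acc -> acc .&& x .== (i :: SWord16)) sTrue (take termCount [1..])
  print r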

Again, I need to look at the profile further to see where else we can cut. I imagine there'll be a configuration parameter of the form skipChecks or some such, which, if the user sets it, will cause SBV to skip some of these checks. Exactly where to put these, and whether they'll have an impact, is TBD.

Thanks for collecting the data. I'd appreciate it if you could do the setLogic experiment to see if it moves the needle at all.

johnbender commented 3 days ago

I didn't test all the possible logic values, but for many of the quantifier-free options, using the SetLogic solver option significantly increases the execution time (from ~4 seconds to ~24 seconds).

LeventErkok commented 3 days ago

I should’ve been clearer: I was asking for just one logic setting to be tried; QF_BV for the unquantified case and ALL for the quantified case.

In any case, that’s unlikely to be the root cause. I need to look into more of the details of the profile. Alas, I’m traveling for the next two weeks, so this’ll take a while.

johnbender commented 3 days ago

Safe travels, I hope it's an enjoyable trip! I'll keep hacking on my end and post here if I have anything useful to add.