zvonimir opened 4 years ago
Ok, this seems to be due to switching the default arithmetic solver to `lra` instead of `simplex`. When I revert to `simplex`, it terminates fast. Is this expected, or is it maybe a performance bug? Could you maybe comment on the `lra` vs `simplex` trade-offs a little bit, and also on what you would suggest we do? I am sure you had good reasons to switch to `lra`, but it seems that on our benchmarks, even on simple ones, `simplex` performs better. Thanks!
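For readers following along: the arithmetic solver can be selected explicitly. The option name and values below come from later messages in this thread (`smt.arith.solver 2` selects the legacy simplex-based solver, `6` the new one); treat this as a sketch, not authoritative documentation:

```
; In the SMT2 file itself, before the assertions:
(set-option :smt.arith.solver 2)   ; legacy simplex-based solver
;(set-option :smt.arith.solver 6)  ; new solver

; Or equivalently on the command line:
;   z3 smt.arith.solver=2 query.smt2
```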
Some notes for diagnosis:
@zvonimir , it seems fixed in the current release. Can you give it a shot?
It appears to be a bit too early to test the change, but once we have figured this out, a more comprehensive validation will be worthwhile, especially on small examples like this one that are more feasible to analyze.
I can test it on our regressions as soon as you think it is ready. Is it ready btw (sorry, the two messages above confused me)? Also, do you still have a nightly release (can't find it any more) or should I compile it from scratch on my own? Thanks!
I tried the latest change. It still diverges depending on which random seed is used, so the change doesn't address the main cause, and this is a heads-up that you will run into similar regressions. The change seems to be for the better, though (it avoids some heuristics when there are big-nums).
Got it. So we'll wait until you stabilize this a bit more. Thx!
Now it seems really fixed.
Thanks @levnach! For @zvonimir: if you get a chance to try other regression tests with the default arithmetic solver setting it would be great to know about other places that are meaningful to your uses. We want to get the new arithmetic solver used in mainstream so weeding out issues that are relevant to users is a priority.
Thanks! I tried the new nightly build, and it is for sure working better in terms of performance/timeouts. However, there are still two benchmarks that are timing out, and they are terminating fine with the old arithmetic decision procedure. I am attaching the smaller one here: pero.txt Note that this one is a bit less straightforward in terms of what is causing this performance degradation since it contains quantifiers, etc. Please let me know what you think. Thanks!
Quick update. I ran some more extensive experiments beyond just our small regressions. I am noticing quite a few more timeouts with the new arithmetic solver on the same (or very similar) kinds of problems, meaning problems that generated queries similar to the one I posted just above.
Ok, thanks. We will carve out some time to go through this. If you have a chance to assemble a few more samples it could help to prepare for this examination.
I pushed a change that improves the picture somewhat on the attached example. It possibly doesn't address the root cause, but debugging the file revealed a way the solver could end up generating useless lemmas.
Your changes helped with our regressions. Since then, I ran the nightly version on thousands of SVCOMP benchmarks, and we are seeing around 5% or so more timeouts now. I don't see a clear pattern among the newly introduced timeouts that would let me pick good representative queries. However, I am attaching this one here: pero.txt As is the case with the previous queries, the old Z3 solves it super fast, while the new one chokes on it for some reason. Maybe you can take a look at it and see if you can spot anything obvious.
Oh, another weird thing about this one is that setting `smt.relevancy` to `0` or not makes a difference, which I thought should not be the case since the query contains no quantifiers. Any hints as to why that is so?
Thanks!
So I finally ran our large-scale experiments and did a comparison between the old arithmetic solver and the new one in the nightly build. Right now the differences seem to be minimal. Here is an example: http://node0.smackbenchwebserver.smack.emulab.net/results.py?runset=runs%2Fexec_2020.12.19_13.19.02.447419_SoftwareSystems-DeviceDriversLinux64-ReachSafety&runset=runs%2Fexec_2020.12.15_10.44.41.142974_SoftwareSystems-DeviceDriversLinux64-ReachSafety#/
In terms of benchmarks that did not time out, the result is 1367 (new solver) vs 1372 (old solver). So the old solver is still a bit better, but the differences are pretty negligible it seems.
So I hope we can make the switch to the new solver as soon as you publish your next release.
Thanks for your help!
Nice benchmarking infrastructure! The scatter plot looks a little too uniform to be believable. So there is a chance that some parameters got ignored or overwritten or omitted. I typically look for samples where the differences are starkest and work from these. It is true I have yet to dig into your newer examples on this thread.
Oh, you are right. Let me double-check this. It indeed looks weird that they are so similar. I'll get back to you.
If you have a way to retrieve SMT2 files, that also helps. For the particular case of the last example, it matters a lot which random seed I use. That often suggests there is something to improve with regard to fairness. Across a larger set of benchmarks, there will be enough samples where fairness or similar issues poke out.
Is there debug info that Z3 prints out that tells me which arithmetic solver it is running?
I use statistics and look for characteristics.
New solver (look for propagations-cheap)
(:added-eqs 2632
:arith-eq-adapter 163
:arith-assume-eqs 1
:arith-bound-propagations-cheap 1149
:arith-bound-propagations-lp 995
:arith-cheap-eqs 11
:arith-conflicts 1980
:arith-diseq 238
:arith-grobner-calls 9
:arith-horner-calls 249
:arith-horner-cross-nested-forms 243049
:arith-lower 2334
:arith-make-feasible 3863
:arith-max-columns 2372
:arith-max-rows 2006
:arith-nla-explanations 1975
:arith-nla-lemmas 1976
:arith-propagations 1149
:arith-upper 2262)
Old solver (look for arith-tableau-max-columns and similar):
(:added-eqs 113457
:arith-assert-diseq 7122
:arith-assert-lower 12130
:arith-assert-upper 14889
:arith-assume-eqs 3
:arith-bound-prop 32
:arith-branch-var 1
:arith-conflicts 413
:arith-eq-adapter 224
:arith-fixed-eqs 1924
:arith-gcd-tests 2
:arith-grobner 416
:arith-ineq-splits 1
:arith-max-min 8238
:arith-nonlinear-bounds 428
:arith-nonlinear-horner 8359
:arith-num-rows 45
:arith-offset-eqs 366
:arith-patches 1
:arith-pivots 3707
:arith-pseudo-nonlinear 292
:arith-row-summations 56690
:arith-tableau-max-columns 411
:arith-tableau-max-rows 45)
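For reference, listings like the two above can be printed by Z3 itself; a minimal sketch (this is standard Z3 usage, not something specific to this thread):

```
; At the end of the query:
(check-sat)
(get-info :all-statistics)   ; prints counters like those listed above

; Equivalently, from the command line:  z3 -st query.smt2
```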
Ok, I got it. So scatter plot generation is buggy for some reason. We haven't been using it that much recently, and I guess some of our recent updates broke it. Not sure. But you should totally disregard the scatter plots.
If you look at raw results in the table for this set: http://node0.smackbenchwebserver.smack.emulab.net/results.py?runset=runs%2Fexec_2020.12.19_13.15.50.232242_SoftwareSystems-AWS-C-Common-ReachSafety&runset=runs%2Fexec_2020.12.15_10.17.38.505765_SoftwareSystems-AWS-C-Common-ReachSafety#/table
you can see that there are noticeable differences in runtimes, especially for the longer-running queries.
I also did a sanity check based on your suggestion above, and we are indeed running the new arithmetic solver.
Sometimes taking shorter running examples that show average regressions is easier. For example these ones:
aws_byte_buf_append_dynamic_harness.yml  true  |  true 18.0 19.1 99.8  |  true 11.0 12.7 95.2
aws_priority_queue_remove_harness.yml    true  |  true 26.9 27.9 178   |  true 10.6 12.2 140
aws_priority_queue_push_harness.yml      true  |  true 24.4 25.3 166   |  true 10.5 12.0 145
aws_byte_buf_reset_harness.yml           true  |  true 18.5 19.6 92.1  |  true 10.5 12.4 92.1
aws_priority_queue_push_ref_harness.yml  true  |  true 20.8 21.9 160
Got it. If there is a particular one you are maybe interested in (or a few of them), I can generate the query for you.
Yes please, exactly these ones or 3-6 of them, such as:
aws_priority_queue_push_ref_harness.yml
aws_byte_buf_reset_harness.yml
aws_priority_queue_push_harness.yml
as SMTLIB2 commands, ....
For your pero.txt case: it is indeed fairness. Generally the new solver is much faster, but in some cases it gets trapped during non-linear lemma generation. It seems to indicate that we are not using inference rules with sufficient fairness. The example goes through if I manually remove the particular case for lemma generation in the code. (@levnach it is hitting the order_lemma_on_binomial_sign case, which uses the values of variables in the new lemmas. Since the values change between calls it may end up generating a large number of lemmas without yielding the floor to other procedures, such as tangent and nlsat.)
@zvonimir. A different report exposed places where the new solver was missing propagation. Fixes to this helped. If you can provide perf regression repros then great, otherwise I would like to close this issue.
Sorry for the delay. Let me run our regressions on the latest nightly release.
@NikolajBjorner Quick question. I don't see a nightly release for Ubuntu. I am looking here: https://github.com/Z3Prover/z3/releases/tag/Nightly Am I missing something? Should I build it from source on Ubuntu!? Thanks!
It gets named glibc these days.
Ok, so I tried the nightly version. At this point I see very few differences between the two solvers. However, there are still some left.
I attached one such query (tmp.zip), which gets solved using the old solver (meaning, when `(set-option :smt.arith.solver 2)` is set), while Z3 doesn't come back when the new solver is used. Maybe it is a relatively easy fix. Thanks!!!
Here is another one with the same issue: jain.zip It is smaller and so it might be easier to debug.
Thanks, this is definitely much more approachable. It is still not too easy to diagnose these so it is a longer process. But given that you are not blocked (you can still use the old solver) I am taking a long-term perspective on this.
Jain.smt2 was solved fast for me with the new solver at the latest version f4127bd6f3bd19079143ab73c4ab3779038a0cea. On the other hand, tmp.smt2 was not.
It depends on the smt.random_seed value. Some seeds are quick, others slow for (set-option :smt.arith.solver 6). Note also that the option is set inside the file. At least it is a much simpler test case than tmp.smt2.
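One way to probe the seed sensitivity mentioned above is a small sweep (a sketch, assuming `z3` is on PATH and the attached file is saved as `jain.smt2`; recall that `smt.arith.solver` is already set inside the file):

```
for s in 0 1 2 3 4; do
  echo "== smt.random_seed=$s =="
  timeout 60 z3 smt.random_seed=$s jain.smt2
done
```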
Not sure if this is helpful or not, but this is the C code of the `jain` benchmark: https://github.com/smackers/smack/blob/master/test/c/basic/jain_2_true.c
Maybe you can get a sense of what kinds of operations there are in the benchmark. There are 3 other `jain` benchmarks (see https://github.com/smackers/smack/tree/master/test/c/basic), which are all very similar, and yet those are not timing out.
Thanks!
@levnach For Jain.smt2, the new solver ends up with an infeasible tableau
x + y + 1 + 1/2 z = 0
u + v + 1/2 + 1/2 z = 0
where x, y, z, u, v are integers. It performs cuts on z and branches on x, y, u, v.
The tableau is infeasible since the first row implies that z has even parity, the second that z has odd parity.
So far it looks like the problem becomes solvable if the lp solver pivots into z. This happens on the successful run I see, but not for the unsuccessful random seeds. Determining whether adding fairness to the LP solver helps could be one avenue. Another is to revisit the notes https://github.com/Z3Prover/z3/blob/ba56bfa65640ff98133e1a04edf647c8b69bea71/src/math/lp/int_gcd_test.cpp#L32 on the accumulative gcd test. That would take care of scenarios like these.
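To make the parity argument concrete, here is a toy version of the gcd test (a sketch, not Z3's implementation): a row sum(a_i * x_i) + c = 0 has no integer solution when the gcd of the coefficients does not divide the constant, and eliminating z from the two rows above produces exactly such a row.

```python
from math import gcd

def gcd_test(coeffs, const):
    """Row sum(a_i * x_i) + c = 0 over integers is infeasible
    when gcd of the coefficients does not divide the constant."""
    g = 0
    for a in coeffs:
        g = gcd(g, a)
    return const % g == 0 if g else const == 0

# Scale both rows by 2 to clear the 1/2 coefficients:
#   x + y + 1 + 1/2 z = 0   ->  2x + 2y + z + 2 = 0   (coeffs of x, y, z)
#   u + v + 1/2 + 1/2 z = 0 ->  2u + 2v + z + 1 = 0   (coeffs of u, v, z)
row1 = ([2, 2, 1], 2)
row2 = ([2, 2, 1], 1)

# Each row passes the per-row test on its own (gcd = 1 divides anything):
print(gcd_test(*row1), gcd_test(*row2))   # True True

# Eliminating z (row1 - row2) gives 2x + 2y - 2u - 2v + 1 = 0,
# and gcd = 2 does not divide 1: infeasible, matching the parity argument.
print(gcd_test([2, 2, -2, -2], 1))        # False
```

The per-row test misses the conflict; only a combination of rows reveals it, which is consistent with the accumulative-gcd-test notes linked above.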
Z3 4.8.8 solves this query instantly (as well as all previous releases as far as I can remember), while the new Z3 4.8.9 times out: pero.zip
This is pretty unexpected since the query is small and (to the best of my knowledge) does not contain features where we would expect to see performance regressions when updating releases (non-linear arithmetic, quantifiers, etc.).
Any suggestions? What are we maybe missing?
Thanks!