NetworkVerification / nv

A Framework for Modeling and Analyzing Network Configurations
MIT License

Adding constants during SMT encoding is slow #73

Closed alberdingk-thijm closed 2 years ago

alberdingk-thijm commented 2 years ago

SMT encoding remains a major bottleneck for NV. I ran some diagnostics to get a better sense of where time is currently spent during encoding: after compiling NV with -g set (see the kirigami-optimization branch), I ran perf record --call-graph dwarf ./nv.debug -v -m benchmarks/SinglePrefix/FAT4/fat4Pol.nv and analyzed the results using speedscope.

Instances of Nv_smt.SmtUtils.add_constant_inner_4394 accounted for 57% of NV's total runtime. Looking more closely, this function adds constants to the smt_env.const_decls set of constants. const_decls is a ConstantSet (i.e. a BatSet with constant keys), where keys c1 and c2 are compared using BatString.numeric_compare c1.name c2.name (the names being unique strings).
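For readers unfamiliar with the code, the shape of const_decls is roughly the following. This is a sketch using the stdlib Set in place of BatSet, and the constant record here is an assumption based on the issue text, not NV's actual definition:

```ocaml
(* Sketch of const_decls: a set of constants ordered by name. The
   [constant] record is illustrative; NV's real type has more fields. *)
type constant = { name : string }

module ConstantSet = Set.Make (struct
  type t = constant
  (* NV uses BatString.numeric_compare here; String.compare is the
     plain lexicographic stand-in discussed below. *)
  let compare c1 c2 = String.compare c1.name c2.name
end)

let add_constant env c = ConstantSet.add c env

(* Adding the same name twice leaves a single element in the set. *)
let demo () =
  let env = ConstantSet.empty in
  let env = add_constant env { name = "c1" } in
  let env = add_constant env { name = "c1" } in
  ConstantSet.cardinal env
```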

The first question is whether it is necessary to compare strings numerically here: the standard String.compare seems sufficient for adding constants by name to a set. The second is whether a BatSet is the most efficient data structure for this operation.

As another example, after adding some diagnostic print statements to add_constant, I found that fat4Pol.nv calls add_constant 41633 times, of which 680 calls (1.63%) do nothing since the key is already present. Thus the majority of add_constant's time is spent adding constants to a set, with only a small number of cases where we use the set's features to avoid adding duplicates.
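The diagnostic described above can be reproduced with a counting wrapper along these lines; the module and function names here are illustrative, not NV's:

```ocaml
(* Hypothetical instrumentation: count how many add calls are no-ops
   because the key is already present, versus genuine insertions. *)
module StringSet = Set.Make (String)

let total_calls = ref 0
let duplicate_calls = ref 0

let add_constant_counted env name =
  incr total_calls;
  if StringSet.mem name env then begin
    incr duplicate_calls;  (* key already present: the add is a no-op *)
    env
  end
  else StringSet.add name env

(* Three calls, one of which is a duplicate. *)
let final_env =
  List.fold_left add_constant_counted StringSet.empty [ "c1"; "c2"; "c1" ]
```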

Now, where do we read from const_decls?

As far as the first three functions go, none of this behaviour requires the constants to be a set: all of these operations traverse all the constants anyway. (The find_input_symbolics implementation also admits an obvious optimization: we know which symbolics these are before calling encode_kirigami_solve, so we can save a linear traversal of the set by computing them ahead of time.)

The only benefit the set gives us, as far as I can tell, is that it lets us catch duplicate keys in log-time (since the implementation of BatSet is a balanced binary tree). Is there any way we can avoid trying to add these duplicate keys to begin with, and then use a faster data structure to store the constants?
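One candidate faster structure, sketched below: a Hashtbl keyed by name gives amortized constant-time membership and insertion instead of BatSet's balanced-tree log-time, while still catching duplicates. This is an illustration of the idea, not a proposed patch; names are assumed:

```ocaml
(* Hash table keyed by constant name: O(1) amortized duplicate check. *)
let const_decls : (string, unit) Hashtbl.t = Hashtbl.create 4096

let add_constant name =
  (* The membership test makes duplicate adds explicit no-ops, matching
     the current set semantics. *)
  if not (Hashtbl.mem const_decls name) then
    Hashtbl.add const_decls name ()

(* Duplicate "c1" is dropped, leaving two entries. *)
let () = List.iter add_constant [ "c1"; "c2"; "c1" ]
```

The trade-off is that a Hashtbl is unordered, so any consumer that relies on iterating constants in name order would need a sort at read time.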

alberdingk-thijm commented 2 years ago

Looking at commit history, I can see that I am the one to blame for the use of numeric_compare as a way to fix the handling of find_input_symbolics's search for constants.

I will see about reverting that change and using the proposed optimization for find_input_symbolics to see if that addresses the issue.

nickgian commented 2 years ago

I don’t remember the details of the encoding unfortunately.

EDIT: rephrasing what I wrote before.

One alternative, based on what you wrote in the issue, is to use the fresh integer id associated with each constant and do all processing with this id, except perhaps for printing the query, where you want to map the id back to its full name. So you would need a map from ints to strings. Comparing integers should be cheaper than comparing strings (by the way, what does numeric_compare do?), and if you really want to squeeze out every last bit of performance you can use a very large array instead of a set. I think that might work?

In your performance analysis can you see how much time is spent in string comparisons, especially ones that stem from set operations?

EDIT 2: so you would have to change the create_name function to use the Var.id (or whatever we call it) instead of the string, and then associate this id to the string currently used.
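The integer-id scheme suggested above could look something like the following sketch. The names (fresh_id, name_of_id) are hypothetical stand-ins for however create_name and Var.id end up being wired together:

```ocaml
(* Identify each constant by a fresh integer id; keep an id -> name
   table only for printing the query. Comparisons then happen on ints. *)
let counter = ref 0
let names : (int, string) Hashtbl.t = Hashtbl.create 4096

let fresh_id name =
  let id = !counter in
  incr counter;
  Hashtbl.replace names id name;
  id

(* Only needed when emitting the SMT query text. *)
let name_of_id id = Hashtbl.find names id

(* Declarations become a set of ints, with cheap integer compares. *)
module IntSet = Set.Make (Int)

let a = fresh_id "symb-x"
let b = fresh_id "symb-y"
let decls = IntSet.of_list [ a; b ]
```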

alberdingk-thijm commented 2 years ago

Yeah I think integer IDs would probably be fine for this. BatString.numeric_compare is a comparison that respects the ordering of numbers: once it hits a numeric part of a string, it compares the two numbers as integers. It does a good bit of conversion to manage this, and speedscope reports that BatString.fun_2772 (which I'm pretty sure is the call to numeric_compare) takes 2.61s (43% of runtime), so it seems like the obvious culprit.
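The difference between the two comparisons is only about ordering, not uniqueness, which is why the switch back is safe. A small illustration with stdlib String.compare:

```ocaml
(* Lexicographic order puts "c10" before "c2" (comparing '1' < '2'),
   whereas BatString.numeric_compare would order them numerically as
   2 < 10. For set membership either total order works, since two
   names compare equal exactly when the strings are equal. *)
let c10_before_c2 = String.compare "c10" "c2" < 0
let equal_iff_same = String.compare "c2" "c2" = 0
```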

I'll switch it back to a regular String.compare to start, and then see how much more performance we can squeeze out of it if I use integer IDs. Thanks for weighing in!