NetworkVerification / nv

A Framework for Modeling and Analyzing Network Configurations
MIT License

Add -timeout and -parallel-timeout command-line flags #72

Closed alberdingk-thijm closed 3 years ago

alberdingk-thijm commented 3 years ago

Sometimes, when profiling and benchmarking NV, a file takes so long to run that you would rather just give up after some number of seconds. Right now, the way to do that is to call nv from within another program, say time or a Python script, which kills it once the timeout is hit. Usually this is totally fine to do, and pretty flexible.

As it happens, I have a situation in mind where this approach doesn't work as neatly as I'd hope: providing a timeout for a network we partition. Since partitioning is not set up to run in parallel, we analyze each partitioned network sequentially, and hence may time out partway through that process. In a magical, perfect world where NV had as many threads as we had partitions, we could run everything fully in parallel and potentially never hit the timeout.

What I'd like to do is add a -timeout flag to NV that, in its most basic form, takes an integer T and reports a timeout if T seconds have elapsed since the program began, along with the stage NV was in when the timeout occurred.
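
For concreteness, here's a rough sketch in OCaml of what that check could look like, run at each pipeline stage boundary. The names here are hypothetical, not actual NV code:

(* Hypothetical sketch of the basic -timeout behavior: at each stage
   boundary, check the elapsed wall-clock time and bail out with the
   name of the current stage if the budget is exhausted. *)
exception Timed_out of string  (* carries the stage we were in *)

let check_deadline ~start_time ~timeout ~stage =
  if Unix.gettimeofday () -. start_time > timeout then
    raise (Timed_out stage)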

I would also like a second, complementary boolean flag, -parallel-timeout, which tells NV that when running Kirigami, once it hits the entry point of Main_defs.run_smt_classic_aux at K seconds, it should multiply the remaining time until timeout, T - K, by the number of partitions N, giving us a total timeout length of K + (T - K) * N. Since I don't know K before running nv, integrating this feature into the tool directly allows me to handle the situation described above.
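
To make the arithmetic concrete, a minimal sketch of the proposed adjustment (again with hypothetical names): the budget remaining at the entry point is scaled by the number of partitions, since the partitions run sequentially here but could in principle run in parallel.

(* k = seconds elapsed when partitioned solving begins,
   t = the original timeout T, n = the number of partitions N.
   Returns the adjusted total timeout K + (T - K) * N. *)
let parallel_adjusted_timeout ~elapsed:k ~timeout:t ~num_partitions:n =
  k +. ((t -. k) *. float_of_int n)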

I'll close this issue once I've added the two flags: happy to handle all this myself, but wanted to add an issue in case this had unintended consequences, folks had suggestions, etc.

DKLoehr commented 3 years ago

Sounds reasonable.

alberdingk-thijm commented 3 years ago

As of commit bbf585eb, we now have a working -t N option which does something different from what is described above: it acts as an interface to Z3's -t N "soft timeout" option. This means that when z3 starts handling a query via (check-sat), if N seconds pass while it is processing that query, it will give up and return "unknown". The code then checks whether the reason for the returned unknown was that the query was canceled, meaning that it timed out (rather than that the query is unsolvable).

Because we handle this timeout per query, it has a nice "bonus" feature: each partitioned query is given that timeout as well, since we reuse the existing z3 process for all queries. Hence, we get the following behavior for the shortest path 10-pod fattree example, when vertically partitioned:

> ./nv -v -m -t 3 -k benchmarks/SinglePrefix/FAT10/sp10-v-x.nv
Processing benchmarks/SinglePrefix/FAT10/sp10-v-x.nv
Processing examples/utils.nv
Inlining took: 0.043449 secs to complete
Map unrolling took: 0.330889 secs to complete
Inlining took: 0.027739 secs to complete
Unbox options took: 0.077001 secs to complete
Flattening Tuples took: 0.162809 secs to complete
Partitioning took: 1.196501 secs to complete
Encoding network took: 21.391267 secs to complete
Compiling query took: 0.124481 secs to complete
Z3 stats:
(:memory                         94.35
 :time                           2.71)
Slice 0 took: 24.646344 secs to complete
Encoding network took: 22.070904 secs to complete
Compiling query took: 0.113780 secs to complete
Z3 stats:
(:memory                         95.34
 :time                           3.01)
Slice 0 took: 25.635070 secs to complete
Partition 0
No counterexamples found
Partition 1
Verification failed: Z3 timed out after 3 second(s)
Entire Program took: 52.828563 secs to complete

The tl;dr is that partitions can time out independently of one another, and this will be reported separately.
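
For reference, here's a rough OCaml sketch (not the actual NV code) of how distinguishing a timeout from a genuinely inconclusive answer might look. After (check-sat) returns unknown, querying z3 with (get-info :reason-unknown) reports "canceled" when the soft timeout fired:

type smt_result = Sat | Unsat | Timeout | Unknown

(* answer is z3's reply to (check-sat); reason is the string extracted
   from its reply to (get-info :reason-unknown), queried only when the
   answer was "unknown". *)
let classify ~answer ~reason =
  match answer with
  | "sat" -> Sat
  | "unsat" -> Unsat
  | "unknown" -> if reason = "canceled" then Timeout else Unknown
  | _ -> Unknown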

As far as the other concerns mentioned in this issue go, this solves my use case for now. If we find that we want different timeout behaviors, we can add them in at a later date.