NetworkVerification / nv

A Framework for Modeling and Analyzing Network Configurations
MIT License
31 stars 2 forks source link

Possible file descriptor race condition when running kirigami with many (>500) partitions #78

Open alberdingk-thijm opened 2 years ago

alberdingk-thijm commented 2 years ago

For some examples, particularly those in benchmarks/SinglePrefix/Random and examples/TopologyZoo/kdl/, NV will randomly fail to parse an SMT reply if the input has a very large number of partitions it creates. The output looks something like the following:

...
Z3 stats:
  :memory       21.09
 :time         0.01)
Slice 0 took: 0.213472 secs to complete
Encoding network took: 0.021362 secs to complete
Compiling query took: 0.005419 secs to complete

error: Solver error: (error "line 5669 column 41: Invalid constant declaration: unknown sort 'Boo'")
Fatal error: exception Nv_lang.Console.Error("Solver error: (error \"line 5669 column 41: Invalid constant declaration: unknown sort 'Boo'\")")
Raised at Nv_lang__Console.error in file "src/lib/lang/Console.ml", line 22, characters 2-19
Called from Nv_smt__SmtUtils.parse_reply in file "src/lib/smt/SmtUtils.ml", line 638, characters 9-45
Called from Nv_smt__SmtKirigami.solve.solve_aux.get_ret in file "src/lib/smt/SmtKirigami.ml", line 42, characters 18-39
Called from Nv_smt__SmtKirigami.solve.solve_aux in file "src/lib/smt/SmtKirigami.ml", line 62, characters 9-51
Called from Nv__Main_defs.run_smt_classic_aux.get_answer in file "src/lib/Main_defs.ml", line 57, characters 10-74
Called from Nv_utils__Profile.time_profile_absolute in file "src/lib/utils/Profile.ml", line 11, characters 12-16
Called from Nv__Main_defs.run_smt_classic_aux.solve_slices in file "src/lib/Main_defs.ml", line 79, characters 8-169
Called from BatList.map.loop in file "src/batList.mlv", line 237, characters 28-33
Called from BatList.map in file "src/batList.mlv", line 240, characters 4-12
Called from Main.main_func in file "src/exe/Main.ml", line 24, characters 9-45
Called from Nv_utils__Profile.time_profile_absolute in file "src/lib/utils/Profile.ml", line 11, characters 12-16
Called from Main.main in file "src/exe/Main.ml", line 35, characters 11-76

Note the error line: it seems as though Z3 has been given part of an SMT query but the query was cut off. Looking at the code for SolverUtil.ml, which manages the creation of Z3 processes, we can see that it's doing some work setting up a pipe and creating a process to run Z3, which happens each time we call Smt.solver. Normally, this only happens once, but the kirigami code will happily call Smt.solver as many times as there are partitions, which can of course be hundreds or thousands of times.

We don't seem to report any Unix errors when doing this, which is a problem, as it makes it hard to see why this function might fail. That being said, given that I am seeing this error only when using many partitions, and not all the time (sometimes NV completes without any issues), I'm guessing we're seeing a race condition where the file descriptor of one process gets closed by accident, whether because we've hit the maximum number of file descriptors or because it was reassigned and closed by accident. I don't know much about how OCaml and Batteries' Unix library works, but I'll start off by trying to get more error reporting to see if this might indeed be the source of the bug.

alberdingk-thijm commented 2 years ago

Looks like this is related to #15, and may need to be addressed by closing descriptors as we go, as mentioned back around this commit. @DKLoehr do we need to re-add code somewhere to call close_solver?