Type check falsely succeeds with --n_cores x where x > 1

SECtim commented 5 years ago

First a few version infos: Z3 version 4.4.1

fstar.exe --version
F* 0.9.6.0
platform=Linux_x86_64
compiler=OCaml 4.05.0
date=2018-05-17T23:34:58+00:00
commit=743819b90

Here's what I did:

Type checked some fst file which opens some modules and used --n_cores 4 as well as an --include ... option on fstar.
Type check succeeds (while it obviously shouldn't from looking at the code!).
Did exactly the same with --n_cores 1 (explicitly and implicitly - didn't notice any difference).
Type check fails (as expected).

The wrong behavior is reproducible, also with --n_cores 2 and 3 (probably anything greater than 1).

Unfortunately, I can't just post the actual code I was working on, but I'll try to get some kind of minimal working example to illustrate the problem and post it here. Maybe interesting: The part of the code expected to fail is the "definition" of a lemma (such as: my_lemma a b c = ()).

Here's the command line log (with some names changed, but you get the idea):

$ fstar.exe --include ../../modules --n_cores 4 System.fst
Verified module: System (5760 milliseconds)
All verification conditions discharged successfully

$ fstar.exe --include ../../modules --n_cores 3 System.fst
Verified module: System (5808 milliseconds)
All verification conditions discharged successfully

$ fstar.exe --include ../../modules --n_cores 2 System.fst
Verified module: System (5743 milliseconds)
All verification conditions discharged successfully
Thread 5 killed on uncaught exception Sys_error("Bad file descriptor")
Raised by primitive operation at file "pervasives.ml", line 403, characters 7-32
Called from file "src/batInnerIO.ml", line 381, characters 16-41
Called from file "src/batInnerIO.ml", line 215, characters 16-34
Called from file "src/batInnerIO.ml", line 311, characters 12-26
Called from file "src/batInnerIO.ml", line 317, characters 4-10
Called from file "src/basic/ml/FStar_Util.ml", line 217, characters 14-39
Called from file "thread.ml", line 39, characters 8-14

$ fstar.exe --include ../../modules --n_cores 1 System.fst
<dummy>(0,0-0,0): (Warning 282) Expected Z3 commit hash "1f29cebd4df6", got "Z3 version 4.4.1"
Please download the version of Z3 corresponding to your platform from:
https://github.com/FStarLang/binaries/tree/master/z3-tested
and add the bin/ subdirectory into your PATH

System.fst(108,47-108,49): (Error 19) could not prove post-condition (see also System.fsti(120,9-120,67))
Verified module: System (3750 milliseconds)
1 error was reported (see above)

mtzguido commented 5 years ago

Here's a minimal example :)

module F
let _ = assert False

$ ./bin/fstar.exe F.fst  --n_cores 7
Verified module: F (59 milliseconds)
All verification conditions discharged successfully

I also get those crashes if I tweak the number a bit. So yeah, this is badly broken. Thanks a lot for reporting it.

SECtim commented 5 years ago

Well, that was fast :) Thanks for the example, I'm not that familiar with F* yet, so I probably wouldn't have started with such a small example.

nikswamy commented 5 years ago

--n_cores is not part of any of our regression suites. So, sadly, I'm not surprised it has regressed to the point of unsoundness.

We tend to exploit parallelism in large projects by spawning many F processes in parallel, rather than having each F process spawn multiple Z3's.

Is it worth retaining and maintaining this feature?

@wintersteiger, you may be interested, since you also worked on this.

catalin-hritcu commented 5 years ago

Not sure if maybe @shaolintl would still be interested in looking at this.

catalin-hritcu commented 5 years ago

Tomer is busy with teaching at least until spring, so it would be great if someone picked up maintaining this feature that used to work fine a while back. @wintersteiger maybe?

This feature should be good use for interactive proof development, where make -j is little help. I know that at least @beurdouche was using it at some point.

catalin-hritcu commented 5 years ago

We discussed this in today's F* call, and the main perceived problem with this feature is that it could add nondeterminism in the SMT proving process, so a proof that works with --n_cores k can fail with --n_cores k' or without --n_cores. This is in particular related to the flakiness of push and pop in Z3. The consensus in our call was that if no one is using this feature and no one has the cycles to maintain it we should remove it.

catalin-hritcu commented 5 years ago

(even if it's sad for all the work that went into implementing this feature: #146)

wintersteiger commented 5 years ago

I suppose we could force ourselves to actually use it by default in interactive mode, to see whether it's really fit for everyday use. Of course, that means fixing this bug first. If it's as easy as I suspect it is, I can probably fix it soon, but overall I don't have much time to donate at the moment.

I suspect that we would have to fix some of the non-determinism to make it really usable. That may be quite a lot of work, although perhaps the same behaviour as with --z3reset would be acceptable. The non-determinism in which verification condition fails could be resolved by artificially waiting for as many results as necessary to identify the earliest failing verification condition.

nikswamy commented 5 years ago

I dread reports of proofs succeeding in interactive mode with --n_cores and failing on the command line without it ... or vice versa.

I wonder how important this is for interactive mode anyway: typically, we feed one top-level term at a time, which typically produces (with two_phase_tc) a single SMT query. So, there's no opportunity for parallelism in this usage.

It may improve verification times when feeding more than one top-level term at a time interactively, but the variation in Z3 queries that it would produce makes me think that the maintenance cost of this feature will be very high.

catalin-hritcu commented 5 years ago

If this was easy to bring back, I could try to run some experiments and report on the speedup. Looking at #146, the only numbers we had back then were an 1.5x speedup on f-omega.fst using 4 cores.

wintersteiger commented 5 years ago

For this type of question it would be awesome if we instrumented some of our F*/Emacs-modes to keep a history of all intermediate queries for some period of time, as a testable and benchmarkable representation of interactive proof work.

mtzguido commented 5 years ago

So what do we do for the time being? Disable it, deprecate it with a warning, only allow it when also passing --experimental (or something)?

Regarding the non-determinism: it would go away if we spawned a new process per query, right? I know that's slow, but isn't that mostly due to the time it takes Z3 to process the context from previous modules? Can't that be somehow cached at the Z3 level, perhaps into a file? (Or by forking the process, maybe?) If something like this could work, it might improve stability even without parallelism.

In higher-level terms: if this feature cannot be made absolutely bulletproof by design, I would argue against it.

nikswamy commented 5 years ago

I'm very pessimistic about getting this feature working to a point where we can eliminate the nondeterminism from small variations in SMT queries fed to Z3. I am in favor of removing the feature.

catalin-hritcu commented 5 years ago

I'm okay with this feature getting removed for eliminating the current inconsistency and (alleged) nondeterminism, as well as for gaining some maintainability.

Most proof assistants have long been reluctant to parallel processing, but it seems that Isabelle's great user experience based on parallel processing (https://www.sciencedirect.com/science/article/pii/S1571066112000291) is where most are heading now, including for instance Coq: https://coq.inria.fr/refman/addendum/parallel-proof-processing.html I think F* will eventually get there too, even if the process is not easy or monotonic.

mtzguido commented 5 years ago

@catalin-hritcu I think this issue still stands (the PR wasn't merged).

FStarLang / FStar

Type check falsely succeeds with --n_cores x where x > 1 #1538