Copilot-Language / copilot

A stream-based runtime-verification framework for generating hard real-time C code.
http://copilot-language.github.io
658 stars 55 forks source link

`copilot`: Results of `what4` example do not match expectation #535

Closed ivanperez-keera closed 2 weeks ago

ivanperez-keera commented 1 month ago

The example in Propositional states that some of the definitions are invalid (cannot be proven), especially when defined by induction. However, that is not the case, at least not anymore. A run of that example, produces:

*Main> main
Example 1: valid
Example 2: invalid
Example 3: valid
Example 4: valid
Example 5: valid
Example 6: valid
Example 7: valid

The only one that is invalid is one that is false.

RyanGlScott commented 1 month ago

I believe that this is a leftover from back before Copilot.Theorem.What4 was using its old heuristics for determining how to handle inductive-style proofs. In 4ae3f249fbf8c37c0909046150ff341d432bd042, Copilot.Theorem.What4 was re-written to use a k-induction–based heuristic, which is more clever and can handle more inductive proofs out of the box, including some of the Propositional examples.

That being said, Copilot.Theorem.What4 is still using heuristics, and it's possible to fool the heuristics if you try hard enough. Here is a modification to Example 3 that will cause it to become invalid instead of valid.

diff --git a/copilot/examples/what4/Propositional.hs b/copilot/examples/what4/Propositional.hs
index 75dcf55f..05bc7da9 100644
--- a/copilot/examples/what4/Propositional.hs
+++ b/copilot/examples/what4/Propositional.hs
@@ -22,7 +22,7 @@ spec = do
   -- An inductively defined flavor of true, which requires induction to prove,
   -- and hence is found to be invalid by the SMT solver (since no inductive
   -- hypothesis is made).
-  let a = [True] ++ a
+  let a = [True] ++ ([True] ++ ([True] ++ a))
   void $ prop "Example 3" (forAll a)

   -- An inductively defined "a or not a" proposition, which is unprovable by

(Note that let a = [True] ++ ([True] ++ ([True] ++ a)) is the result of appending multiple streams where each stream has a history of length 1. It shouldn't be confused with let a = [True, True, True] ++ a, which is a single stream with a history of length 3.)

Perhaps we should update the comments in Propositional, and consider including the more complicated example above as something that the heuristics cannot handle out of the box?

ivanperez-keera commented 1 month ago

let a = [True] ++ ([True] ++ ([True] ++ a)) is the result of appending multiple streams where each stream has a history of length 1. It shouldn't be confused with let a = [True, True, True] ++ a, which is a single stream with a history of length 3.

Should those be different?

RyanGlScott commented 1 month ago

The two streams are equivalent in terms of behavior, but Copilot stores each stream differently in its internal representation. (One could imagine an optimization that turns the let a = [True] ++ ([True] ++ ([True] ++ a)) into let a = [True, True, True] ++ a, but Copilot doesn't currently perform such an optimization.)

Copilot.Theorem.What4's heuristic is sensitive to this internal representation, as it uses the maximum history length of all streams in the specification to determine how to much work it needs to do an inductive proof. Generally speaking, this means that if you have streams with a longer history, the more work the heuristic will do (and the more likely it is that the proof will cover all of the necessary base cases).

ivanperez-keera commented 1 month ago

Description

The example included in copilot/examples/what4/Propositional.hs includes comments indicating the expectations for each of the statements that can/cannot be proven with Z3. Those comments are incorrect wrt. the current implementation.

Type

Additional context

None.

Requester

Method to check presence of bug

Running the example if copilot/examples/what4/Propositional.hs produces an output that does not match the comments in the code. For example, the third example reads: https://github.com/Copilot-Language/copilot/blob/068c06dd7ab6e900e2e8728ecb1c3b6e94ba9ccb/copilot/examples/what4/Propositional.hs#L22-L26

but, when running the file, the output is:

*Main> main
Example 1: valid
Example 2: invalid
Example 3: valid
Example 4: valid
Example 5: valid
Example 6: valid
Example 7: valid

Expected result

The comments in and the output of running the file copilot/examples/what4/Propositional.hs match.

Desired result

The comments in and the output of running the file copilot/examples/what4/Propositional.hs match.

Proposed solution

Modify examples so that they are consistent with output, potentially duplicating them so that basic cases, which are handled by copilot-theorem, are shown separately from those that it cannot prove valid.

Further notes

ivanperez-keera commented 1 month ago

Change Manager: Confirmed that the issue exists.

ivanperez-keera commented 1 month ago

Technical Lead: Confirmed that the issue should be addressed.

ivanperez-keera commented 1 month ago

Technical Lead: Issue scheduled for fixing in Copilot 4.1.

Fix assigned to: @RyanGlScott .

RyanGlScott commented 1 month ago

Implementor: Solution implemented, review requested.

ivanperez-keera commented 2 weeks ago

Change Manager: Verified that:

ivanperez-keera commented 2 weeks ago

Change Manager: Implementation ready to be merged.