`copilot`: Results of `what4` example do not match expectation

ivanperez-keera commented 1 month ago

The example in Propositional states that some of the definitions are invalid (cannot be proven), especially when defined by induction. However, that is not the case, at least not anymore. Running that example produces:

```
*Main> main
Example 1: valid
Example 2: invalid
Example 3: valid
Example 4: valid
Example 5: valid
Example 6: valid
Example 7: valid
```

The only example reported as invalid is the one whose proposition is actually false.

RyanGlScott commented 1 month ago

I believe that this is a leftover from back when Copilot.Theorem.What4 was using its old heuristics for determining how to handle inductive-style proofs. In 4ae3f249fbf8c37c0909046150ff341d432bd042, Copilot.Theorem.What4 was rewritten to use a k-induction-based heuristic, which is more clever and can handle more inductive proofs out of the box, including some of the Propositional examples.

That being said, Copilot.Theorem.What4 is still using heuristics, and it's possible to fool the heuristics if you try hard enough. Here is a modification to Example 3 that will cause it to become invalid instead of valid.

```
diff --git a/copilot/examples/what4/Propositional.hs b/copilot/examples/what4/Propositional.hs
index 75dcf55f..05bc7da9 100644
--- a/copilot/examples/what4/Propositional.hs
+++ b/copilot/examples/what4/Propositional.hs
@@ -22,7 +22,7 @@ spec = do
   -- An inductively defined flavor of true, which requires induction to prove,
   -- and hence is found to be invalid by the SMT solver (since no inductive
   -- hypothesis is made).
-  let a = [True] ++ a
+  let a = [True] ++ ([True] ++ ([True] ++ a))
   void $ prop "Example 3" (forAll a)

   -- An inductively defined "a or not a" proposition, which is unprovable by
```

(Note that let a = [True] ++ ([True] ++ ([True] ++ a)) is the result of appending multiple streams where each stream has a history of length 1. It shouldn't be confused with let a = [True, True, True] ++ a, which is a single stream with a history of length 3.)

Perhaps we should update the comments in Propositional, and consider including the more complicated example above as something that the heuristics cannot handle out of the box?

ivanperez-keera commented 1 month ago

> let a = [True] ++ ([True] ++ ([True] ++ a)) is the result of appending multiple streams where each stream has a history of length 1. It shouldn't be confused with let a = [True, True, True] ++ a, which is a single stream with a history of length 3.

Should those be different?

RyanGlScott commented 1 month ago

The two streams are equivalent in terms of behavior, but Copilot stores each stream differently in its internal representation. (One could imagine an optimization that turns the let a = [True] ++ ([True] ++ ([True] ++ a)) into let a = [True, True, True] ++ a, but Copilot doesn't currently perform such an optimization.)
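To make the "equivalent in terms of behavior" point concrete, here is a minimal sketch that uses plain lazy Haskell lists as a stand-in for Copilot streams. This is only an analogy: `nested`, `flat`, and `samePrefix` are names made up for this illustration, and `++` here is the Prelude list append rather than Copilot's stream operator.

```haskell
-- Plain lazy Haskell lists standing in for Copilot streams (illustration
-- only: these are not Copilot's Stream type, and (++) is Prelude's append).

-- Three appends, each contributing one initial value.
nested :: [Bool]
nested = [True] ++ ([True] ++ ([True] ++ nested))

-- One append contributing three initial values.
flat :: [Bool]
flat = [True, True, True] ++ flat

-- Both lists are infinite, so compare a finite prefix only.
samePrefix :: Int -> Bool
samePrefix n = take n nested == take n flat

main :: IO ()
main = print (samePrefix 100)
```

Both definitions produce the same sequence of values. The difference discussed in this thread is only in how Copilot represents them internally, which this list-based sketch does not capture.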

Copilot.Theorem.What4's heuristic is sensitive to this internal representation, as it uses the maximum history length of all streams in the specification to determine how much work it needs to do for an inductive proof. Generally speaking, the longer the history of your streams, the more work the heuristic will do (and the more likely it is that the proof will cover all of the necessary base cases).
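As a rough illustration of why the amount of work matters, here is a toy brute-force k-induction check. It is not Copilot.Theorem.What4's implementation: `State`, `step`, `initial`, `holds`, and `kInduction` are hypothetical names, and the sketch assumes, purely for illustration, that the induction depth k is taken to be the maximum history length (1 for the nested definition, 3 for the flat one).

```haskell
-- Toy model only: this is NOT Copilot.Theorem.What4's implementation.
-- The state holds the three buffered values behind the stream `a`.
type State = (Bool, Bool, Bool)

-- Advancing one step rotates the buffers: the head is consumed and,
-- because `a` refers back to itself, re-enters at the end.
step :: State -> State
step (x, y, z) = (y, z, x)

-- All buffers are initialised to True, as in the example.
initial :: State
initial = (True, True, True)

-- The property under test: the current value of `a` is True.
holds :: State -> Bool
holds (x, _, _) = x

-- Every possible state, including ones unreachable from `initial`.
allStates :: [State]
allStates = [(x, y, z) | x <- bs, y <- bs, z <- bs]
  where bs = [False, True]

-- k-induction by brute force over the (tiny) state space.
kInduction :: Int -> Bool
kInduction k = baseCase && inductiveStep
  where
    -- Base case: the property holds on the first k states of the real trace.
    baseCase = all holds (take k (iterate step initial))
    -- Inductive step: from ANY state, k consecutive states satisfying the
    -- property must be followed by a state that satisfies it too.
    inductiveStep =
      and [ holds (iterate step s !! k)
          | s <- allStates
          , all holds (take k (iterate step s)) ]

main :: IO ()
main = mapM_ (\k -> putStrLn (show k ++ ": " ++ show (kInduction k))) [1, 2, 3]
```

In this toy model the check fails for k = 1 and k = 2 and succeeds for k = 3, because only three consecutive true values pin down all three buffers. Under the stated assumption, that would be consistent with the nested definition coming out as invalid even though the property is true.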

ivanperez-keera commented 1 month ago

Description

The example in copilot/examples/what4/Propositional.hs includes comments stating, for each property, whether it can or cannot be proven with Z3. Those comments are incorrect with respect to the current implementation.

Type

Bug: comments in example are incorrect.

Additional context

None.

Requester

Ivan Perez

Method to check presence of bug

Running the example in copilot/examples/what4/Propositional.hs produces an output that does not match the comments in the code. For example, the third example reads: https://github.com/Copilot-Language/copilot/blob/068c06dd7ab6e900e2e8728ecb1c3b6e94ba9ccb/copilot/examples/what4/Propositional.hs#L22-L26

but, when running the file, the output is:

```
*Main> main
Example 1: valid
Example 2: invalid
Example 3: valid
Example 4: valid
Example 5: valid
Example 6: valid
Example 7: valid
```

Expected result

The comments in and the output of running the file copilot/examples/what4/Propositional.hs match.

Desired result

The comments in and the output of running the file copilot/examples/what4/Propositional.hs match.

Proposed solution

Modify examples so that they are consistent with output, potentially duplicating them so that basic cases, which are handled by copilot-theorem, are shown separately from those that it cannot prove valid.

Further notes

Commit https://github.com/Copilot-Language/copilot/commit/4ae3f249fbf8c37c0909046150ff341d432bd042 changed how Copilot.Theorem.What4 handles inductive-style proofs, rendering these comments incorrect.

ivanperez-keera commented 1 month ago

Change Manager: Confirmed that the issue exists.

ivanperez-keera commented 1 month ago

Technical Lead: Confirmed that the issue should be addressed.

ivanperez-keera commented 1 month ago

Technical Lead: Issue scheduled for fixing in Copilot 4.1.

Fix assigned to: @RyanGlScott .

RyanGlScott commented 1 month ago

Implementor: Solution implemented, review requested.

ivanperez-keera commented 2 weeks ago

Change Manager: Verified that:

Solution is implemented:
- [X] The code proposed compiles and passes all tests. Details: Build log: https://app.travis-ci.com/github/Copilot-Language/copilot/builds/272793043
- [X] The solution proposed produces the expected result. Details: The examples have changed, so the evaluation is performed by visual inspection. Running the examples using the image below produces an output that matches the expectations described in comments in the file:
```
FROM ubuntu:focal

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update

RUN apt-get install --yes libz-dev
RUN apt-get install --yes git

RUN apt-get install --yes wget
RUN mkdir -p $HOME/.ghcup/bin
RUN wget https://downloads.haskell.org/~ghcup/0.1.19.2/x86_64-linux-ghcup-0.1.19.2 -O $HOME/.ghcup/bin/ghcup

RUN chmod a+x $HOME/.ghcup/bin/ghcup
ENV PATH=$PATH:/root/.ghcup/bin/
ENV PATH=$PATH:/root/.cabal/bin/
RUN apt-get install --yes curl
RUN apt-get install --yes gcc g++ make libgmp3-dev
RUN apt-get install --yes pkg-config
RUN apt-get install --yes z3

SHELL ["/bin/bash", "-c"]

RUN ghcup install ghc 9.4
RUN ghcup install cabal 3.2
RUN ghcup set ghc 9.4.8
RUN cabal update

CMD git clone $REPO \
    && cd $NAME \
    && git checkout $COMMIT \
    && cabal v1-sandbox init \
    && cabal v1-install alex happy --constraint='happy<2' \
    && cabal v1-install copilot**/ \
    && cabal v1-exec -- runhaskell copilot/examples/what4/Propositional.hs
```
Command (substitute variables based on new path after merge):
```
$ docker run -e "REPO=https://github.com/GaloisInc/copilot-1" -e "NAME=copilot-1" -e "COMMIT=b71e159738b215bb35bf5fbd1e073d4d25c30d00" -it copilot-verify-535
```
- [X] Implementation is documented. Details: The example modified includes documentation.
- [X] Change history is clear.
- [X] Commit messages are clear.
- [X] Changelogs are updated.
- [X] Examples are updated. Details: The change is to an example.
- [X] Required version bumps are evaluated. Details: No bump needed. Change is to an example.

ivanperez-keera commented 2 weeks ago

Change Manager: Implementation ready to be merged.

`copilot`: Results of `what4` example do not match expectation #535