Open GinoGiotto opened 3 weeks ago
When I first tried to replicate your results, I got errors about missing theorems, and the issue was that the set.lisp file was out of date relative to the set.mm file. ./build.sh will avoid overwriting set.lisp if it exists, but this means it can get out of date. If in doubt, delete the build and tmp folders and run ./make.sh again.
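A quick way to tell whether a rebuild is needed is to compare modification times. The following is a minimal sketch, not part of mm-hammer: the helper name is_stale is hypothetical, and the locations of set.mm and set.lisp are left as parameters because the thread does not state them.

```python
import os


def is_stale(source_path, derived_path):
    """Return True if derived_path is missing or older than source_path.

    Hypothetical helper: pass the set.mm path as source_path and the
    generated set.lisp path as derived_path.
    """
    if not os.path.exists(derived_path):
        return True
    return os.path.getmtime(derived_path) < os.path.getmtime(source_path)
```

If this returns True for the pair (set.mm, set.lisp), deleting the build and tmp folders and rebuilding is the safe option described above.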
hammer_prop
After the above, I was able to run it successfully resulting in:
hammer_prop $p |- ( ( -. ch -> ( ( -. th -> ta ) -> ch ) ) -> ( th -> ch ) ) $=
( wn wi notnotr a1d frege37 ja ) ADZBDCEAEZBAEZJDZABAFGBCAHI $.
which metamath-knife agrees is a correct proof.
hammer_walsh2
hammer_walsh2 $p |- ( ps -> ( ( ph -> ( ps -> ch ) ) -> ( ( -. ch -> ( ( -. th -> ta ) -> ph ) ) -> ( th -> ch ) ) ) ) $=
( wi wn ax-frege31 a1d a1i con1i pm2.21 pm2.5g wl-luk-imtrdi merlem4 expt
jad ) BABCFZCGZDGZEFZAFZFZDCFZFZAGZUEFZBUFSUBUDSGZUDFZUFUHCDCHIJUFUBUAGZU
DUFUAAUJUJUJFZUFUJUAUAHKJAUJLQUDUADCEMKNQJBRUCTFGZCFZUECULBOUCDCPNQ $.
produced in 1.4 min. In general, these things can fail or timeout and this product comes with no warranty. It's certainly not a guarantee that it will be able to prove propositional logic problems, because they are being given to prover9 in heavily encoded form along with a ton of other things.
hammer_times3i
I was able to replicate the wrong proof. This should not happen. It appears to be a result of the argument reordering in expressions such as co (that is, ( A F B )), which have their arguments in the order A, F, B in the encoding which is sent to the SMT solver but are in the order A, B, F in metamath due to the order of $f declarations. You can see the result of this in the proof display of metamath-exe:
1 cA=cA $f class A
2 cB=cmul $a class x.
3 cA=c2 $a class 2
4 cB=caddc $a class +
5 cF=c1 $a class 1
6 cF=co $a class ( 2 1 + )
7 cA=co $a class ( A ( 2 1 + ) x. )
8 cA=cA $f class A
9 cB=cmul $a class x.
10 cF=c2 $a class 2
11 11:cA=co $a class ( A 2 x. )
12 cB=caddc $a class +
13 cA=cA $f class A
14 cB=cmul $a class x.
15 cF=c1 $a class 1
16 16:cF=co $a class ( A 1 x. )
17 cB=co $a class ( ( A 2 x. ) ( A 1 x. ) + )
18 cA=cA $f class A
19 cB=caddc $a class +
20 cF=cA $f class A
21 21:cA=co $a class ( A A + )
22 cB=caddc $a class +
23 cF=cA $f class A
24 cC=co $a class ( ( A A + ) A + )
25 cA=cA $f class A
26 cB=c2 $a class 2
27 cC=c1 $a class 1
28 axi.1=hammer_times3i.1 $e |- A e. CC
29 axi.2=2cn $p |- 2 e. CC
30 axi.3=ax-1cn $a |- 1 e. CC
31 eqtri.1=adddii $p |- ( A x. ( 2 + 1 ) ) = ( ( A x. 2 ) + ( A
x. 1 ) )
32 cA=11 $a class ( A 2 x. )
33 cB=21 $a class ( A A + )
34 cC=16 $a class ( A 1 x. )
35 cD=cA $f class A
36 cF=caddc $a class +
37 cA=c2 $a class 2
38 cB=cA $f class A
39 cC=21 $a class ( A A + )
40 axi.1=2cn $p |- 2 e. CC
41 axi.2=hammer_times3i.1 $e |- A e. CC
42 cA=cA $f class A
43 2timesi.1=hammer_times3i.1 $e |- A e. CC
44 mulcomli.3=2timesi $p |- ( 2 x. A ) = ( A + A )
45 oveq1i.1=mulcomli $p
46 cA=c1 $a class 1
47 cB=cA $f class A
48 cC=cA $f class A
49 axi.1=ax-1cn $a |- 1 e. CC
50 axi.2=hammer_times3i.1 $e |- A e. CC
51 cA=cA $f class A
52 axi.1=hammer_times3i.1 $e |- A e. CC
53 mulcomli.3=mulid2i $p |- ( 1 x. A ) = A
54 oveq12i.2=mulcomli $p |- ( A x. 1 ) = A
55 eqtri.2=oveq12i $p
56 hammer_times3i=eqtri $p
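The mismatch between the two argument orders can be illustrated with a small sketch. This is not mm-hammer's code; the function names are hypothetical, and the only facts taken from the thread are the two orders (A, F, B in the encoding, A, B, F for the $f hypotheses of co).

```python
# Order in which co's arguments appear in the expression ( A F B ).
CO_SYNTAX_ORDER = ("A", "F", "B")
# Order of co's $f hypotheses in metamath, as stated in the thread.
CO_FLOAT_ORDER = ("A", "B", "F")


def to_proof_args(syntax_args):
    """Permute arguments from expression order (A, F, B) to $f order (A, B, F)."""
    by_var = dict(zip(CO_SYNTAX_ORDER, syntax_args))
    return [by_var[v] for v in CO_FLOAT_ORDER]


def render_co(proof_args):
    """Render a co term from arguments given in $f order (A, B, F)."""
    by_var = dict(zip(CO_FLOAT_ORDER, proof_args))
    return "( {A} {F} {B} )".format(**by_var)


# Skipping the permutation reproduces the malformed term from steps 3-6:
# render_co(["2", "+", "1"]) gives "( 2 1 + )",
# whereas render_co(to_proof_args(["2", "+", "1"])) gives "( 2 + 1 )".
```

Passing the encoding-order arguments straight through puts the operator in the B slot and the second operand in the F slot, which is the pattern visible in steps 3-6 of the display.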
hammer_axun
This one seems to be a bug in metamath-knife (which is being used as a library here). The normal-form output is
wph cv wph cv wcel vy vz vw vz cv vw cv wcel vw cv vx cv wcel wa vz cv vy
cv wcel wi wal wal wex wph wph wel wph cv wph cv wcel vy vz vw vz cv vw
cv wcel vw cv vx cv wcel wa vz cv vy cv wcel wi wal wal wex wph elirrv
pm2.21i ax-mp $.
which can immediately be seen to have an error, since wph cv is applying cv to wph, which makes no sense since ph is a wff and cv turns setvar into class. Debugging the input to metamath-knife reveals something like:
0 := vw[]
1 := vz[]
2 := cv[1]
3 := cv[0]
4 := wcel[2, 3]
5 := vx[]
6 := cv[5]
7 := wcel[3, 6]
8 := wa[4, 7]
9 := wph[]
10 := cv[9]
11 := wcel[10, 3]
so cv is being correctly applied to vw (that is, one of the dummy variables).
Actually I take it back regarding hammer_axun, the input to metamath-knife is incorrect:
0 := vy[]
1 := vz[]
2 := vw[]
3 := cv[1]
4 := cv[2]
5 := wcel[3, 4]
6 := vx[]
7 := cv[6]
8 := wcel[4, 7]
9 := wa[5, 8]
10 := cv[0]
11 := wcel[3, 10]
12 := wi[9, 11]
13 := wal[2, 12]
14 := wal[1, 13]
15 := wex[0, 14]
16 := wph[]
17 := cv[16]
18 := wcel[17, 17]
19 := wn[18]
20 := elirrv[16]
21 := wi[18, 15]
22 := pm2.21i[18, 15, 20]
23 := wel[16, 16]
24 := ax-mp[18, 15, 23, 22]
The incorrect application is step 17 here, which gets reordered to be first because of post-order traversal.
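To see why step 17 ends up at the front, the step table can be expanded by a plain post-order traversal. The sketch below is an illustration only, not metamath-knife's implementation; the step table is copied from the listing above, the leaf sorts are the ones stated in this thread, and the function names are hypothetical. The check covers only cv.

```python
# (label, argument step indices), copied from the listing above.
STEPS = [
    ("vy", []), ("vz", []), ("vw", []), ("cv", [1]), ("cv", [2]),
    ("wcel", [3, 4]), ("vx", []), ("cv", [6]), ("wcel", [4, 7]),
    ("wa", [5, 8]), ("cv", [0]), ("wcel", [3, 10]), ("wi", [9, 11]),
    ("wal", [2, 12]), ("wal", [1, 13]), ("wex", [0, 14]), ("wph", []),
    ("cv", [16]), ("wcel", [17, 17]), ("wn", [18]), ("elirrv", [16]),
    ("wi", [18, 15]), ("pm2.21i", [18, 15, 20]), ("wel", [16, 16]),
    ("ax-mp", [18, 15, 23, 22]),
]

# Sorts of the leaf steps: ph is a wff, the others are setvar variables.
LEAF_SORT = {"vx": "setvar", "vy": "setvar", "vz": "setvar",
             "vw": "setvar", "wph": "wff"}


def expand(steps, index):
    """Post-order expansion: emit all arguments first, then the step's label."""
    label, args = steps[index]
    tokens = []
    for arg in args:
        tokens.extend(expand(steps, arg))
    tokens.append(label)
    return tokens


def bad_cv_steps(steps):
    """Return indices of cv steps whose argument is not a setvar leaf."""
    bad = []
    for i, (label, args) in enumerate(steps):
        if label == "cv" and LEAF_SORT.get(steps[args[0]][0]) != "setvar":
            bad.append(i)
    return bad
```

Calling expand(STEPS, 24) walks into argument 18 first, which in turn walks into step 17, so the token list begins with wph cv, as in the normal-form output quoted earlier. bad_cv_steps(STEPS) singles out step 17.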
When I first tried to replicate your results, I got errors about missing theorems, and the issue was that the set.lisp file was out of date relative to the set.mm file. ./build.sh will avoid overwriting set.lisp if it exists but this means it can get out of date. If in doubt, delete the build and tmp folders and run ./make.sh again.
Ok, I deleted the build and tmp folders and ran ./make.sh again. After this, I tried hammer_prop and mm-hammer gave me a different proof than yours but a correct one:
hammer_prop $p |- ( ( -. ch -> ( ( -. th -> ta ) -> ch ) ) -> ( th -> ch ) ) $= ( wn wi pm2.21 con4d frege37 ja ) ADZBDZCEAEZBAEZJDZABJKFGBCAHI $.
Then I tried hammer_walsh2, but it gave me the same error message as before. After that, I tried hammer_walsh3 as well, and this time it gave me a wrong proof:
hammer_walsh3 $p |- ( ph -> ( ( -. ps -> ( ( -. ch -> th ) -> ( ph -> ta ) ) ) -> ( ( ta -> ps ) -> ( ch -> ps ) ) ) ) $= ( wi wn bj-0 com3l ) EBFZABGCGDFAEFFFZCBFZEBAKLFFZHI $.
If I understand correctly, to make sure that I get a good result I should delete the build and tmp folders and run ./make.sh each time before requesting a theorem. However, this is incredibly slow: not only does ./make.sh take a while to complete, but even deleting the build folder takes some time, as it contains almost a hundred thousand files.
Since these generated proofs seem to be incorrect for different reasons, it might be useful to automatically verify them after they are produced. If a proof is correct, then mm-hammer could output: "The above proof has been verified". If it's incorrect it could output: "The above proof is incorrect, report it on https://github.com/digama0/mm-hammer/issues". It might also be worth making a script to prove 100 random theorems from set.mm and check (when they are produced) whether their generated proofs are correct.
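The batch check suggested here could look roughly like the following. This is only a sketch of the idea: check_generated_proofs, prove, and verify are hypothetical names, and both prove and verify are supplied by the caller, since no particular mm-hammer or verifier command line is assumed.

```python
import random


def check_generated_proofs(labels, prove, verify, sample_size=100, seed=0):
    """Sample theorem labels, try to prove each, and tally the outcomes.

    prove(label) is a caller-supplied function returning a proof or None
    (failure or timeout). verify(label, proof) is a caller-supplied function
    returning True if an external verifier accepts the proof.
    """
    rng = random.Random(seed)
    labels = list(labels)
    sample = rng.sample(labels, min(sample_size, len(labels)))
    tally = {"verified": [], "incorrect": [], "no_proof": []}
    for label in sample:
        proof = prove(label)
        if proof is None:
            tally["no_proof"].append(label)
        elif verify(label, proof):
            tally["verified"].append(label)
        else:
            tally["incorrect"].append(label)
    return tally
```

Labels landing in the "incorrect" bucket are the ones worth reporting on the issue tracker; the "no_proof" bucket separates search failures from soundness problems.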
No, you do not need to run make.sh every time, only when set.mm has changed, or you want to update mm-hammer. If you don't run it then it will only be able to find theorems existing in the version of set.mm it was trained on, which is probably fine for most purposes.
There is one specific bug I am tracking down (in metamath-knife) that will fix the issues with wrong proofs. I don't think it is necessary to add runtime validation since it should be correct by construction. Reporting bugs via the issue tracker like this is sufficient. (For the future though, please don't report 4 bugs on one issue. It makes it difficult to track things.)
No, you do not need to run make.sh every time, only when set.mm has changed, or you want to update mm-hammer.
Got it, thanks.
(For the future though, please don't report 4 bugs on one issue. It makes it difficult to track things.)
Ok, sorry for that. My default philosophy is: "before building a case, make sure to collect enough evidence", but I unnecessarily overdid it here.
I followed the instructions in the README file: I made sure I had Rust and Lisp installed, I executed ./make.sh, it returned a long stream of text, and the example theorem hlopdNEW worked as expected. All good.

However, apart from hlopdNEW, most other statements either generate incorrect proofs or fail to deliver a proof at all. Below, I show a few examples:

The statement *_axun is another way of expressing the axiom of union. This formulation is not present in set.mm, so that single-step proofs are avoided. The proof above is provided by me by hand and it is approved by the verifiers. The proof below is generated by mm-hammer and it fails the verifiers:

Second example, a simple theorem about complex numbers:

The above proof is generated by mm-hammer and it fails the verifiers. Below, I provide a correct proof that I created manually:

Third example. Classical propositional calculus. The statement below is a valid theorem of logic:

The above proof is provided by me and it's correct; the proof below is generated by mm-hammer and it's incorrect:

Results about propositional calculus are oddly inconsistent. Theorem walsh3 generated a correct proof:

While walsh2 even failed to return a proof:

output:

But I know that walsh2 is true since I proved it by hand:

This surprises me. In my experience, Prover9 was able to find very long proofs of propositional logic, and theorem walsh2 doesn't really require a long one, as I showed above. Is there an mm-hammer setting that would allow it to search deeper (presumably with performance costs)?

The error messages aren’t very helpful either. I am aware that mm-hammer is probably best suited for simple statements, but when it fails, it would be beneficial to receive more information about why it failed, so that I can get a better idea about the types of problems this tool can handle.