mm-hammer produces wrong proofs

I followed the instructions in the README file: I made sure I had Rust and Lisp installed, I executed ./make.sh, it returned a long stream of text, and the example theorem hlopdNEW worked as expected. All good.

However, apart from hlopdNEW, most other statements either generate incorrect proofs or fail to deliver a proof at all. Below, I show a few examples:

${
  $d x w y z $.
  my_axun $p |- E. y A. z A. w ( ( z e. w /\ w e. x ) -> z e. y ) $=
    ( wel wa wex wi wal ax-un 19.23v bicomi albii exbii mpbi ) CDEDAEFZDGCBEZHZ
    CIZBGPQHDIZCIZBGABCDJSUABRTCTRPQDKLMNO $.
$}

The statement *_axun is another way of expressing the axiom of union. This formulation is not present in set.mm, so that single-step proofs are avoided. The proof above is provided by me by hand and it is approved by the verifiers. The proof below is generated by mm-hammer and it fails the verifiers:

${
  $d x w y z $.
  hammer_axun $p |- E. y A. z A. w ( ( z e. w /\ w e. x ) -> z e. y ) $=
    ( wph cv wcel wa wi wal wex wel elirrv pm2.21i ax-mp ) EFZPGZBCDCFZDFZGSA
    FGHRBFGIJJKZEELQTEMNO $.
$}

Second example, a simple theorem about complex numbers:

  ${
    hammer_times3i.1 $e |- A e. CC $.
    hammer_times3i $p |- ( A x. ( 2 + 1 ) ) = ( ( A + A ) + A ) $=
      ( c2 caddc c1 co 2cn ax-1cn adddii 2timesi mulcomli mulid2i oveq12i eqtri
      cmul ) AOCDEFFZAOCFZDAOEFZFZADAFZDAFZACEBGHIQTRADCATGBABJKEAAHBABLKMN $.
  $}

The above proof is generated by mm-hammer and it fails the verifiers. Below, I provide a correct proof that I created manually:

  ${
    my_times3i.1 $e |- A e. CC $.
    my_times3i $p |- ( A x. ( 2 + 1 ) ) = ( ( A + A ) + A ) $=
      ( c2 c1 caddc co c3 2p1e3 oveq2i 1p1e2 oveq1i eqtr2i ax-1cn addcli adddii
      cmul mulid1i 3eqtri mulcomli 2timesi eqtri ) ACDEFZPFAGPFZADDEFZPFZAEFZAA
      EFZAEFUBGAPHIUCAUDDEFZPFUEADPFZEFUFGUHAPUHUBGUDCDEJKHLIAUDDBDDMMNZMOUIAUE
      EABQIRUEUGAEUECAPFZUGUDAUKUJBUDCAPJKSABTUAKR $.
  $}

Third example. Classical propositional calculus. The statement below is a valid theorem of logic:

  my_prop $p |- ( ( -. ch -> ( ( -. th -> ta ) -> ch ) ) -> ( th -> ch ) ) $=
    ( wn wi conax1 con1i pm2.5g ax-1 ja ) ADZBDCEZAEBAEZMKBAFGLAMMLBACHGABIJJ
    $.

The above proof is provided by me and it's correct, the proof below is generated by mm-hammer and it's incorrect:

  hammer_prop $p |- ( ( -. ch -> ( ( -. th -> ta ) -> ch ) ) -> ( th -> ch ) ) $=
    ( wn wi frege44 ) ADZBDZCEZAEZEZJF $.

Results about propositional calculus are oddly inconsistent, theorem walsh3 generated a correct proof:

hammer_walsh3 $p |- ( ph -> ( ( -. ps -> ( ( -. ch -> th ) -> ( ph -> ta ) ) ) -> ( ( ta -> ps ) -> ( ch -> ps ) ) ) ) $=
   ( wn wi notnotr a1d a1i frege36 con1d com12 merlem4 jad ) ABFZCFDGZAEGZGZ
   EBGZCBGZGZPFZUBGZAUCUATUCBCBHIIJAQRUBQFZUBGZAUEUATCUEBCBQCQPCDKILMIJAAEUB
   AUBKEUBGZABCENJOOO $.

While walsh2 even failed to return a proof:

  hammer_walsh2 $p |- ( ps -> ( ( ph -> ( ps -> ch ) ) -> ( ( -. ch -> ( ( -. th -> ta ) -> ph ) ) -> ( th -> ch ) ) ) ) $=
   ? $.

output:

$ cargo run --release build/set.mm + example/hlopdNEW.mm hammer_walsh2 2> /dev/null
Vampire failed on tmp/ahammer_walsh2_thm.p.prem960, not reconstructing
Vampire failed on tmp/ahammer_walsh2_thm.p.prem480, not reconstructing
Vampire failed on tmp/ahammer_walsh2_thm.p.prem240, not reconstructing
Vampire failed on tmp/ahammer_walsh2_thm.p.prem120, not reconstructing
Vampire failed on tmp/ahammer_walsh2_thm.p.prem80, not reconstructing
Vampire failed on tmp/ahammer_walsh2_thm.p.prem40, not reconstructing

But I know that walsh2 is true since I proved it by hand:

  my_walsh2 $p |- ( ps -> ( ( ph -> ( ps -> ch ) ) -> ( ( -. ch -> ( ( -. th -> ta ) -> ph ) ) -> ( th -> ch ) ) ) ) $=
    ( wi wn notnotr ax-1 2a1dd syl pm2.5g con1i com23 com34 expi ja com13 com12
    id ) ABCFZFBCGZDGEFZAFZFZDCFZFZAUABUGFUEBAGZUFUBUDBUHUFFZFZUBGCUJCHCBCUHDCB
    IJKUCAUJUCGZBDUHCUKDBUHCFZDBULFZFUCDUMELMNOABUIABGUFLPQQRUABCUEDUATJQS $.

This surprises me, in my experience Prover9 was able to find very long proofs of propositional logic and theorem walsh2 doesn't really require a long one as I showed above. Is there a mm-hammer setting that would allow to search deeper (presumably with performance costs)?

The error messages aren’t very helpful either. I am aware that mm-hammer is probably best suited for simple statements, but when it fails, it would be beneficial to receive more information about why it failed, so that I can get a better idea about the types of problems this tool can handle.

When I first tried to replicate your results, I got errors about missing theorems, and the issue was that the set.lisp file was out of date relative to the set.mm file. ./build.sh will avoid overwriting set.lisp if it exists but this means it can get out of date. If in doubt, delete the build and tmp folders and run ./make.sh again.

`hammer_prop`

After the above, I was able to run it successfully resulting in:

hammer_prop $p |- ( ( -. ch -> ( ( -. th -> ta ) -> ch ) ) -> ( th -> ch ) ) $=
  ( wn wi notnotr a1d frege37 ja ) ADZBDCEAEZBAEZJDZABAFGBCAHI $.

which metamath-knife agrees is a correct proof.

`hammer_walsh2`

hammer_walsh2 $p |- ( ps -> ( ( ph -> ( ps -> ch ) ) -> ( ( -. ch -> ( ( -. th -> ta ) -> ph ) ) -> ( th -> ch ) ) ) ) $=
      ( wi wn ax-frege31 a1d a1i con1i pm2.21 pm2.5g wl-luk-imtrdi merlem4 expt
      jad ) BABCFZCGZDGZEFZAFZFZDCFZFZAGZUEFZBUFSUBUDSGZUDFZUFUHCDCHIJUFUBUAGZU
      DUFUAAUJUJUJFZUFUJUAUAHKJAUJLQUDUADCEMKNQJBRUCTFGZCFZUECULBOUCDCPNQ $.

produced in 1.4 min. In general, these things can fail or timeout and this product comes with no warranty. It's certainly not a guarantee that it will be able to prove propositional logic problems, because they are being given to prover9 in heavily encoded form along with a ton of other things.

`hammer_times3i`

I was able to replicate the wrong proof. This should not happen. It appears to be a result of the argument reordering in expressions such as co (that is, ( A F B )), which have their arguments in the order A, F, B in the encoding which is sent to the SMT solver but are in the order A, B, F in metamath due to the order of $f declarations. You can see the result of this in the proof display of metamath-exe:

 1     cA=cA                      $f class A
 2     cB=cmul                    $a class x.
 3       cA=c2                      $a class 2
 4       cB=caddc                   $a class +
 5       cF=c1                      $a class 1
 6     cF=co                      $a class ( 2 1 + )
 7   cA=co                      $a class ( A ( 2 1 + ) x. )
 8       cA=cA                      $f class A
 9       cB=cmul                    $a class x.
10       cF=c2                      $a class 2
11  11:cA=co                      $a class ( A 2 x. )
12     cB=caddc                   $a class +
13       cA=cA                      $f class A
14       cB=cmul                    $a class x.
15       cF=c1                      $a class 1
16  16:cF=co                      $a class ( A 1 x. )
17   cB=co                      $a class ( ( A 2 x. ) ( A 1 x. ) + )
18       cA=cA                      $f class A
19       cB=caddc                   $a class +
20       cF=cA                      $f class A
21  21:cA=co                      $a class ( A A + )
22     cB=caddc                   $a class +
23     cF=cA                      $f class A
24   cC=co                      $a class ( ( A A + ) A + )
25     cA=cA                      $f class A
26     cB=c2                      $a class 2
27     cC=c1                      $a class 1
28     axi.1=hammer_times3i.1     $e |- A e. CC
29     axi.2=2cn                  $p |- 2 e. CC
30     axi.3=ax-1cn               $a |- 1 e. CC
31   eqtri.1=adddii             $p |- ( A x. ( 2 + 1 ) ) = ( ( A x. 2 ) + ( A
                                                                       x. 1 ) )
32     cA=11                      $a class ( A 2 x. )
33     cB=21                      $a class ( A A + )
34     cC=16                      $a class ( A 1 x. )
35     cD=cA                      $f class A
36     cF=caddc                   $a class +
37       cA=c2                      $a class 2
38       cB=cA                      $f class A
39       cC=21                      $a class ( A A + )
40       axi.1=2cn                  $p |- 2 e. CC
41       axi.2=hammer_times3i.1     $e |- A e. CC
42         cA=cA                      $f class A
43         2timesi.1=hammer_times3i.1 $e |- A e. CC
44       mulcomli.3=2timesi         $p |- ( 2 x. A ) = ( A + A )
45     oveq1i.1=mulcomli          $p 
46       cA=c1                      $a class 1
47       cB=cA                      $f class A
48       cC=cA                      $f class A
49       axi.1=ax-1cn               $a |- 1 e. CC
50       axi.2=hammer_times3i.1     $e |- A e. CC
51         cA=cA                      $f class A
52         axi.1=hammer_times3i.1     $e |- A e. CC
53       mulcomli.3=mulid2i         $p |- ( 1 x. A ) = A
54     oveq12i.2=mulcomli         $p |- ( A x. 1 ) = A
55   eqtri.2=oveq12i            $p 
56 hammer_times3i=eqtri       $p

`hammer_axun`

This one seems to be a bug in metamath-knife (which is being used as a library here). The normal-form output is

      wph cv wph cv wcel vy vz vw vz cv vw cv wcel vw cv vx cv wcel wa vz cv vy
      cv wcel wi wal wal wex wph wph wel wph cv wph cv wcel vy vz vw vz cv vw
      cv wcel vw cv vx cv wcel wa vz cv vy cv wcel wi wal wal wex wph elirrv
      pm2.21i ax-mp $.

which can immediately be seen to have an error since wph cv is applying cv to wph which makes no sense since ph is a wff and cv turns setvar into class. Debugging the input to metamath-knife reveals something like:

0 := vw[]
1 := vz[]
2 := cv[1]
3 := cv[0]
4 := wcel[2, 3]
5 := vx[]
6 := cv[5]
7 := wcel[3, 6]
8 := wa[4, 7]
9 := wph[]
10 := cv[9]
11 := wcel[10, 3]

so cv is being correctly applied to vw (that is, one of the dummy variables).

Actually I take it back regarding hammer_axun, the input to metamath-knife is incorrect:

0 := vy[]
1 := vz[]
2 := vw[]
3 := cv[1]
4 := cv[2]
5 := wcel[3, 4]
6 := vx[]
7 := cv[6]
8 := wcel[4, 7]
9 := wa[5, 8]
10 := cv[0]
11 := wcel[3, 10]
12 := wi[9, 11]
13 := wal[2, 12]
14 := wal[1, 13]
15 := wex[0, 14]
16 := wph[]
17 := cv[16]
18 := wcel[17, 17]
19 := wn[18]
20 := elirrv[16]
21 := wi[18, 15]
22 := pm2.21i[18, 15, 20]
23 := wel[16, 16]
24 := ax-mp[18, 15, 23, 22]

The incorrect application is step 17 here, which gets reordered to be first because of post-order traversal.

When I first tried to replicate your results, I got errors about missing theorems, and the issue was that the set.lisp file was out of date relative to the set.mm file. ./build.sh will avoid overwriting set.lisp if it exists but this means it can get out of date. If in doubt, delete the build and tmp folders and run ./make.sh again.

Ok, I deleted the build and tmp folders and ran ./make.sh again. After this, I tried hammer_prop and mm-hammer gave me a different proof than yours but a correct one:

 hammer_prop $p |- ( ( -. ch -> ( ( -. th -> ta ) -> ch ) ) -> ( th -> ch ) ) $=
   ( wn wi pm2.21 con4d frege37 ja ) ADZBDZCEAEZBAEZJDZABJKFGBCAHI $.

Then I tried hammer_walsh2, but it gave me the same error message as before. After, I tried hammer_walsh3 as well, and this time it gave me a wrong proof:

hammer_walsh3 $p |- ( ph -> ( ( -. ps -> ( ( -. ch -> th ) -> ( ph -> ta ) ) ) -> ( ( ta -> ps ) -> ( ch -> ps ) ) ) ) $=
  ( wi wn bj-0 com3l ) EBFZABGCGDFAEFFFZCBFZEBAKLFFZHI $.

If I understand correctly, to make sure that I get a good result I should delete the build and tmp folders and run ./make.sh each time before requesting a theorem. However this is so incredibly slow, not only the ./make.sh takes a while to complete, but even deleting the build folder takes some time as it contains almost a hundred thousand files.

Since these generated proofs seem to be incorrect for different reasons, it might be useful to automatically verify them after they are produced. If a proof is correct, then mm-hammer could output: "The above proof has been verified". If it's incorrect it could output: "The above proof is incorrect, report it on https://github.com/digama0/mm-hammer/issues". It might also be worth making a script to prove 100 random theorems from set.mm and check (when they are produced) whether their generated proofs are correct.

No, you do not need to run make.sh every time, only when set.mm has changed, or you want to update mm-hammer. If you don't run it then it will only be able to find theorems existing in the version of set.mm it was trained on, which is probably fine for most purposes.

There is one specific bug I am tracking down (in metamath-knife) that will fix the issues with wrong proofs. I don't think it is necessary to add runtime validation since it should be correct by construction. Reporting bugs via the issue tracker like this is sufficient. (For the future though, please don't report 4 bugs on one issue. It makes it difficult to track things.)

No, you do not need to run make.sh every time, only when set.mm has changed, or you want to update mm-hammer.

Got it, thanks.

(For the future though, please don't report 4 bugs on one issue. It makes it difficult to track things.)

Ok, sorry for that. My default philosophy is: "before building a case, make sure to collect enough evidence", but I unnecessarily overdid it here.

digama0 / mm-hammer