fwaris / LLMLogicPuzzle


WheelsAround_LLMSMTLIB.txt gives wrong output from Z3 #1

Closed: v-jaswel closed this issue 3 months ago

v-jaswel commented 3 months ago

In your earlier article, you say:

"For the hard task, GPT 4 turbo generated a response that looked as if the model had solved the puzzle. However, on closer inspection, and after solving the puzzle with an SMT Solver, it was discovered that the response was entirely incorrect. The output was a hallucination."

The output in question:

- David is 10, riding a Pink bike, and has a Roast Beef sandwich.
- Gregory is 14, riding a Red bike, and has a Cheese sandwich.
- Ivan is 12, riding a Blue bike, and has a Turkey sandwich.
- Eric is 11, riding a Yellow bike, and has a Peanut Butter sandwich. 

The correct solution, as you note here, and which I checked by solving the puzzle manually on ahapuzzles, is:

- Ivan is 10, riding a Blue bike, and has a Roast Beef sandwich.
- Gregory is 14, riding a Yellow bike, and has a Cheese sandwich.
- Eric is 12, riding a Red bike, and has a Peanut Butter sandwich.
- David is 11, riding a Pink bike, and has a Turkey sandwich.

In your more recent article, you say:

"SMTLIB Formula After some simple prompt engineering, I got GPT-4 to generate the following, correct representation of the above puzzle in SMTLIB format: (code omitted) I was able to run this program in Z3 and obtain the correct solution - without any modification."

I don't have access to GPT 4, but if I simply copy the SMTLIB code listed in that article, paste it into a file, and run it with Z3 for 64-bit Windows (version 4.13.0), I get this output:

sat
(
  (define-fun Pink () Int
    1)
  (define-fun Red () Int
    2)
  (define-fun RoastBeef () Int
    1)
  (define-fun Yellow () Int
    4)
  (define-fun Age10 () Int
    1)
  (define-fun Turkey () Int
    3)
  (define-fun David () Int
    1)
  (define-fun Eric () Int
    4)
  (define-fun PeanutButter () Int
    4)
  (define-fun Ivan () Int
    3)
  (define-fun Age12 () Int
    3)
  (define-fun Age14 () Int
    2)
  (define-fun Gregory () Int
    2)
  (define-fun Age11 () Int
    4)
  (define-fun Cheese () Int
    2)
  (define-fun Blue () Int
    3)
)

This looks to me like it matches the incorrect output from GPT 4 described previously.
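
As an aside, to make this kind of comparison easier, one can ask Z3 for just the variables of interest instead of reading the whole model dump, e.g. by appending the following after the file's (check-sat) (assuming, as the output above suggests, that model generation is on):

(get-value (David Gregory Ivan Eric))
(get-value (Pink Red Blue Yellow))
(get-value (RoastBeef Cheese Turkey PeanutButter))
(get-value (Age10 Age11 Age12 Age14))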

The SMTLIB code in this repo is identical to what's in the article.

My best guess is that the SMTLIB code in this repo and in the article is the incorrect code that was generated by GPT 4 and resulted in the incorrect output. That said, I can't see the error in the SMTLIB code myself, though I haven't worked with SMTLIB before.

Apologies if I'm missing anything obvious here, as I'm new to all of this.

v-jaswel commented 3 months ago

Found it. This assertion is wrong:

(assert (= (- PeanutButter Turkey) 1))

The clue is "The boy with the Peanut Butter sandwich is directly before the boy with the Turkey sandwich."

However, the assertion says "peanut butter minus turkey equals 1." For example, if Turkey is in place 1 and Peanut Butter is in place 2, that satisfies the assertion but not the clue.
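
A minimal standalone check confirms this (not taken from the file in this repo; the 1-4 position range is just assumed from the puzzle):

; toy check: put Turkey in place 1 and Peanut Butter in place 2
(declare-const Turkey Int)
(declare-const PeanutButter Int)
(assert (and (>= Turkey 1) (<= Turkey 4) (>= PeanutButter 1) (<= PeanutButter 4)))
(assert (distinct Turkey PeanutButter))
(assert (= Turkey 1))
(assert (= PeanutButter 2))
(assert (= (- PeanutButter Turkey) 1)) ; the buggy clue-4 assertion
(check-sat) ; Z3 answers sat, even though Peanut Butter is not directly before Turkey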

If I change the assertion to:

(assert (= (- Turkey PeanutButter) 1))

I get the following output from Z3 that looks correct:

sat
(
  (define-fun Pink () Int
    4)
  (define-fun Red () Int
    3)
  (define-fun RoastBeef () Int
    1)
  (define-fun Yellow () Int
    2)
  (define-fun Age10 () Int
    1)
  (define-fun Eric () Int
    3)
  (define-fun David () Int
    4)
  (define-fun Turkey () Int
    4)
  (define-fun PeanutButter () Int
    3)
  (define-fun Ivan () Int
    1)
  (define-fun Age12 () Int
    3)
  (define-fun Age14 () Int
    2)
  (define-fun Gregory () Int
    2)
  (define-fun Age11 () Int
    4)
  (define-fun Cheese () Int
    2)
  (define-fun Blue () Int
    1)
)

It's interesting that GPT 4's incorrect response matches the model Z3 produces from the incorrect SMTLIB code.

fwaris commented 3 months ago

thanks for checking!

I did do a cursory check but apparently not detailed enough.

I will take another look and update the article accordingly.

fwaris commented 3 months ago

I have updated the article to reflect your finding. Thanks again.

v-jaswel commented 3 months ago

My pleasure, thank you. I'm interested in the idea of using AI code gen with languages like SMTLIB, F*, etc. where we can prove correctness (as I understand it), and I hope you'll write more on this subject.

v-jaswel commented 3 months ago

I was wondering why Z3 was able to find a model for the incorrect SMTLIB code at all, when I would have expected it to return unsat. It turns out that if you reverse clue 4 so that Turkey is directly before Peanut Butter, then the solution GPT 4 gave you is correct (i.e. it fits all the clues):

  1. David/10/Pink/Roast Beef
  2. Gregory/14/Red/Cheese
  3. Ivan/12/Blue/Turkey
  4. Eric/11/Yellow/Peanut Butter

So the incorrect output might have arisen simply from GPT misunderstanding clue 4. Clues 4 and 8 reference the cyclists' positions without using the word "position" explicitly, which might be the issue.
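
One way to confirm this with Z3 itself, sketched here under the assumption that the declarations and clue assertions from WheelsAround_LLMSMTLIB.txt are kept as-is (and that the file doesn't end with (exit)), is to pin GPT 4's proposed assignment and check it against the encoding:

; sketch: append at the end of the existing file, after its own (check-sat)/(get-model)
(assert (= David 1)) (assert (= Gregory 2)) (assert (= Ivan 3)) (assert (= Eric 4))
(assert (= Age10 1)) (assert (= Age14 2)) (assert (= Age12 3)) (assert (= Age11 4))
(assert (= Pink 1)) (assert (= Red 2)) (assert (= Blue 3)) (assert (= Yellow 4))
(assert (= RoastBeef 1)) (assert (= Cheese 2)) (assert (= Turkey 3)) (assert (= PeanutButter 4))
(check-sat) ; sat with the original (reversed) clue 4, unsat once clue 4 is corrected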

I tried running the problem through GPT 3.5 twice. The first time it gave me a solution with two red bikes, no one in position 2, and one cyclist in position 5. The second time it did better, but it still gave an incorrect solution and forgot to mention the cyclists' ages.

I then tried running it through GPT 4. It also gives me incorrect solutions (e.g. with David riding the red bike). If I ask it for SMTLIB code, it gives me code that is either incomplete (not all clues present) or buggy (Z3 complains). But I'm new to prompt engineering.

If you get a moment, would you mind sharing the prompt you used to generate the SMTLIB code? I'd like to see whether I can get GPT to correctly interpret clue 4 (maybe by adding the word "position") and whether it will then generate the correct code.

fwaris commented 3 months ago

The prompt is in the LLMSandbox.fsx file in this repo (also linked at the end of the LinkedIn article).

Yes, prompts matter (given the current state of the models).

One idea is to perform reflection, i.e. feed the model's generated SMTLIB code back to it and ask it to check and correct its own work.

Another idea is to run the same prompt multiple times; if multiple runs give you the same answer, that is likely the right answer (of course, nothing is guaranteed).

Potentially one can also fine-tune a smaller model to generate accurate SMTLIB code.

PS: I considered F* briefly but when I tested with SMTLIB and saw promising responses, I stuck with it. I think there are probably more examples of SMTLIB in LLM training data.