gabegrand / world-models


Cannot get Church play space to run the first example #3

Closed. murphyk closed this issue 10 months ago.

murphyk commented 12 months ago

I am using https://v1.probmods.org/play-space.html. I cut and pasted the tug-of-war example, but I get the error "query not defined".

[Screenshot, 2023-07-20: the "query not defined" error in the Church play space]
gabegrand commented 11 months ago

Hi @murphyk, have you tried copy/pasting the tug-of-war world-model.scm file into the Church play space? That one should work out-of-box.

From your screenshot, it looks like you're working with prompt.scm. While the prompts contain example language-to-Church translations, they aren't standalone executable programs. Instead, you can substitute individual condition and query expressions (like the ones in that file) into the relevant sections of the world model file, like so:

    ;; -- CONDITIONING STATEMENTS --
    (condition
      (and
        ;; Condition: Tom won against John.
        (won-against '(tom) '(john))
        ;; Condition: John and Mary won against Tom and Sue.
        (won-against '(john mary) '(tom sue))
        ;; YOUR NEW CONDITIONING STATEMENTS BELOW
        ...
        ))

    ;; -- QUERY STATEMENT --
    ;; Query: How strong is Mary?
    (strength 'mary)
    ;; TODO: REPLACE WITH A DIFFERENT QUERY
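
For example, adding a new observation and swapping in a different query (the specific sentence and its translation below are hypothetical, just to illustrate the substitution) might look like:

    ;; -- CONDITIONING STATEMENTS --
    (condition
      (and
        ;; Condition: Tom won against John.
        (won-against '(tom) '(john))
        ;; Condition: John and Mary won against Tom and Sue.
        (won-against '(john mary) '(tom sue))
        ;; Condition: Sue won against Mary. (new, hypothetical)
        (won-against '(sue) '(mary))))

    ;; -- QUERY STATEMENT --
    ;; Query: How strong is Sue?
    (strength 'sue)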

This is a bit of a manual process for now. If there's interest, we've discussed releasing a more end-to-end demo. In the meantime, I've updated the README to clarify the role of prompt.scm files.

gabegrand commented 11 months ago

Thanks, by the way, for your interest in our work. I've learned a lot over the years from your textbooks and would welcome any thoughts on how we can improve the presentation of this material and/or on future directions that you find salient.

murphyk commented 11 months ago

I was able to run world-model.scm. But the main point of your paper is to generate these statements from natural language. What prompts did you feed into ChatGPT to generate prompt.scm? The workflow is rather unclear....

gabegrand commented 11 months ago

To answer your question: Each prompt.scm file is the prompt. The full workflow would look something like:

1. In OpenAI Playground or ChatGPT: Copy/paste prompt.scm and append a string to translate.

[prompt.scm] + "Condition/Query: A new natural language string." ==GPT Completion==> [Church code snippet]

2. In Church Play Space: 

[world-model.scm] + [Church code snippet] ==Church inference==> [Inference results]
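
For instance (an illustrative example, not taken verbatim from the paper), appending a new line like ";; Condition: Mary won against Tom." to the end of prompt.scm should elicit a completion along the lines of:

    ;; Condition: Mary won against Tom.
    (condition (won-against '(mary) '(tom)))

which you would then fold into the conditioning block of world-model.scm.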

Note that this is a bit of a manual process for now, since it requires substituting the Church code snippet into the appropriate spot in the world model. That's part of the reason this workflow wasn't especially well formalized in the docs. We have written more automated pipelines to produce some of the examples in the paper, which I hope to release at some point soon.

murphyk commented 11 months ago

Oh, I see. So I tried appending this to prompt.scm in GPT-4:

;; Condition: Zach beat Alice in the first game
;; Condition: Alice lost the next game as well
;; Condition: and she lost the third and final one.
;; Query: How much better is Zach than Alice?

and it generated

;; First game: Zach beat Alice
(define first-game-result (won-against '(zach) '(alice)))

;; Since the other conditions imply that Alice lost all the games, we can assume the next two games have the same result.
(define second-game-result first-game-result)
(define third-game-result first-game-result)

;; Assuming all games have the same result, we can simply check the difference in strength between Zach and Alice.
(define (strength-difference player-1 player-2)
  (- (strength player-1) (strength player-2)))

(query (strength-difference 'zach 'alice))

which doesn't run because of the presence of the word query. If I omit query and replace the query statement with just (strength-difference 'zach 'alice) inside world-model.scm, it does make some kind of plot, shown below. But the plot is symmetrical and centered on 0, even though Zach beat Alice 3 times, presumably because the program doesn't condition on anything (ChatGPT forgot to generate the condition statements). So it's a bit fragile :(

[Screenshot, 2023-08-08: density plot of the query output, symmetric and centered on 0]

Here is the modified world-model.scm I ran:
;; -- Tug-of-war in Church --
;; Author: Gabe Grand (grandg@mit.edu)
;; Adapted from https://v1.probmods.org/conditioning.html#example-reasoning-about-the-tug-of-war

;; -- WORLD MODEL --
(define (run-world-model)
  (rejection-query

    ;; This Church program models a tug-of-war game between teams of players.
    ;; Each player has a strength, with strength value 50 being about average.
    (define strength (mem (lambda (player) (gaussian 50 20))))

    ;; Each player has an intrinsic laziness frequency.
    (define laziness (mem (lambda (player) (uniform 0 1))))

    ;; The team's strength is the sum of the players' strengths.
    ;; When a player is lazy in a match, they pull with half their strength.
    (define (team-strength team)
      (sum
        (map (lambda (player)
               (if (flip (laziness player))
                   (/ (strength player) 2)
                   (strength player)))
          team)))

    ;; The winner of the match is the stronger team.
    ;; Returns true if team-1 won against team-2, else false.
    (define (won-against team-1 team-2)
      (> (team-strength team-1) (team-strength team-2)))

    ;; First game: Zach beat Alice
    (define first-game-result (won-against '(zach) '(alice)))

    ;; Since the other conditions imply that Alice lost all the games, we can assume the next two games have the same result.
    (define second-game-result first-game-result)
    (define third-game-result first-game-result)

    ;; Assuming all games have the same result, we can simply check the difference in strength between Zach and Alice.
    (define (strength-difference player-1 player-2)
      (- (strength player-1) (strength player-2)))

    (strength-difference 'zach 'alice)
))

;; -- UTILITY FUNCTIONS --
(define (count bool-list)
  (sum (map boolean->number bool-list)))

(define (argmax f lst)
  (if (null? (cdr lst))
    (car lst)
    (let ((higher-items (filter (lambda (x) (> (f x) (f (car lst)))) (cdr lst))))
      (if (null? higher-items)
        (car lst)
        (argmax f higher-items)))))

(define (argmin f lst)
  (if (null? (cdr lst))
    (car lst)
    (let ((lower-items (filter (lambda (x) (< (f x) (f (car lst)))) (cdr lst))))
      (if (null? lower-items)
        (car lst)
        (argmin f lower-items)))))

;; -- VISUALIZE QUERY --
(density (repeat 1000 run-world-model) "Zachs strength" true)
gabegrand commented 11 months ago

Hi Kevin, thank you for sharing your example -- this brings up a few very interesting points.

(1) Have you tried feeding in translations iteratively, instead of all at once? In the paper, we always translate a single utterance at a time; e.g.,

;; Condition: Zach beat Alice in the first game
==> (condition (won-against '(zach) '(alice)))

;; Condition: Alice lost the next game as well
==> (condition (won-against '(unknown-player) '(alice)))

...

This encourages the model to translate the literal meaning of each utterance (to the extent that this is possible, given the vagueness inherent in statements like "Alice lost the next game").
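
To make this concrete, here's a rough sketch (not output from the paper's pipeline; the third condition and the unknown-player opponents are just one possible literal translation) of how per-utterance translations like these would slot into the conditioning and query sections of world-model.scm, keeping the strength difference as the query:

    ;; -- CONDITIONING STATEMENTS --
    (condition
      (and
        ;; Condition: Zach beat Alice in the first game
        (won-against '(zach) '(alice))
        ;; Condition: Alice lost the next game as well
        (won-against '(unknown-player) '(alice))
        ;; Condition: and she lost the third and final one.
        (won-against '(another-unknown-player) '(alice))))

    ;; -- QUERY STATEMENT --
    ;; Query: How much better is Zach than Alice?
    (- (strength 'zach) (strength 'alice))

With the game results inside the condition statement, inference should yield a posterior over the strength difference that is shifted above 0, rather than the symmetric prior you saw.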

(2) We've consistently found that newer instruction-tuned GPT models are not necessarily better at playing the role of the meaning function in our framework. In the paper, we ran all the examples with Codex at temperature=0. As a "pure" language model, Codex is (or was, since it's now unfortunately been deprecated) ideal for this kind of literal translation task, particularly at low temperature.

When we started writing this paper in late 2021, the decision to use Codex was informed by what was available. As OpenAI continued to iterate on their products, however, we found that the newer instruction-tuned GPT models tend to (a) generate more boilerplate ("sure, I'd be happy to translate..."); and (b) offer more "interpretive" translations that can sometimes be unnecessarily verbose.

You see this in your example with GPT-4: wrapping the game results in intermediate definitions (e.g., first-game-result) is superfluous and seems to make the model omit the critical step of conditioning on these results, which is why inference returns the symmetric prior distribution you observed. In sum, models that are aligned to human preferences and follow chain-of-thought-style reasoning may actually be less well suited to this literal translation task than language models that have not been finetuned to act as virtual assistants.

(3) Regarding robustness: The examples in the paper are all translations that Codex produced at the time of writing -- and ones that were selected because they worked out-of-box for that model, meaning they didn't require special prompt engineering. We hypothesize that similar code-trained LLMs will have similar capabilities, but the point of the paper isn't to do a systematic benchmark of the constantly-evolving landscape of LLMs. (Though we certainly wish Codex were still available as that would make our examples much more easily reproducible.)

In general, the examples in the paper are meant to be illustrative and provocative. We want people to play with them, which is why we released the world models and prompts in this repo. And we hope that those personal explorations will elicit the same feelings of excitement we feel around the architectural patterns we introduce in the paper. But with respect to the specifics, and in the absence of the original Codex weights and inference pipeline, mileage may vary :)

Thanks for the thought-provoking discussion, which is giving me a lot to think about in terms of how to orient expectations around this code repo.

gabegrand commented 10 months ago

I'm going to go ahead and close this issue for now, since it seems like the initial issue (how to run tug-of-war in the Church play space) has been clarified. Thanks again for the discussion and please feel free to follow up here or get in touch.