Mirascope / mirascope

An intuitive approach to building with LLMs
https://docs.mirascope.io/
MIT License

Add multimodal support in prompt_template for easier prompting. #242

Open brenkao opened 1 month ago

brenkao commented 1 month ago

Description

First thing that comes to mind is to add something like this

prompt_template = """
SYSTEM: 
...

USER:
...

IMAGE:
{image_url}
"""

willbakst commented 1 month ago

Since gpt-4o is multi-modal with audio too (and it looks like other players are headed in this direction), it's likely worth also thinking about how to handle both images and audio.

off6atomic commented 1 month ago

Note that users can pass in several images, and images can be part of any user message, not just the final one. A message's content can be either a string or a list. If the user passes in only a string, assume it's just text. If the user passes in a list, assume it's a mix of text, images, audio, etc.
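
For reference, the two content shapes in OpenAI's chat format look like this (a sketch of the wire format, not Mirascope code):

# String content: a plain text message
{"role": "user", "content": "What is the capital of France?"}

# List content: typed parts, in order
{"role": "user", "content": [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
]}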

I suggest something like this:

prompt_template = """
SYSTEM: 
...

USER:
<you need a way to pass a string or a list of multiple input types here>

ASSISTANT:
<AI can also output text, audio, images, etc. Just like in GPT-4o>

USER:
<you need a way to pass a string or a list of multiple input types here>
"""

History should also take this into account.

I'm not sure if this is the right abstraction. Maybe OpenAI's abstraction of representing chat history as a list is already great. Because this is a chat model, not an instruct model, if you want to take advantage of Python features, you should model the prompt template as a list, not as a string.

It feels like we are repeating LangChain's mistake. Forcing prompt_template to be a string introduces magic unnecessarily, doesn't it? (You are forced to come up with your own markup language, similar to YAML or TOML.)

willbakst commented 1 month ago

I agree, which is why we originally opted for enabling writing the messages array directly.

We also enable the MESSAGES: keyword for injecting a list of messages, e.g. chat history. This injects the messages as-is into the messages array and can be used wherever, and however many times, you want in the prompt template, just like other keywords.
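
For example, here's a minimal sketch of injecting history with MESSAGES: (the class and field names are just illustrative):

from mirascope.openai import OpenAICall
from openai.types.chat import ChatCompletionMessageParam

class Librarian(OpenAICall):
    prompt_template = """
    SYSTEM: You are the world's greatest librarian.
    MESSAGES: {history}
    USER: {question}
    """

    question: str
    history: list[ChatCompletionMessageParam] = []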

I do think there is a way to update the prompt template parser to provide good DX for the multi-modal case; however, as you mentioned, I also don't think the solution is a self-defined language.

Just because a multi-modal user message has a content array doesn't necessarily mean that we also need an array in the prompt template via a custom language. In fact, I think there is potentially a rather nice way of writing multi-modal messages still as a single string. For example:

from mirascope.openai import OpenAICall, OpenAIImage

class MultiModalCall(OpenAICall):
    prompt_template = "Can you please describe this image? {image}"

    img_bytes: bytes

    @property
    def image(self) -> OpenAIImage:
        return OpenAIImage(media_type="jpeg", bytes=self.img_bytes)

To me, this feels more natural as a transcript and closer to how I would generally interact with the chat model anyway. Then, under the hood, we can parse the user message into the correct content array when images are provided.
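
Concretely, the parsed messages array for the example above might look something like the following (assuming we target OpenAI's content-array format and base64-encode the bytes; b64_img stands in for that encoding):

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you please describe this image?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64_img}"},
            },
        ],
    }
]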

Of course we would also want to ensure:

  1. Multiple images can be passed into a single user message, just like in the content array. Something like: prompt_template = "Image 1: {image1}, Image 2: {image2}"
  2. Additional convenience for passing in multiple images all at once. Something like: prompt_template = "Images: {images}"
  3. The ability to pass in a URL instead of the bytes, for the added convenience of not having to load the image manually (see the sketch below).
  4. Similar convenience for audio files now that providers look to be headed in that direction (re: gpt-4o and Gemini).
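
For item 3, here's a hypothetical sketch of what a URL-or-bytes image type could look like (the url field and the validator are assumptions, not a settled API; audio could follow the same shape per item 4):

from typing import Optional

from pydantic import BaseModel, model_validator

class OpenAIImage(BaseModel):
    media_type: str = "jpeg"
    bytes: Optional[bytes] = None  # raw image data, if already loaded
    url: Optional[str] = None  # remote image, so users don't have to load it

    @model_validator(mode="after")
    def _exactly_one_source(self) -> "OpenAIImage":
        # require exactly one of `bytes` or `url` (an assumed constraint)
        if (self.bytes is None) == (self.url is None):
            raise ValueError("provide exactly one of `bytes` or `url`")
        return self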

What do you think @off6atomic @brenkao?

brenkao commented 1 month ago

If this works across all our various providers, then I'm all for it.

off6atomic commented 1 month ago

@willbakst I think that's a better syntax for producing a list of inputs indeed. I totally missed that.

However, I still think this is a custom markup, which means it needs to be very easy for users to understand how it's parsed, and there should be a page that explicitly explains how the custom markup is parsed into the OpenAI format (or an internal Mirascope format).

I would suggest using this syntax in a way that makes clear to the user that it's simply being parsed into a list (and that the user controls the order of items in the list).

For example, if the user wants to pass [image, text, image], maybe we can allow something like this: "{image1} Please describe the difference between the left and right image {image2}". If the user wants to pass 2 images without any text, they just provide no text between them, e.g. "{image1} {image2}".

Users should also be allowed to spread the inputs across multiple lines, e.g.

"""
USER:
What is the following image?
{image}

How does it relate to the following audio and video?
{audio} {video}

I want you to describe the relationship in {style} tone.
"""

would be translated to [text, image, text, audio, video, text].

I think this gives users a simple mental model of the parser: it just splits the string by non-text inputs.

One thing we need to make clear to users is how we handle whitespace and newlines surrounding non-text inputs. Should we strip all of it? I think if we do, the behavior is simpler to understand, and most of the time users are going to put non-text inputs at the end of the message anyway.

Here is a typical use case:

"""
USER:
Please look at the cat and dog images and tell me which one is more cute.

{cat_image} {dog_image}
"""

Note that a None image should be allowed; in that case it simply won't create an item in the list we send to OpenAI.
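
To make that parsing rule concrete, here's a minimal sketch of the splitting idea (the Attachment marker and the treatment of None are assumptions for illustration, not Mirascope's actual parser):

import re
from dataclasses import dataclass

@dataclass
class Attachment:
    # hypothetical marker for non-text inputs (image, audio, video)
    kind: str
    data: bytes

def split_content(template: str, fields: dict) -> list:
    """Split a user message into alternating [text, attachment, ...] parts."""
    parts: list = []
    cursor = 0
    for m in re.finditer(r"\{(\w+)\}", template):
        value = fields.get(m.group(1))
        # plain-text fields (e.g. {style}) are skipped here and left in
        # place for normal string formatting later
        if isinstance(value, Attachment) or value is None:
            text = template[cursor:m.start()].strip()
            if text:
                parts.append(text)
            if value is not None:  # a None attachment produces no item
                parts.append(value)
            cursor = m.end()
    tail = template[cursor:].strip()
    if tail:
        parts.append(tail)
    return parts

Running this over the USER message above yields [text, image, text, audio, video, text], and the cat/dog example yields [text, cat_image, dog_image], or just [text, dog_image] when cat_image is None.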

willbakst commented 1 month ago

@off6atomic 100%, everything you've described is pretty much exactly the behavior I would expect. The goal is for the parser to feel intuitive and behave how you would expect so it's "convenient" and not "magic" (but still feels like magic).

Of course, I totally agree that in order for this not to be "magic" we need extremely clear documentation. For the README examples, this will likely mean simple comments plus examples of what the output messages will look like, so it stays succinct. In the concepts/writing_prompts.md docs page we should add a more detailed write-up of exactly what happens under the hood so it's extremely clear to users. With this update we can also mention in the README that users should read the docs for more details.

How we handle the parser will need some more thought as we work toward implementing this feature and see what makes sense from both an internal implementation perspective and an external DX perspective. Mostly, I want to make sure that any decisions we make for parsing image/audio prompts don't have unintended effects on other prompts.

I'm hoping to find some time soon to prioritize this feature now that we've got a good idea of the interface and DX.