The Feature
Vision models support message content as a list:
vs. the traditional string format:
When I pass the list format to a non-vision model, LiteLLM should automatically convert it to the string format for me.
Presently, attempting this with LiteLLM and e.g. "mistral/mistral-medium" results in a format error from the Mistral API.
Here's a more complex example showing how I think it should work:
If I pass the following message format to a non-vision model:
LiteLLM should automatically convert it to:
{
    "role": "user",
    "content": "Hello!\nWhat’s in this image?"
}
(Joined all text entries with \n and removed all image entries)
Maybe this deserves a warning printout too? To let the user know that they attempted to pass images to a non-vision model and they were removed automatically.
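The conversion described above could be sketched roughly like this (a hypothetical helper, not LiteLLM's actual implementation; the message shape follows OpenAI's vision-style content list):

```python
import warnings


def flatten_message_content(message: dict) -> dict:
    """Collapse list-style content into a plain string for non-vision models.

    Text entries are joined with "\\n"; image entries are dropped with a
    warning. Hypothetical sketch -- not LiteLLM's actual code.
    """
    content = message.get("content")
    if not isinstance(content, list):
        return message  # already a plain string, nothing to do

    texts = [part["text"] for part in content if part.get("type") == "text"]
    if len(texts) != len(content):
        warnings.warn(
            "Image entries were removed: this model does not support vision input."
        )
    return {**message, "content": "\n".join(texts)}


# Example input: the vision-style list format
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Hello!"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        {"type": "text", "text": "What's in this image?"},
    ],
}
print(flatten_message_content(message))
# → {'role': 'user', 'content': "Hello!\nWhat's in this image?"}
```

Passing a message whose content is already a string returns it unchanged, so the same code path works for both formats.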
Motivation, pitch
Supporting "message content as a list" universally is advantageous because it lets developers stick with one format in their code that reliably works with both vision and non-vision models.
Lack of support for this in LiteLLM has forced me to include the following "bandaid fix" in my project: https://github.com/jakobdylanc/discord-llm-chatbot/blob/2eda3f4ad7f7b741776519033f5da23cd70ca2e6/llmcord.py#L94-L95
Twitter / LinkedIn details
No response