getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License

Supporting Vision models from Groq #65

Open · VedantR3907 opened this issue 1 month ago

VedantR3907 commented 1 month ago

I tried using vision models like llama-3.2-90b-vision-preview, llama-3.2-11b-vision-preview, and llava-v1.5-7b-4096-preview, but they all fail with the same error (screenshot attached).

pradhyumna85 commented 1 month ago

@VedantR3907, the issue is due to the way litellm validates whether a given model has vision capability. litellm maintains a list of models, with their properties and capabilities, in a static JSON file, and the llama 3.2 models (including the vision variants) are not in it: https://github.com/BerriAI/litellm/blob/fb523b79e9fdd7ce2d3a33f6c57a3679c7249e35/litellm/utils.py#L4974 https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
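
To illustrate (a minimal sketch, assuming a litellm version from around the time of this issue; `has_vision` is just a hypothetical wrapper): the check reduces to a lookup in that static JSON, so a model whose API genuinely accepts images can still be reported as non-vision:

```python
import litellm

def has_vision(model: str) -> bool:
    """Hypothetical wrapper: ask litellm whether a model is flagged as vision-capable."""
    try:
        # litellm resolves this against its bundled model_prices_and_context_window.json
        return litellm.supports_vision(model=model)
    except Exception:
        # models missing from the static JSON may raise instead of returning False
        return False

print(has_vision("gpt-4o"))                             # True: present in the JSON
print(has_vision("groq/llama-3.2-90b-vision-preview"))  # False: absent at the time of this issue
```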

For now, try uninstalling zerox and installing from this fork (#40): `pip install git+https://github.com/pradhyumna85/zerox.git@formatting-control`

and pass `validate_vision_capability=False` to the zerox function to see if that resolves the issue.
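
Usage would look roughly like this (a sketch assuming that fork; `validate_vision_capability` is fork-only and does not exist upstream, and the file path is just an example):

```python
import asyncio
from pyzerox import zerox

async def main():
    result = await zerox(
        file_path="document.pdf",                   # example input
        model="groq/llama-3.2-90b-vision-preview",
        validate_vision_capability=False,           # fork-only flag: skip litellm's vision check
    )
    print(result)

asyncio.run(main())
```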

VedantR3907 commented 1 month ago

I tried to pass the parameter, but the zerox function doesn't accept it (screenshot attached).

There was the same problem with the llava models in ollama, so I tried a workaround in pyzerox\models\modellitellm.py to see if it works, and it does for the ollama llava model (screenshot attached).

But when I tried the same with groq, I got an error, from groq I guess (screenshot attached).

pradhyumna85 commented 1 month ago

@VedantR3907, there was an issue where all the kwargs were being passed through to litellm; I've fixed that. Remove and reinstall pyzerox using the same pip command shared earlier. However, this time there is a new error (screenshot attached): it seems Groq's llama 3.2 vision models don't accept a system message together with messages containing images.

Looks like litellm hasn't added support for vision models from groq.
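
For reference, the message shape Groq's vision endpoint reportedly accepts: instructions folded into the user turn, no system message (a sketch, not taken from the pyzerox codebase; the base64 payload is elided):

```python
from litellm import completion

messages = [
    {
        "role": "user",
        "content": [
            # instructions go here instead of a separate system message
            {"type": "text", "text": "Convert this page to markdown."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }
]

response = completion(model="groq/llama-3.2-11b-vision-preview", messages=messages)
print(response.choices[0].message.content)
```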

pradhyumna85 commented 1 month ago

@VedantR3907, with the latest litellm, 1.50.1 (we use a lower version in pyzerox), I can get image prompting to work with llama 3.2 vision, but the current pyzerox backend puts the instructions in a system prompt, which the groq backend doesn't support alongside image input. Feel free to fork the repo and adapt the model class in pyzerox\models\modellitellm.py to remove the system prompt and provide the instructions in the user prompt, and see if that works. If it goes well, you can raise a PR for it; we just need to make sure the change doesn't break existing models.

VedantR3907 commented 1 month ago

@pradhyumna85, I made changes in modellitellm.py and it works now. I am currently still passing the same system prompt (the one used for all the other models) as user text for the groq models. We could change that; with the bigger Groq models it works perfectly, while the smaller models are not perfect but good.

I only changed the _prepare_messages function in modellitellm.py:

```python
async def _prepare_messages(
    self,
    image_path: str,
    maintain_format: bool,
    prior_page: str,
) -> List[Dict[str, Any]]:
    """Prepares the messages to send to the LiteLLM Completion API.

    :param image_path: Path to the image file.
    :type image_path: str
    :param maintain_format: Whether to maintain the format from the previous page.
    :type maintain_format: bool
    :param prior_page: The markdown content of the previous page.
    :type prior_page: str
    """
    messages: List[Dict[str, Any]] = []

    # Check if the model belongs to the Groq family (model name starts with 'groq/')
    if self.model.startswith('groq/'):
        # Groq's vision endpoint rejects a system message alongside image input,
        # so fold the system instructions into the user message instead.
        user_content = []

        # Add the system prompt content as text in the user message
        user_content.append(
            {
                "type": "text",
                "text": f"{self._system_prompt}",
            }
        )

        # If maintain_format is true, add the prior page's formatting
        if maintain_format and prior_page:
            user_content.append(
                {
                    "type": "text",
                    "text": f'Markdown must maintain consistent formatting with the following page: \n\n """{prior_page}"""',
                }
            )

        # Add the image as part of the user message
        base64_image = await encode_image_to_base64(image_path)
        user_content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{base64_image}"},
            }
        )

        # Append the combined user message
        messages.append(
            {
                "role": "user",
                "content": user_content,
            }
        )

    else:
        # Default behavior for non-Groq models:
        # add the system prompt as a system message
        messages.append(
            {
                "role": "system",
                "content": self._system_prompt,
            }
        )

        # If maintain_format is true, add the prior page's formatting as a system message
        if maintain_format and prior_page:
            messages.append(
                {
                    "role": "system",
                    "content": f'Markdown must maintain consistent formatting with the following page: \n\n """{prior_page}"""',
                }
            )

        # Add the image as part of the user message
        base64_image = await encode_image_to_base64(image_path)
        messages.append(
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                    },
                ],
            }
        )

    return messages
```
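
A note on the design choice: branching on the `groq/` model prefix keeps every other provider on the original system-message path, so existing models are unaffected. The trade-off is that the instructions now consume user-turn context for Groq, and any future provider with the same restriction would need its own prefix check.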
MANOJ21K commented 2 weeks ago

> *(quotes @VedantR3907's comment above in full)*

Can you share all the changes you made, @VedantR3907?

VedantR3907 commented 2 weeks ago

@MANOJ21K, see the code I shared above: I passed the system prompt for the Groq models as a user message, within the user_content list. Copy-paste the code above and you will be able to use the system prompt written by @pradhyumna85.

MANOJ21K commented 2 weeks ago

@pradhyumna85, can you share the modellitellm.py you used and the py-zerox version?