Open BabyCNM opened 1 month ago
Hey @BabyCNM... thanks for submitting this; it looks useful. Would you be able to comment with some short examples of how images and audio are currently included, and then updated examples that work with your code?
I'm happy to test it out :)
Here is an example where the current implementation fails but the edited version works.
prompt = """Read the screenshot image and the website's source code. Then, answer the user's question.
User Question: is the button below or above the image?
Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>
--- HTML Code ---
<!DOCTYPE html>
<html lang="en">
<body>
<img src="website/relative/path/300.jpg" alt="Placeholder Image">
<button onclick="alert('Button clicked!')">Click Me</button>
</body>
</html>
"""
Note that the "<img" tag appears in two places. The first should be interpreted as an image to be sent to GPT-4o, while the second (embedded in the HTML code) should be treated as code rather than as an image, as the sketch below illustrates.
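To make the distinction concrete, here is a minimal sketch of the matching behavior (the regex patterns are my illustration, not the actual autogen implementation): a lenient pattern captures both tags, while a strict pattern that rejects spaces and quotes inside the tag captures only the bare file path.

import re

prompt = """Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>
<img src="website/relative/path/300.jpg" alt="Placeholder Image">"""

# Lenient pattern: anything up to the closing bracket. This matches BOTH
# tags, so the HTML <img> is wrongly treated as a multimodal component.
lenient = re.findall(r"<img ([^>]+)>", prompt)

# Strict pattern: the tag content may not contain whitespace or quotes,
# so only the bare file path matches and the HTML tag is left as code.
strict = re.findall(r"<img ([^>\s'\"]+)>", prompt)

print(len(lenient))  # 2
print(strict)        # ['C:/User/xyz/Desktop/screenshot_3.jpg']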
Why are these changes needed?
The autogen tag parsing system uses HTML-like tags to let users include images and audio directly in text. However, this system may mistakenly interpret actual HTML content (such as a website's source code) as multimodal components for GPT-4o and other VLMs, which is undesirable.
Fortunately, autogen's tag format differs from HTML: in autogen, file paths do not require quotation marks. To improve parsing accuracy, we've introduced a strict_filepath_match parameter for the multimodal utilities. When enabled (True), only simple tag contents (no spaces or quotes) are matched, which makes the parser reliable at detecting bare filenames while ignoring full HTML syntax. This parameter is turned on (True) when parsing multimodal agents' messages; a sketch of how it could gate the matching follows.
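For concreteness, the parameter could switch between the two patterns from the sketch above along these lines (a hedged sketch under my own assumptions; the helper name find_img_tags and its signature are hypothetical, not the actual autogen API):

import re

def find_img_tags(text: str, strict_filepath_match: bool = True) -> list[str]:
    """Illustrative helper: return the contents of <img ...> tags in text.

    With strict_filepath_match=True, only quote- and space-free contents
    (i.e., bare file paths) are matched, so real HTML such as
    <img src="..." alt="..."> is ignored rather than parsed as an image.
    """
    pattern = r"<img ([^>\s'\"]+)>" if strict_filepath_match else r"<img ([^>]+)>"
    return re.findall(pattern, text)

# Usage: only the autogen-style tag is picked up.
find_img_tags('Screenshot: <img shot.jpg> and <img src="a.jpg" alt="x">')
# -> ['shot.jpg']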
Note: This is a custom tagging convention, which could be confusing for some users. Please share any recommendations regarding the current design. Further simplification of the message component is planned for future updates.
Related issue number
Checks