
Improve File Uploads, Vision Always On #1210

Closed by Josh-XT 2 months ago

Josh-XT commented 2 months ago

Improve File Uploads to Memory

While testing, we identified several file types whose chunking into memory needed improvement. As a result, this update includes several improvements.

Add PowerPoint (PPT/PPTX) upload support

When a PowerPoint file is uploaded, it is first converted to PDF and then handled the same way PDFs are handled.
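
For illustration, here is a minimal sketch of one common way to do this conversion, shelling out to LibreOffice in headless mode. The helper name and output layout are hypothetical; AGiXT's internal converter may differ.

```python
import subprocess
from pathlib import Path

def pptx_to_pdf(pptx_path: str, out_dir: str) -> Path:
    """Convert a PowerPoint file to PDF via LibreOffice headless mode.

    Hypothetical helper for illustration only; AGiXT may use a
    different conversion mechanism internally.
    """
    subprocess.run(
        [
            "libreoffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, pptx_path,
        ],
        check=True,
    )
    # LibreOffice writes <stem>.pdf into out_dir
    return Path(out_dir) / (Path(pptx_path).stem + ".pdf")
```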

Improve PDF uploads

When a PDF file is uploaded, we extract its text with pdfplumber and chunk the information into memory, which works well. In addition to that strategy, if a vision_provider is selected for the agent, the PDF is also split into one image per page so the vision model can answer questions about them, and any answers about those images are retained in conversational memory.
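
A rough sketch of that dual strategy is below, using pdfplumber for both text extraction and page rendering. The `agent.memory` and `agent.vision_queue` objects are hypothetical stand-ins for AGiXT's internal interfaces, not its actual API.

```python
import pdfplumber

def ingest_pdf(path: str, agent, vision_enabled: bool):
    """Chunk PDF text into memory; optionally render page images
    for the vision model. `agent` is a hypothetical stand-in."""
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            agent.memory.write_chunks(text, source=f"{path}#page={i}")
            if vision_enabled:
                # Render the page as an image so the vision provider can
                # answer questions about figures, charts, and scans.
                img = page.to_image(resolution=150)
                img.save(f"/tmp/page_{i}.png")
                agent.vision_queue.append(f"/tmp/page_{i}.png")
```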

Improve XLS/XLSX uploads

Uploading an XLS/XLSX file previously stored only the first sheet in memory. It will now iterate over every sheet, convert each one to CSV, and then handle each sheet the same way CSVs are handled.
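
A minimal sketch of the per-sheet split, assuming pandas is available; the helper name and output naming scheme are made up for illustration.

```python
import pandas as pd
from pathlib import Path

def xlsx_to_csvs(xlsx_path: str, out_dir: str) -> list[Path]:
    """Split every sheet of a workbook into its own CSV file.
    Illustrative only; AGiXT's internal helpers may differ."""
    # sheet_name=None loads all sheets as {sheet_name: DataFrame}
    sheets = pd.read_excel(xlsx_path, sheet_name=None)
    csv_paths = []
    for name, frame in sheets.items():
        csv_path = Path(out_dir) / f"{Path(xlsx_path).stem}_{name}.csv"
        frame.to_csv(csv_path, index=False)
        csv_paths.append(csv_path)
    return csv_paths
```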

Improve CSV uploads

When a CSV or XLS/XLSX file is uploaded, each row is now converted to JSON and stored as its own memory, with a reference to where it came from and when it was uploaded. This greatly improves data analysis, which has also been improved with this update. If a spreadsheet is uploaded at the chat completions endpoint, the agent will autonomously perform data analysis based on the user's input and output the results of executed code, such as graphs generated from the data.
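
A sketch of the per-row memory creation is below; `memory_store.write` is a hypothetical placeholder for AGiXT's memory interface, and only the `csv`/`json` usage reflects standard-library behavior.

```python
import csv
import json
from datetime import datetime, timezone

def csv_rows_to_memories(csv_path: str, memory_store):
    """Write one memory per row, serialized as JSON, with provenance
    metadata. `memory_store` is a hypothetical stand-in."""
    uploaded_at = datetime.now(timezone.utc).isoformat()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            memory_store.write(
                json.dumps(row),
                metadata={"source": csv_path, "uploaded_at": uploaded_at},
            )
```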

Vision Always On

Now that PDFs are also split into page images, it makes sense to keep vision available whenever it is relevant, rather than only at the moment an image is first uploaded. If you upload an image in a conversation and have a vision_provider defined for your agent, your input and the image are sent to the vision model, the resulting description is added to the conversation's memories, and that description is injected into context based on the user's input. As long as the image remains relevant to the conversational memories, the vision model is effectively consulted on every interaction involving that image.
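
A simplified sketch of that flow follows; every method name on `agent` here is a hypothetical placeholder for illustration, not AGiXT's actual API.

```python
def handle_user_input(agent, user_input: str, image_path: str | None = None):
    """Sketch of the 'vision always on' flow under assumed interfaces."""
    if image_path and agent.vision_provider:
        # Ask the vision model about the image in light of the user's
        # input, then persist the answer so later turns can retrieve it.
        description = agent.vision_provider.describe(
            image_path, prompt=user_input
        )
        agent.memory.write(description, metadata={"source": image_path})
    # On every turn, relevant image descriptions come back through normal
    # memory retrieval, so the image stays "in context" without re-upload.
    context = agent.memory.search(user_input)
    return agent.llm.complete(user_input, context=context)
```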