eidolon-ai / eidolon

The first AI Agent Server, Eidolon is a pluggable Agent SDK and enterprise ready, deployment server for Agentic applications
https://www.eidolonai.com/
Apache License 2.0
292 stars 32 forks

Basic Gemini Support #831

Open coryfoo opened 1 month ago

coryfoo commented 1 month ago

Adds basic support for Google's Gemini LLM (gemini-1.5-flash). Does not include support for multi-modal interactions with Gemini-based models (i.e., no speech, images, etc.).

I've taken the code from the existing branch and updated it so that the protobuf serialization works, specifically when declaring tool support. A chat-only version of this implementation works fine.

The tool calling portions are untested (could use some help with setting this up locally!).

coryfoo commented 1 month ago

The documentation for image and audio uploads suggests that Gemini does not support one-shot prompts with the image data embedded in the message content, as other models do; instead, you must upload a file and refer to that upload in the prompt. Because of that, I wasn't sure of the best approach to supporting that paradigm and would be happy to chat about potential implementation strategies. @dbrewster @LukeLalor?
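For context, the difference between the two paradigms can be sketched like this. The client, store, and message shapes below are hypothetical stand-ins, not the actual Gemini SDK or Eidolon interfaces; the point is only that the upload-and-reference flow forces a second, stateful step before the prompt is sent:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class FileHandle:
    """Stand-in for the handle Gemini's File API returns after an upload."""
    uri: str
    mime_type: str

class FakeFileStore:
    """Hypothetical stand-in for an upload endpoint; tracks what was uploaded."""
    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def upload(self, name: str, data: bytes, mime_type: str) -> FileHandle:
        uri = f"files/{name}"
        self._files[uri] = data
        return FileHandle(uri=uri, mime_type=mime_type)

def inline_prompt(text: str, image_b64: str) -> list[dict[str, Any]]:
    # One-shot paradigm: the image bytes travel inside the message itself.
    return [{"type": "text", "text": text},
            {"type": "image", "data": image_b64}]

def upload_then_reference(store: FakeFileStore, text: str,
                          name: str, data: bytes) -> list[dict[str, Any]]:
    # Upload-and-reference paradigm: upload first, then point the
    # prompt at the returned handle instead of embedding the bytes.
    handle = store.upload(name, data, mime_type="image/png")
    return [{"type": "text", "text": text},
            {"type": "file_ref", "uri": handle.uri}]
```

The second function is what makes this awkward for a stateless LLM unit: someone has to own the `FakeFileStore` equivalent and remember which handles map to which conversation.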

LukeLalor commented 1 month ago

> The documentation for image and audio uploads seems to suggest that it does not support one-shot model prompts that include the image data embedded within the content as other models do, but rather you must upload a file and refer to that upload in the prompt. Because of that, I wasn't sure of the best approach to take to add support for that paradigm and would be happy to chat about potential implementation strategies @dbrewster @LukeLalor ?

What is their API? Uploading the files would be fine, but we would obviously need to keep track of what has been uploaded. That's doable, but it's just another piece of complexity floating around. We could also have a V1 implementation that does not support multi-media natively and relies on the image processors (this is how we do multi-media support for text-only models). Either option would be perfectly reasonable, but I would probably implement the second and hold off on the first until we have customers clamoring for it. I am on vacation this week, but will ping @dbrewster to see if he has thoughts.
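The second option can be sketched roughly as below. This is a hypothetical illustration, not Eidolon's actual image-processor interface: media parts are pre-processed into text descriptions before the call, so the Gemini unit only ever sees a text-only conversation:

```python
from typing import Callable

Message = dict[str, str]

def flatten_media(messages: list[dict],
                  describe_image: Callable[[bytes], str]) -> list[Message]:
    """Replace image parts with text descriptions so a text-only LLM
    unit can handle the conversation. `describe_image` stands in for
    whatever image processor produces the description."""
    out: list[Message] = []
    for msg in messages:
        parts = []
        for part in msg["content"]:
            if part["type"] == "image":
                # Swap the raw bytes for a textual description.
                parts.append(f"[image: {describe_image(part['data'])}]")
            else:
                parts.append(part["text"])
        out.append({"role": msg["role"], "content": " ".join(parts)})
    return out
```

Usage would look like `flatten_media(messages, processor.describe)`; the Gemini unit then needs no File API bookkeeping at all, which is what makes this attractive as a V1.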