elimydlarz opened 1 month ago
Hey, thanks for the question! Unfortunately, ~12 seconds per message is about the expected time assuming you are using OpenAI models. For some broader context: the `add_episode` method does quite a bit to ingest new data. It extracts entities from the current message based on context, summarizes those entities, and extracts facts between them. It also extracts any temporal information about those facts (a date range of when they were true). In addition, it searches the graph for similar entities and facts, deduplicates them, and combines any summary information from the duplicate summaries. It also determines whether any facts invalidate other facts about the same entities.
So, all that being said, there are quite a few things going on, and LLMs aren't particularly fast. We have done some work to optimize the latency of the `add_episode` method and we definitely intend to do more in the future.
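For reference, a single-message ingest looks roughly like this. This is a sketch from memory, so the exact parameter names may differ slightly between graphiti-core versions, and the connection details are placeholders:

```python
import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType


async def ingest_one_message() -> None:
    # Local Neo4j connection; credentials are placeholders.
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")

    # One call per message: entity extraction, summarization, fact extraction,
    # deduplication, and invalidation all happen inside this call, which is
    # why it takes multiple LLM round trips (roughly seconds per message).
    await graphiti.add_episode(
        name="message-001",
        episode_body="Buyer: Is the property on Main St still available?",
        source=EpisodeType.message,
        source_description="chat message",
        reference_time=datetime.now(timezone.utc),
    )


asyncio.run(ingest_one_message())
```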
Now, some things that might help you in your use of Graphiti today. In our implementation of Graphiti at Zep we use low-latency LLM providers and self-hosted models to speed up inference time. We also recommend using smaller LLMs if you can; we optimize our prompts for LLMs around the size of gpt-4o-mini and Llama 3.1 70B.
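Swapping the model is just a matter of passing your own LLM client. Rough sketch below; the import paths and constructor kwargs are from memory, so double-check them against the graphiti-core version you're running:

```python
from graphiti_core import Graphiti
from graphiti_core.llm_client import LLMConfig, OpenAIClient

# Point Graphiti at a smaller / faster model. The same LLMConfig pattern works
# for any OpenAI-compatible endpoint (set base_url for a self-hosted or
# low-latency provider serving e.g. Llama 3.1 70B).
llm_client = OpenAIClient(
    config=LLMConfig(
        api_key="sk-...",        # placeholder
        model="gpt-4o-mini",     # prompts are tuned around this model size
    )
)

graphiti = Graphiti(
    "bolt://localhost:7687",
    "neo4j",
    "password",
    llm_client=llm_client,
)
```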
It sounds like your use case is for conversations, so you should probably use `group_id` fields for your graph. Think of each `group_id` as a separate subgraph, so you can have one for each separate user. You can run `add_episode` calls with different `group_id`s in parallel without anything slowing down. So basically, you could run 10 conversations for unique users in parallel and it would take the same ~2 minutes for those 10 messages.
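For example (a rough sketch reusing the `graphiti` instance from above; `add_message` and `ingest_for_all_users` are hypothetical helpers, and `messages_by_user` is just an illustrative mapping of `group_id` to the latest message):

```python
import asyncio
from datetime import datetime, timezone

from graphiti_core.nodes import EpisodeType


async def add_message(graphiti, group_id: str, text: str) -> None:
    await graphiti.add_episode(
        name=f"{group_id}-message",
        episode_body=text,
        source=EpisodeType.message,
        source_description="chat message",
        reference_time=datetime.now(timezone.utc),
        group_id=group_id,          # one subgraph per user
    )


async def ingest_for_all_users(graphiti, messages_by_user: dict[str, str]) -> None:
    # Episodes in different group_ids don't contend with each other, so the
    # wall-clock time is roughly one episode's latency, not the sum of all.
    await asyncio.gather(
        *(add_message(graphiti, group_id, text)
          for group_id, text in messages_by_user.items())
    )
```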
Furthermore, in a production environment it's standard practice to keep the last few messages in context for a chatbot, and that should still provide plenty of time for those messages to be processed into the graph before they fall out of the context window.
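Something along these lines, i.e. reply from the recent-message buffer immediately and let ingestion run in the background. This is only a sketch of the pattern; `ingest_message` and `respond_from_context` are stand-ins for your Graphiti call and your chatbot logic:

```python
import asyncio


async def ingest_message(text: str) -> None:
    await asyncio.sleep(12)      # stands in for graphiti.add_episode(...)


async def respond_from_context(recent: list[str]) -> str:
    return f"(reply based on the last {len(recent)} messages)"


recent_messages: list[str] = []


async def handle_user_message(text: str) -> str:
    recent_messages.append(text)
    del recent_messages[:-5]                   # keep only the last 5 turns

    # Kick off graph ingestion without awaiting it, so the reply isn't blocked.
    # (In a real app, hold a reference to the task or use a task group.)
    asyncio.create_task(ingest_message(text))
    return await respond_from_context(recent_messages)
```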
Our general strategy for Graphiti is to do as much work and calculate as many artifacts on ingest as possible, since that work can tolerate higher latency without affecting production use cases. The flip side is that retrieval from the graph through our search methods is quite fast, and that is the area where users tend to be more latency sensitive, since they often need Graphiti's content before they can start any LLM business logic or agent flows.
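To make the contrast concrete, retrieval is just a query over the already-built graph. A sketch, with a hypothetical helper; check the `search` signature for the version you're on:

```python
async def lookup_facts(graphiti, query: str, group_id: str) -> list[str]:
    # Retrieval side: hybrid search over the existing graph, much faster than
    # ingest because there is no entity/fact extraction chain to run.
    results = await graphiti.search(
        query,
        group_ids=[group_id],   # restrict the search to one subgraph
        num_results=10,
    )
    return [edge.fact for edge in results]
```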
Thanks heaps for the explanation 🙏
I don't really want to partition by user - in this case the agent is having conversations with prospective sellers on one hand and buyers on the other, so it needs access to dialog with the seller when talking to the buyer, in order to answer questions and make recommendations.
If latency is high, what about throughput? Can many different sets of messages be added concurrently, or is there a single stream of messages being written sequentially?
Partitioning by user was just an example; you can create a partition for any subgraph that you want, you'll just have to define a `group_id` for each partition.

Does it make sense to have a `group_id` for each pair of buyer and seller? So if I have 2 buyers and 2 sellers I'd have 4 `group_id`s?
buyer0 <-group_id_00-> seller0
buyer0 <-group_id_01-> seller1
buyer1 <-group_id_10-> seller0
buyer1 <-group_id_11-> seller1
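Something like this naming scheme is what I have in mind (purely illustrative):

```python
def pair_group_id(buyer_id: str, seller_id: str) -> str:
    # One subgraph per buyer/seller pair, so each conversation stays isolated
    # while a pair's full history can still be queried together.
    return f"{buyer_id}--{seller_id}"


# 2 buyers x 2 sellers -> 4 group_ids, matching the mapping above
group_ids = [
    pair_group_id(buyer, seller)
    for buyer in ("buyer0", "buyer1")
    for seller in ("seller0", "seller1")
]
# ['buyer0--seller0', 'buyer0--seller1', 'buyer1--seller0', 'buyer1--seller1']
```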
I'm assuming you don't want information leakage between these conversations, as they should be private for each pair? Because Graphiti is temporally aware, there is only a single stream per `group_id`, but you can freely add episodes from different `group_id`s in parallel. We are working on a bulk `add_episode` process, but it isn't ready yet, as it gets complex to preserve the temporal order of events when adding in bulk.
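If it helps, that ordering constraint can be handled with a plain per-group queue: each group's worker ingests its messages strictly in arrival order, while different groups run side by side. A rough sketch, where `ingest` is a placeholder for the real `add_episode` call:

```python
import asyncio
from collections import defaultdict


async def ingest(group_id: str, message: str) -> None:
    await asyncio.sleep(1)   # placeholder for graphiti.add_episode(..., group_id=group_id)


queues: dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)


async def group_worker(group_id: str) -> None:
    # Drains one group's queue in order, preserving the temporal sequence of
    # that conversation; workers for other groups run concurrently.
    while True:
        message = await queues[group_id].get()
        await ingest(group_id, message)
        queues[group_id].task_done()


async def main() -> None:
    groups = ["buyer0--seller0", "buyer0--seller1"]
    workers = [asyncio.create_task(group_worker(g)) for g in groups]

    queues["buyer0--seller0"].put_nowait("Buyer: is the house still listed?")
    queues["buyer0--seller1"].put_nowait("Buyer: any open homes this week?")

    await asyncio.gather(*(queues[g].join() for g in groups))
    for worker in workers:
        worker.cancel()


asyncio.run(main())
```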
With only two participants in a conversation I don't think throughput should be an issue, though. You will likely keep 3-5 messages in the context window for your LLM anyway, which should give the graph plenty of time to update with the new data before the information drops out of the context window. And the search side of things will be comparatively fast, so it won't add ~15 seconds to your app's response time.
See this log output from Graphiti:
That's almost 2 minutes for 10 messages 😮
Using the Graphiti server and Neo4j running in Docker, as suggested in the server README. The messages are just conversational messages, but in this case their content is fake (AI-generated conversations), for testing.