fiatrete / OpenDAN-Personal-AI-OS

OpenDAN is an open source Personal AI OS, which consolidates various AI modules in one place for your personal use.
https://opendan.ai

Some questions; the main question is how the LLM interacts with the system #35

Closed glen0125 closed 8 months ago

glen0125 commented 11 months ago

After reading the Workflow code, I would like to share my understanding and questions to see if they align with the overall design. Since I am interested in LLM kernel packaging, my focus is primarily on this area.

  1. A role contains an agent (1:1?), and an agent may have multiple chat sessions (1:n). When is a chat session created, and is the decision to create a new one made by the owner or by the compute_kernel?
  2. Does each LLM call mean that the compute_kernel considers the task impossible to split further for now, or that the task has reached the minimum granularity (such as merging results)? Is it up to the compute_kernel to decide when to invoke the LLM?
  3. `result = await compute_kernel().do_llm_completion(prompt, the_role.agent.get_llm_model_name(), the_role.agent.get_max_token_size())` The LLM model to use is determined when the agent instance is created and cannot be changed afterwards, and agents and LLM instances have an m:n relationship. Is max_token_size an attribute of the agent? What is its relationship to the LLM's max_token?
  4. Following on from point 3, does the compute_kernel need to dynamically create and maintain LLM instances at runtime, or do LLM instances actively register themselves with the compute_kernel?
  5. Each LLM instance should have its own description of its capabilities and an interface for querying its current status (busy or idle). For most API (service) mode LLMs the capabilities are already fixed; differences in capability mainly come down to hardware (when deployed locally). (This can be optimized later.)
  6. An LLM instance has no context of its own; on each call it perceives the context only through the prompt passed in by the compute_kernel.
  7. The context of a chat session may be huge; should the prompt passed to the LLM be trimmed?
  8. Does the compute_kernel look for an idle LLM instance to execute tasks, or does it only submit tasks and leave the LLM clients to queue up by themselves?
glen0125 commented 11 months ago

Does the LLM only interact with the compute_kernel, and is it hidden from other modules?

waterflier commented 11 months ago

> Does the LLM only interact with the compute_kernel, and is it hidden from other modules?

All components interact with the LLM through the interface provided by the compute_kernel.
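
To make that concrete, here is a minimal caller-side sketch mirroring the call quoted in question 3. Only the `do_llm_completion(prompt, model_name, max_token_size)` shape comes from the actual Workflow code; the `StubAgent` and `StubComputeKernel` classes are stand-ins invented for illustration.

```python
import asyncio


class StubAgent:
    """Stand-in for an agent whose persona pins the LLM configuration."""

    def get_llm_model_name(self) -> str:
        return "gpt-4"      # assumed value, for illustration only

    def get_max_token_size(self) -> int:
        return 4096         # assumed value, for illustration only


class StubComputeKernel:
    """Stand-in for the compute_kernel's LLM-facing interface."""

    async def do_llm_completion(self, prompt: str, model_name: str,
                                max_token_size: int) -> str:
        # A real kernel would schedule this onto a compute node that supports
        # LLM completion; here we just echo the request back.
        return f"[{model_name}] completion for: {prompt}"


def compute_kernel() -> StubComputeKernel:
    # The real compute_kernel() returns a process-wide singleton.
    return StubComputeKernel()


async def handle_message(prompt: str, agent: StubAgent) -> str:
    # The pattern every component uses: no direct access to any LLM provider.
    return await compute_kernel().do_llm_completion(
        prompt,
        agent.get_llm_model_name(),
        agent.get_max_token_size(),
    )


if __name__ == "__main__":
    print(asyncio.run(handle_message("Summarize today's messages.", StubAgent())))
```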

waterflier commented 11 months ago

Hello,

I would like to clarify some aspects of our system:

Role and Agent Relationship: The relationship between a role and an agent is akin to an employee holding a position in a company. The company's workflow describes the function of the position, and different agents can hold the same position. Our design does not exclude one agent from holding multiple roles in a workflow, but it's best not to do so. Chat sessions are a separate issue; fundamentally, the state of the workflow consists of chat sessions. I might write an article about this, but for now, you can understand how it works by reading the code, especially which messages enter which chat sessions.
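
As a rough sketch of those relationships (all class and field names below are invented, not OpenDAN's real ones): a workflow defines roles, each role is currently held by one agent, and the workflow's state is essentially its set of chat sessions.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    llm_model_name: str                 # fixed when the agent is created


@dataclass
class Role:
    position: str                       # what the workflow says this position does
    agent: Agent                        # the agent currently holding the position


@dataclass
class ChatSession:
    topic: str
    messages: list[str] = field(default_factory=list)


@dataclass
class Workflow:
    roles: dict[str, Role] = field(default_factory=dict)
    # Fundamentally, the workflow's state is its chat sessions; which messages
    # enter which session is the routing logic worth reading in the code.
    chat_sessions: dict[str, ChatSession] = field(default_factory=dict)
```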

LLM Completion: Yes, currently an LLM completion is indivisible, and the compute_kernel decides when it executes. If no compute node in the compute_kernel supports LLM completion, the call will fail. LLM completion is a simple abstraction that lets the system get up and running quickly from the MVP's perspective.

LLM Model and Max Token: In fact, information like the LLM model and max_token is specified by the agent, which forms part of the agent's persona. Roles can change agents at any time, but once an agent is established, its LLM configuration cannot be changed. The max_token_size is the max_token of the LLM.
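
A minimal sketch of that rule, with invented names: the model name and max_token_size belong to the agent's persona and are frozen once the agent exists, while the role can be re-staffed at any time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentLLMConfig:
    llm_model_name: str      # which model this agent speaks through
    max_token_size: int      # i.e. the max_token of that LLM


class Role:
    def __init__(self, agent_config: AgentLLMConfig):
        self.agent_config = agent_config

    def change_agent(self, new_config: AgentLLMConfig) -> None:
        # Replacing the whole agent is allowed at any time; mutating an
        # existing agent's LLM configuration is not (the dataclass is frozen).
        self.agent_config = new_config


role = Role(AgentLLMConfig("gpt-3.5-turbo", 4096))   # assumed values
role.change_agent(AgentLLMConfig("gpt-4", 8192))     # fine: a new agent
# role.agent_config.llm_model_name = "x"             # would raise FrozenInstanceError
```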

Compute Kernel: The compute_kernel is a singleton and is the most critical basic component of the system, initialized first. We can add/remove compute nodes it sees at runtime to control its behavior. There will undoubtedly be a lot of optimization opportunities in the future, but it's not the focus of this version. Done is better than perfect!
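
A hypothetical illustration of that singleton plus add/remove-at-runtime behaviour; the real ComputeKernel API may look different.

```python
class ComputeKernel:
    _instance = None

    def __new__(cls):
        # One kernel per process: every call returns the same object,
        # which is initialized before the other components.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.nodes = []
        return cls._instance

    def add_compute_node(self, node) -> None:
        self.nodes.append(node)     # extend what the kernel can schedule onto

    def remove_compute_node(self, node) -> None:
        self.nodes.remove(node)     # shrink it again, all at runtime


assert ComputeKernel() is ComputeKernel()   # same instance either way
```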

Compute Node Capabilities: The definition of a compute node's capabilities is straightforward. "Whether it supports LLM" is a capability, not a refinement of some function within the LLM.
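
For example, a compute node could advertise coarse capabilities such as "llm_completion" (names invented here):

```python
class ComputeNode:
    def __init__(self, name: str, capabilities: set[str]):
        self.name = name
        self.capabilities = capabilities    # coarse-grained, e.g. "llm_completion"

    def supports(self, capability: str) -> bool:
        return capability in self.capabilities


local_gpu = ComputeNode("local_gpu", {"llm_completion"})
print(local_gpu.supports("llm_completion"))   # True
```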

Chat Session and Historical Records: According to the token limit of the LLM, the most recent chat records are loaded. In fact, our design leaves it to the workflow developer to define how the prompt is built from the chat session's historical records.
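
One possible policy a workflow developer might write, loading the most recent records that still fit the LLM's token limit (the word-count token estimate below is only a placeholder):

```python
def build_prompt_from_history(messages: list[str], max_token_size: int) -> str:
    selected: list[str] = []
    budget = max_token_size
    # Walk backwards from the newest record and stop once the budget is spent.
    for message in reversed(messages):
        cost = len(message.split())        # naive token estimate, placeholder only
        if cost > budget:
            break
        selected.append(message)
        budget -= cost
    return "\n".join(reversed(selected))   # restore chronological order


history = ["hello", "how are you?", "fine, thanks", "let's plan the trip"]
print(build_prompt_from_history(history, max_token_size=8))
```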

Compute Kernel Scheduling: The implementation of the compute_kernel's schedule method contains a clever algorithm for finding compute nodes. This implementation can be simple or brilliant, but it will never be overly complicated.
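
In that spirit, a deliberately simple schedule could just pick the first idle node that supports the required capability and fail the call otherwise, which matches the LLM-completion behaviour described above. All names below are invented.

```python
class NoUsableNodeError(Exception):
    pass


def schedule(nodes: list, capability: str):
    # Nodes are assumed to expose supports() and is_idle(); see the capability
    # sketch above. If nothing supports the capability, the call fails.
    for node in nodes:
        if node.supports(capability) and node.is_idle():
            return node
    raise NoUsableNodeError(f"no compute node supports {capability}")
```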

I hope this clarifies your queries. Please feel free to reach out if you have any further questions.

wugren commented 11 months ago

There should be several levels of history:

  1. The history of the current session should have the highest priority and should be kept as full as possible
  2. Records of historical sessions of the same user
  3. Historical session records of different users

For levels 2 and 3, the historical records may be quite large and cannot be carried in their entirety. Useful information should be abstracted from them and placed into the latest session, but I haven't figured out how to do that properly yet.
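
If it helps, here is one hedged sketch of that priority idea: fill the prompt budget from the current session first, then same-user history, then other users' history, and summarize or drop whatever is left (the summarization step is the part I'm unsure about, so it is only a comment here). All names and the word-count token estimate are invented.

```python
def assemble_context(levels: list[list[str]], max_token_size: int) -> list[str]:
    """levels[0] = current session, levels[1] = same user, levels[2] = other users."""
    chosen: list[str] = []
    budget = max_token_size
    for level in levels:                      # highest-priority level first
        for record in reversed(level):        # newest records first within a level
            cost = len(record.split())        # placeholder token estimate
            if cost > budget:
                # TODO: instead of dropping the rest, abstract/summarize it and
                # inject the summary into the latest session.
                return chosen
            chosen.append(record)
            budget -= cost
    return chosen
```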