Agenta-AI / agenta

The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.
http://www.agenta.ai

[AGE-163] Propagating the cost from Span to Trace #1595

Closed mmabrouk closed 5 months ago

mmabrouk commented 6 months ago

Right now the user needs to explicitly return, from the traced function, a dict that contains the cost, message, and number of tokens. However, this information is simply the sum of the costs and tokens used across all the spans in the trace. So, instead, we want to propagate the cost from the spans to the trace.

First, we need to determine whether to do the calculation in the SDK, the backend, or the frontend. It looks like the SDK is the right place to do it.

This issue goes hand in hand with another issue for changing the way the playground interacts with the LLM apps (Removing FuncResponse).

We therefore need to determine the schema for the output of the LLM applications. Right now it includes the message, the cost, and the number of tokens.

A first proposal is to require the user to provide only the message; the response would then contain the output and the trace_id (with the cost/tokens inferred from the trace?).
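
To make the proposal concrete, here is a rough sketch of what such a response schema could look like (the field names and the optional cost/usage fields are assumptions, not a final design):

from typing import Optional
from pydantic import BaseModel


class AppOutput(BaseModel):
    # Produced by the user's app: only the message itself.
    message: str
    # Attached by the SDK so cost/tokens can be looked up from the trace.
    trace_id: str
    # Optionally filled in later from the aggregated trace data.
    cost: Optional[float] = None
    usage: Optional[dict] = None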

From SyncLinear.com | AGE-163

mmabrouk commented 6 months ago

Some clarifications:

I think we should not require users to return FuncResponse in their application; it is extremely hard to use. However, I am not sure whether we should still create this FuncResponse implicitly from the SDK. It would be nice if we could keep the output of the LLM applications created with the SDK the same, just adding trace_id. However, I am not sure how the @entrypoint can fetch from the tracing object the cost and number of tokens.

If that is too convoluted, we would just remove cost/tokens from the output schema of the LLM app and use the trace_id in the playground and evaluation (when available) to show the cost/number of tokens.

aybruhm commented 6 months ago

However I am not sure whether we should still create this FuncResponse implicitly from the SDK. It would be nice if we could keep the output of the LLM applications created with the SDK the same, just adding trace_id.

QUICK NOTE: your clarification will only affect users who use the observability decorators (ag.span). Integrating our callback handler through litellm will resolve these concerns for them. Additionally, instrumenting OpenAI will also fix the issue.

Regarding your concern, we could allow users to return only the output of the LLM app, while the SDK handles the FuncResponse. As for tracking the cost and token usage of their LLM app, it seems reasonable to have them ingest the data themselves if they won't be using litellm or the OpenAI instrumentation (that will be available at a later date).

Here's a quick example of how they would ingest the data themselves:

import openai
import agenta as ag

default_prompt = (
    "Give me 10 names for a baby from this country {country} with gender {gender}!!!!"
)

ag.init()
tracing = ag.llm_tracing()
ag.config.default(
    temperature=ag.FloatParam(0.2), prompt_template=ag.TextParam(default_prompt)
)

client = openai.AsyncOpenAI()


@ag.span(type="llm")
async def gpt_4_llm_call(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=ag.config.temperature,
    )
    tokens_usage = response.usage.dict()
    tracing.set_span_attribute(
        "llm_cost",
        {"cost": ag.calculate_token_usage("gpt-4", tokens_usage), "tokens": tokens_usage},
    )  # <-- RIGHT HERE 👋🏾
    return response.choices[0].message.content


@ag.entrypoint
async def generate(country: str, gender: str) -> str:
    prompt = ag.config.prompt_template.format(country=country, gender=gender)
    return await gpt_4_llm_call(prompt=prompt)

However, I am not sure how the @entrypoint can fetch from the tracing object the cost and number of tokens.

The entrypoint decorator has access to the tracing object, which also has access to the method that calculates the cost and tokens.
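
As a minimal sketch of that idea (the trace/span accessors below are assumptions about the tracing object, not the actual SDK API), the entrypoint could sum the per-span llm_cost attributes once the wrapped function has returned:

def aggregate_trace_usage(trace) -> dict:
    # Assumed structure: every LLM span ingested an "llm_cost" attribute via
    # tracing.set_span_attribute(), as in the example above.
    total_cost, total_tokens = 0.0, 0
    for span in trace.spans:  # hypothetical accessor on the trace object
        llm_cost = span.attributes.get("llm_cost", {})
        total_cost += llm_cost.get("cost", 0.0)
        total_tokens += llm_cost.get("tokens", {}).get("total_tokens", 0)
    return {"cost": total_cost, "total_tokens": total_tokens}


async def entrypoint_wrapper(func, **inputs):
    # Run the traced function first; the span attributes only exist afterwards.
    message = await func(**inputs)
    usage = aggregate_trace_usage(tracing.active_trace)  # hypothetical accessor
    return {"message": message, "trace_id": tracing.active_trace.id, **usage}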

Let me know what your thoughts are.

mmabrouk commented 6 months ago

@aybruhm Yep, I agree.

aybruhm commented 6 months ago

It would be nice if we could keep the output of the LLM applications created with the SDK the same, just adding trace_id. However, I am not sure how the @entrypoint can fetch from the tracing object the cost and number of tokens.

New development: while the SDK has access to the tracing object, it doesn't have direct access to the cost and number of tokens. This is because the LLM app needs to have run before the Tracing SDK can calculate the sum of the cost and tokens of all the trace spans. To address this, we can return the trace_id along with the FuncResponse.

However, this approach adds complexity, particularly for the OSS version. In our cloud and enterprise versions, observability is available, so it is feasible to return the trace_id to the frontend and retrieve the summed cost and token usage for the LLM app run from the backend.
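
For illustration, the cloud flow could look roughly like this (the endpoint path and response fields below are placeholders, not the actual observability API):

import requests

AGENTA_API = "http://localhost/api"  # placeholder base URL

# The LLM app returns only its message plus the trace_id...
app_response = {"message": "Aaliyah, Amara, ...", "trace_id": "some-trace-id"}

# ...and the playground/evaluation later asks the backend for the aggregated
# cost and token usage of that trace (illustrative endpoint).
trace = requests.get(
    f"{AGENTA_API}/observability/traces/{app_response['trace_id']}"
).json()
print(trace.get("cost"), trace.get("total_tokens"))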

For the OSS version, we need to find an alternative solution. We should also consider adding documentation showing users how they can track cost and token usage themselves, just like we have now:

@ag.span(type="llm")
async def llm_call(...):
    response = await client.chat.completions.create(...)
    tracing.set_span_attribute(
        "model_config", {"model": model, "temperature": temperature}
    )
    tokens_usage = response.usage.dict()  # type: ignore
    return {
        "cost": ag.calculate_token_usage(model, tokens_usage),
        "message": response.choices[0].message.content,
        "usage": tokens_usage,
    }

What are your thoughts, @mmabrouk?