Closed nick-youngblut closed 6 months ago
This is a good idea. It also benefits from the fact that LlamaIndex already has a TokenCountingHandler callback. In your view, what would this look like? Just some extra attributes?
Something like `token_counter.total_prompt_cost`?
One note of caution: Tokenizing, especially with tiktoken, is actually quite slow. Where you can help it, it's generally better to have a separate service handle tokenizing + counting. The main use case I've seen is people want to track their total token usage after making all their requests rather than estimating before.
For AgentOps, we've built out a basic callback handler for LangChain agents. I'm going to put LlamaIndex on our radar as well.
I'm mainly using the SQLTableRetrieverQueryEngine query engine in llama-index, so I can't simply use the user's prompt and the model response to calculate costs with tokencost, given that the query engine utilizes multiple prompt templates for the sql query and summarization.
I found the TokenCountingHandler docs, but this callback tokenizes instead of just providing the text for the entire prompt (& response).
It would be great if tokencost included a callback method that was fully compatible with llama-index, but didn't require tokenization with tiktoken, given it is slow (as you point out). Maybe just a modified version of the Aim Callback could be used?
> I'm mainly using the SQLTableRetrieverQueryEngine query engine in llama-index, so I can't simply use the user's prompt and the model response to calculate costs with tokencost, given that the query engine utilizes multiple prompt templates for the sql query and summarization.
Yep, same issue with LangChain. Llama and LC do a lot of heavy lifting to abstract away the prompt manipulation, but it comes at the cost of users not knowing what the exact inputs + outputs are. LangChain was too difficult for me to pull the prompts out of, so I just relied on the `prompt_tokens` and `completion_tokens` they provide in the generations response (I think this is provided by OpenAI). To calculate the cost, I just multiply them by the cost lookup values in the tokencost `TOKEN_COSTS` dict.
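The arithmetic behind that approach is just a per-token multiply. A minimal sketch, assuming a lookup dict keyed by model name (the dict shape and price values here are illustrative, not tokencost's actual `TOKEN_COSTS` schema):

```python
# Hypothetical cost lookup keyed by model name; prices are illustrative only
# and expressed in USD per token (list prices are usually quoted per 1K tokens).
TOKEN_COSTS = {
    "gpt-3.5-turbo": {"prompt": 0.0015 / 1000, "completion": 0.002 / 1000},
}

def cost_from_usage(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute request cost from the token counts the API already reports."""
    rates = TOKEN_COSTS[model]
    return prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]

print(round(cost_from_usage("gpt-3.5-turbo", 1200, 300), 6))
```

This sidesteps client-side tokenization entirely: the provider's reported usage numbers are authoritative, and the client only does the multiplication.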
I don't know LlamaIndex well enough to know how to extract the prompts. Do they offer something analogous?
If you have a specific fix in mind, happy to merge it in.
I've built off of the SimpleLLMHandler to create a token cost calculator callback:
```python
from typing import Any, Dict, List, Optional, cast

from llama_index.callbacks.base_handler import BaseCallbackHandler
from llama_index.callbacks.schema import CBEventType, EventPayload
from tokencost import calculate_prompt_cost, calculate_completion_cost, USD_PER_TPU


class TokenCostHandler(BaseCallbackHandler):
    """Callback handler for printing LLM prompt/completion costs."""

    def __init__(self, model) -> None:
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])
        self.model = model

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        return

    def end_trace(
        self,
        trace_id: Optional[str] = None,
        trace_map: Optional[Dict[str, List[str]]] = None,
    ) -> None:
        return

    def _calc_llm_event_cost(self, payload: dict) -> None:
        from llama_index.llms import ChatMessage

        if EventPayload.PROMPT in payload:
            # Completion-style event: plain prompt + completion strings
            prompt = str(payload.get(EventPayload.PROMPT))
            completion = str(payload.get(EventPayload.COMPLETION))
            prompt_cost = calculate_prompt_cost(prompt, self.model) / USD_PER_TPU
            completion_cost = calculate_completion_cost(completion, self.model) / USD_PER_TPU
        elif EventPayload.MESSAGES in payload:
            # Chat-style event: join the messages into a single string
            messages = cast(List[ChatMessage], payload.get(EventPayload.MESSAGES, []))
            messages_str = "\n".join([str(x) for x in messages])
            prompt_cost = calculate_prompt_cost(messages_str, self.model) / USD_PER_TPU
            response = str(payload.get(EventPayload.RESPONSE))
            completion_cost = calculate_completion_cost(response, self.model) / USD_PER_TPU
        else:
            # Neither payload shape we know how to cost
            return

        print(f"# Prompt cost: {prompt_cost}")
        print(f"# Completion cost: {completion_cost}")
        print("\n")

    def on_event_start(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        parent_id: str = "",
        **kwargs: Any,
    ) -> str:
        return event_id

    def on_event_end(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> None:
        """Cost the LLM events as needed."""
        if event_type == CBEventType.LLM and payload is not None:
            self._calc_llm_event_cost(payload)
```
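The branch logic in `_calc_llm_event_cost` can be sanity-checked without llama-index by using plain string keys as stand-ins for the `EventPayload` enum, and a crude whitespace-split "tokenizer" at a flat price as a stand-in for tokencost's calculators (all of these stand-ins are assumptions for illustration only):

```python
# String keys standing in for llama-index's EventPayload enum values.
PROMPT, COMPLETION, MESSAGES, RESPONSE = "prompt", "completion", "messages", "response"
PRICE_PER_TOKEN = 0.000002  # hypothetical flat USD rate

def rough_cost(text: str) -> float:
    # Crude stand-in for calculate_prompt_cost / calculate_completion_cost:
    # count whitespace-split "tokens" and multiply by a flat price.
    return len(text.split()) * PRICE_PER_TOKEN

def calc_llm_event_cost(payload: dict) -> tuple:
    """Mirror of the handler's branch logic: returns (prompt_cost, completion_cost)."""
    if PROMPT in payload:  # completion-style event
        return rough_cost(str(payload[PROMPT])), rough_cost(str(payload[COMPLETION]))
    if MESSAGES in payload:  # chat-style event: join messages into one string
        messages_str = "\n".join(str(m) for m in payload.get(MESSAGES, []))
        return rough_cost(messages_str), rough_cost(str(payload[RESPONSE]))
    raise ValueError("unrecognized payload shape")

print(calc_llm_event_cost({"prompt": "two words", "completion": "one"}))
```

The real handler differs in that it uses enum keys and tokencost's model-aware pricing, but the control flow is the same.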
I'm not sure if I should be calculating for `messages_str` and `response`.
This is awesome, thanks! By the looks of it, this seems like the correct implementation. The only thing I'd caution is coming up with the correct name when setting `self.model = model`. There isn't really a standard dictionary of model names yet. That's what I'd hope to achieve with tokencost, and LlamaIndex might have different names.
I don't know the best way to fit this in the repo, perhaps adding a callbacks README section. Any suggestions?
> I'm not sure if I should be calculating for `messages_str` and `response`.
Looks correct to me. Is LlamaIndex returning messages as a messages dict (i.e. what OpenAI suggests using) or as a plain string? `calculate_prompt_cost` is designed to handle either, but there might be a slight over/underestimate unless we know exactly what format of string is being sent to OpenAI.
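The potential over/underestimate comes from the gap between the two shapes a prompt can take. A quick sketch contrasting them (the role names follow OpenAI's chat format; the flattening convention is an arbitrary choice for illustration, and the real chat format also adds per-message structural tokens that a flattened string doesn't capture):

```python
# The same conversation as OpenAI-style message dicts vs. one flattened string.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the sales table."},
]

# One possible flattening; a different convention would yield a different
# token count, which is exactly the estimation ambiguity discussed above.
flattened = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
print(flattened)
```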
> The only thing I'd caution is coming up with the correct name when setting `self.model = model`

I believe llama-index uses the model naming specified by OpenAI (https://platform.openai.com/docs/models). So as long as tokencost supports all of those names, there shouldn't be an issue.
> I don't know the best way to fit this in the repo, perhaps adding a callbacks README section. Any suggestions?

You could include the callback class in tokencost, but llama-index then becomes a dependency.
Created a PR with the callback handler. I imagine some folks using LangChain will want something similar, so it makes sense to add. I was able to add llama-index as an optional dependency: simply `pip install tokencost[llama-index]`, and the callback handler will be available.
Let me know if this helps; if so, I'll merge
> Let me know if this helps; if so, I'll merge
That's awesome! It will be very helpful to have the callback handler in tokencost versus adding the same code to each of my projects. Thanks!
Now available in version 0.0.5 :)
For using tokencost with llama-index, it would be helpful to include info in the tokencost docs on how to use tokencost with a llama-index callback manager, as described in https://docs.llamaindex.ai/en/stable/examples/callbacks/TokenCountingHandler.html