epam / ai-dial-adapter-openai

This project implements the AI DIAL API for language models from Azure OpenAI
https://epam-rail.com
Apache License 2.0

Support token counting for streaming mode in GPT4-Vision #46

Closed adubovik closed 9 months ago

adubovik commented 9 months ago

Currently GPT4-Vision is always called in non-streaming mode to get the correct usage from OpenAI.

GPT-4-Vision in streaming mode doesn't return usage, so we have to compute it in the adapter ourselves, following the pricing doc.
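
For reference, a minimal sketch of the tile-based image token formula from OpenAI's vision pricing doc. The constants (85 base tokens, 170 tokens per 512px tile) and the resizing rules are as published at the time of writing and may change, so treat this as an assumption rather than a guaranteed match for OpenAI's billing:

```python
import math

def count_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4-Vision image token cost per OpenAI's pricing doc.

    Assumes the published rules: "low" detail is a flat 85 tokens; "high"
    detail scales the image to fit 2048x2048, then scales the shortest side
    down to 768px, and charges 85 base tokens + 170 per 512px tile.
    """
    if detail == "low":
        return 85

    # Fit within a 2048x2048 square, preserving aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)

    # Scale down so the shortest side is 768px.
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)

    # 170 tokens per 512px tile plus an 85-token base cost.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

For a 1024x1024 image in high detail this gives 4 tiles, i.e. 85 + 4 * 170 = 765 tokens, matching the example in the pricing doc.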

ishaan-jaff commented 9 months ago

Hi @adubovik, I'm the maintainer of LiteLLM (https://github.com/BerriAI/litellm). We allow you to do cost tracking for 100+ LLMs.

Usage

Docs: https://docs.litellm.ai/docs/#calculate-costs-usage-latency

```python
from litellm import completion, completion_cost
import os

os.environ["OPENAI_API_KEY"] = "your-api-key"

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)

# completion_cost computes the dollar cost of the call from the response.
cost = completion_cost(completion_response=response)
print("Cost for completion call with gpt-3.5-turbo: ", f"${float(cost):.10f}")
```

Usage for streaming

```python
import litellm
from litellm import completion

# Callback invoked by LiteLLM after every successful completion call.
def track_cost_callback(
    kwargs,                 # kwargs passed to completion
    completion_response,    # response from completion
    start_time, end_time    # start/end time of the call
):
    try:
        # Check whether LiteLLM has collected the entire stream response.
        if "complete_streaming_response" in kwargs:
            # For streaming cost we pass the input "messages" and the
            # assembled output text to litellm.completion_cost.
            completion_response = kwargs["complete_streaming_response"]
            input_text = kwargs["messages"]
            output_text = completion_response["choices"][0]["message"]["content"]
            response_cost = litellm.completion_cost(
                model=kwargs["model"],
                messages=input_text,
                completion=output_text,
            )
            print("streaming response_cost", response_cost)
    except Exception:
        pass

# Register the custom callback.
litellm.success_callback = [track_cost_callback]

# Make a streaming litellm.completion() call.
response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hi 👋 - i'm openai"}],
    stream=True,
)

# Consume the stream; the callback fires once the final chunk arrives.
for chunk in response:
    pass
```

We also allow you to create a self-hosted, OpenAI-compatible proxy server to make your LLM calls (100+ LLMs) and track costs and token usage. Docs: https://docs.litellm.ai/docs/simple_proxy
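
A minimal sketch of calling such a proxy with the standard OpenAI Python client; the base URL and port here are assumptions (the default port has varied across LiteLLM versions), so check your proxy's startup log for the actual address:

```python
import openai

# Point the stock OpenAI client at a locally running LiteLLM proxy,
# e.g. started with `litellm --model gpt-3.5-turbo`. The port below is
# an assumption; real provider keys live in the proxy's configuration.
client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000",
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
```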

I hope this is helpful; if not, I'd love your feedback on what we can improve.