BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: add latency in text_completion output format #389

Closed. solyarisoftware closed this issue 1 year ago.

solyarisoftware commented 1 year ago

The Feature

My proposal is to add a latency attribute to the completion output response format, as described here: https://docs.litellm.ai/docs/completion/output

latency (a.k.a. elapsed or response time) is the time, in milliseconds (hence an integer value), that a single completion API call takes.

So, for example, this completion JSON response:

{
  'choices': [
     {
        'finish_reason': 'stop',
        'index': 0,
        'message': {
           'role': 'assistant',
            'content': " I'm doing well, thank you for asking. I am Claude, an AI assistant created by Anthropic."
        }
      }
    ],
 'created': 1691429984.3852863,
 'model': 'claude-instant-1',
 'usage': {'prompt_tokens': 18, 'completion_tokens': 23, 'total_tokens': 41}
}

with the 'latency' attribute added, would become:

{
  'choices': [
     {
        'finish_reason': 'stop',
        'index': 0,
        'message': {
           'role': 'assistant',
            'content': " I'm doing well, thank you for asking. I am Claude, an AI assistant created by Anthropic."
        }
      }
    ],
 'created': 1691429984.3852863,
 'model': 'claude-instant-1',
 'usage': {'prompt_tokens': 18, 'completion_tokens': 23, 'total_tokens': 41},
 'latency': 452
}

Motivation, pitch

As discussed here: https://github.com/BerriAI/litellm/issues/306, latency is one of the fundamental metrics that "physically" measure any LLM completion (or any processing engine). Along with token consumption, it is a basic parameter to measure / compare any LLM generation.

This addition is minimal, non-intrusive and backward-compatible. The implementation is trivial, as I did here: https://github.com/solyarisoftware/prompter.vim/blob/master/python/calculate_latency.py

To be picky, tracing latency introduces a minimal processing overhead (<< 2 msecs) that's negligible, considering that latency times are at least some hundreds of milliseconds even on powerful cloud deployments, and above all considering the benefits of having this metric for later processing/statistics.

BTW, having the latency of each completion run for a certain LLM setting could also be helpful to estimate in advance the latency of a run with a similar setting...

Twitter / LinkedIn details

twitter: @solyarisoftare linkedin: www.linkedin.com/in/giorgiorobino

WilliamEspegren commented 1 year ago

@solyarisoftware big cred for the well-written issue, clear and concise 🙌.

The latency would be affected by network speed, location, availability, current load on the API, etc., so it isn't really an exact measurement, but it could definitely be interesting to dive into.

solyarisoftware commented 1 year ago

@WilliamEspegren You are right: completion latency is a random variable depending on the reasons you mentioned, but it also depends on the "complexity" of the prompt (in part related to the total number of context-window tokens).

By the way, processing time is not an unpredictable value; in probability-theory parlance, it's a random variable that we can characterize, for example, by its mean and standard deviation.

So, when you run a completion, having that sample latency time in milliseconds immediately gives you the "weight" of the completion processing.

Consider these 2 scenarios:

  1. Simple prompt

     prompt: what's the capital of Italy?
     completion: The capital of Italy is Rome.

    If I run it ten times I get these latencies in milliseconds: [329, 333, 324, 263, 293, 261, 240, 238, 329, 295] (Mean: 290.5, Standard Deviation: 35.8503835404867). Broadly speaking, the latency is around 290 msecs (mean) with a pretty small standard deviation of about 36 msecs.

  2. Complex prompt

     prompt:

    TASK 
    As an amazing natural language sentences classifier, given an input sentence,
    you classify the associated intent from a short lists of preset intents.
    
    Examples
    - i: vorrei sapere qual è lo stato del mio ticket numero |12345=TicketNumber|
     o: {"intent":"TicketStatus","entities":{"TicketNumber":"IN-345"}}
    - i: Ho aperto ieri 2 ticket: il |IN-345=TicketNumber| ed il |ON-876=TicketNumber|. Sono stati chiusi?
     o: {"intent":"TicketStatus","entities":{"TicketNumber":["IN-345","ON-876"]}}
    - i: mi dai i miei ultimi ticket aperti
     o: {"intent":"TicketStatus"}
    - i: quali ticket ho aperto?
     o: {"intent":"TicketStatus"}
    - i: Ho un problema sul monitor. Non si accende. Il monitor è un |HP345=Product| ed il computer credo sia un |asus 33=Product|.
     o: {"intent":"IssueReport","entities":{"Description":"Ho un problema sul monitor. Non si accende","Product":["HP345", "asus33"]}}
    - i: |non trovo più il programma per accedere alla RILAT=Description|. Qual'è l'indirizzo?. Mi aiuti?
     o: {"intent":"IssueReport","entities":{"Description":"non trovo più il programma per accedere alla RILAT"}}
    - i: |Non accedo ad internet da sta mattina=Description|. Cosa devo fare?
     o: {"intent":"IssueReport","entities":{"Description":"Non accedo ad internet da sta mattina"}}
    - i: come faccio ad andare al lavoro a piedi?
     o: {"intent":"OutOfScope"}
    - i: come faccio gli sapghetti alla carbonara?
     o: {"intent":"OutOfScope"}
    - i: come faccio ad aprire una segnalazione?
     o: {"intent":"GeneralHelp"}
    - i: il computer dei problemiche prò poi si sono risolti. Devo procedere? Che faccio?
     o: {"NotUnderstand"}
    
    Input
    Dammi lo stato del mio ultimo ticket IN00984
    
    Output

    completion:

    {"intent":"TicketStatus","entities":{"TicketNumber":"IN00984"}}

    If I run it ten times I get these latencies in milliseconds: [672, 800, 474, 411, 425, 440, 525, 1064, 353, 1494] (Mean: 665.8, Standard Deviation: 345.090654756109). Broadly speaking, the latency is around 666 msecs (mean) with a standard deviation of roughly 50% of the mean (345 msecs).


These simple examples (by the way, using an Azure OpenAI deployment) arguably show the relation between latency and the "complexity" (~= token length?) of a given prompt completion. All in all, latency measures the LLM computation time and becomes critical in interactive applications, possibly built from composite LLM calls, where the overall latency is the sum of all the individual latencies.
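For reference, this is roughly how I collect those numbers: time each call externally and summarize with mean and standard deviation. A minimal sketch (the model name and prompt are placeholders for my Azure deployment):

import time
import statistics
from litellm import completion

messages = [{"role": "user", "content": "what's the capital of Italy?"}]

# Time 10 identical completion calls client-side, in milliseconds
latencies_ms = []
for _ in range(10):
    start = time.time()
    completion(model="azure/gpt-35-turbo", messages=messages)
    latencies_ms.append(int((time.time() - start) * 1000))

print(f"samples: {latencies_ms}")
print(f"mean: {statistics.mean(latencies_ms):.1f} msecs")
print(f"stdev: {statistics.stdev(latencies_ms):.1f} msecs")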

krrishdholakia commented 1 year ago

@solyarisoftware if I understand your problem correctly, you're trying to estimate how much time a similar completion run might take.

Why not just do:

import time 

start_time = time.start() 
completion(..)
end_time = time.end() 

latency = end_time - start_time
solyarisoftware commented 1 year ago

Of course, a completion decorator function helps track latency, but only at the application level. That's bad, in my opinion, for the readability of the final application (which may involve many completions).

BTW, in your pseudocode, it's time.time().
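For reference, the corrected snippet (model and messages are just placeholders):

import time
from litellm import completion

start_time = time.time()
response = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hello"}])
end_time = time.time()

latency_ms = int((end_time - start_time) * 1000)  # wall-clock latency in milliseconds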

solyarisoftware commented 1 year ago

Here is a possible implementation with a Python decorator:

import time
import random

def latency(func):
    ''' Decorator for latency calculation. A 'latency' attribute is added to the function's dictionary result. Time is calculated in msecs. '''
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()

        # Calculate the latency in milliseconds
        latency_ms = int((end_time - start_time) * 1000)

        # Add the latency attribute to the result dictionary
        result['latency'] = latency_ms
        return result

    return wrapper

# Apply the @latency decorator to the completion function
@latency
def completion():
    ''' fake a LLM completion '''
    # Sleep for a random number of seconds between 0.1 and 1.9
    sleep_duration = random.uniform(0.1, 1.9)
    time.sleep(sleep_duration)

    # Create and return the dictionary
    result = {
        'choices': [
            {
                'finish_reason': 'stop',
                'index': 0,
                'message': {
                    'role': 'assistant',
                    'content': "I'm doing well, thank you for asking. I am Claude, an AI assistant created by Anthropic."
                }
            }
        ],
        'created': 1691429984.3852863,
        'model': 'claude-instant-1',
        'usage': {
            'prompt_tokens': 18,
            'completion_tokens': 23,
            'total_tokens': 41
        }
    }

    return result

"""
if __name__ == "__main__":
    # Apply the @latency decorator to the specific instance of completion
    completion_with_latency = latency(completion)

    # Test the decorated function
    result = completion_with_latency()
    print(result)
"""
if __name__ == "__main__":
    # Test the decorated function
    result = completion()
    print(result)
{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': "I'm doing well, thank you for asking. I am Claude, an AI assistant created by Anthropic."}}], 'created': 1691429984.3852863, 'model': 'claude-instant-1', 'usage': {'prompt_tokens': 18, 'completion_tokens': 23, 'total_tokens': 41}, 'latency': 402}
ishaan-jaff commented 1 year ago

@solyarisoftware this looks awesome! Are you planning on using your decorator function?

solyarisoftware commented 1 year ago

Well, the idea is to integrate LiteLLM into my prompter.vim Vim plugin project.

As I shared here: https://github.com/BerriAI/litellm/issues/306, latency, throughput and more could be metrics that complement the completion data.

My doubt is whether the decorator approach is the correct one, considering a possible chain of decorators. Not sure, to be honest (I'm not a Python expert).
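For what it's worth, here is a minimal sketch of how two such decorators could chain. The throughput decorator below is hypothetical and, like @latency, just annotates the returned dict; it reuses the @latency decorator (and the time/random imports) from my previous comment:

def throughput(func):
    ''' Hypothetical second decorator: adds completion tokens per second,
        assuming the wrapped function already returns a dict with 'latency' and 'usage'. '''
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        latency_s = result['latency'] / 1000.0
        result['throughput'] = round(result['usage']['completion_tokens'] / latency_s, 2)
        return result
    return wrapper

# Decorators apply bottom-up: @latency wraps the function first (times it and adds 'latency'),
# then @throughput wraps that result and adds 'throughput'.
@throughput
@latency
def completion():
    ''' fake LLM completion, trimmed down from my previous comment '''
    time.sleep(random.uniform(0.1, 1.9))
    return {'usage': {'completion_tokens': 23}}

print(completion())  # e.g. {'usage': {'completion_tokens': 23}, 'latency': 812, 'throughput': 28.33}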

ishaan-jaff commented 1 year ago

@solyarisoftware what do you need from us to integrate LiteLLM to prompter.vim ?

solyarisoftware commented 1 year ago

I just need my time :) As soon as I migrate to LiteLLM, of course, I'll notify you.

krrishdholakia commented 1 year ago

Investigated this further.

OpenAI response objects return the latency via response.response_ms.

I think we could do something similar.
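For reference, with the (pre-1.0) openai Python SDK that looks something like:

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.response_ms)  # round-trip time of the API call, in milliseconds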

solyarisoftware commented 1 year ago

Yes, that was my original point. However, I double-checked, and OpenAI (chat) completion responses do not include the attribute 'response_ms,' at least when rereading the completion object doc. Perhaps you are referring to another LLM provider response format?

krrishdholakia commented 1 year ago

Closing this as it's now added.

def test_completion_ai21():
    model_name = "j2-light"
    try:
        response = completion(model=model_name, messages=messages)
        print(response["response_ms"]
    except Exception as e:
        pytest.fail(f"Error occurred: {e}")
solyarisoftware commented 1 year ago

Hi @krrishdholakia

I double-checked today and the completion object in LiteLLM does NOT include the response_ms attribute, at least when using Azure OpenAI models:

$ cat completion.py

from litellm import completion

user_message_content = "Hello, how are you?"

response = completion(
    model="azure/gpt-35-turbo",
    messages=[{"content": user_message_content, "role": "user"}]
)

print(response)
{
  "id": "chatcmpl-8AakyT0ONGJOE44jKn0KRcYmiZJdt",
  "object": "chat.completion",
  "created": 1697535264,
  "model": "gpt-35-turbo",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "As an AI language model, I don't have feelings, but I'm functioning well. How can I assist you today?"
      }
    }
  ],
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 14,
    "total_tokens": 39
  }
}

BTW, the same happens with the text_completion() function: responses never include the response_ms attribute.

Thanks giorgio

krrishdholakia commented 1 year ago

@solyarisoftware please print response.response_ms. It's a private variable, like how OpenAI does it.
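e.g., reusing your snippet from above:

from litellm import completion

response = completion(
    model="azure/gpt-35-turbo",
    messages=[{"content": "Hello, how are you?", "role": "user"}]
)

print(response.response_ms)  # latency of the call in milliseconds (not included when printing the whole response)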

krrishdholakia commented 1 year ago

Docs - https://docs.litellm.ai/docs/completion/output#additional-attributes

krrishdholakia commented 1 year ago

I agree - since we have to reformat the output for text_completion, this information is lost. We can do better here.
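In the meantime, a possible client-side workaround (just a sketch, not the actual fix) is to time text_completion externally, along the lines of the decorator above:

import time
from litellm import text_completion

start = time.time()
response = text_completion(model="azure/gpt-35-turbo", prompt="Hello, how are you?")
elapsed_ms = int((time.time() - start) * 1000)  # measured client-side, since response_ms is lost in the reformat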