Closed — solyarisoftware closed this issue 1 year ago
@solyarisoftware big cred for the well written issue, clear and concise 🙌.
Latency would be affected by network speed, location, availability, current load on the API, etc., so it isn't really an exact measurement, but it could definitely be interesting to dive into.
@WilliamEspegren You are right: completion latency is a random variable depending on the reasons you mentioned, but it also depends on the "complexity" of the prompt (in part related to the total number of context-window tokens).
By the way, processing time is not an unpredictable value: it's a random variable that we can characterize, for example with its mean and standard deviation, in probability-theory parlance.
So, when you run a completion, having that sample latency time in milliseconds immediately gives you the "weight" of the completion processing.
Consider these 2 scenarios:
simple prompt
prompt:
what's the capital of Italy?
completion:
The capital of Italy is Rome.
If I run it ten times I get these latencies in milliseconds: [329, 333, 324, 263, 293, 261, 240, 238, 329, 295] (mean: 290.5, standard deviation: 35.85). Broadly speaking, the latency is around 290 msecs (on average) with a pretty small standard deviation of 36 msecs.
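The mean and standard deviation quoted above can be reproduced with the standard library (note that `pstdev`, the population standard deviation, is what matches the figures in this thread):

```python
import statistics

# Latency samples (msecs) from the ten runs of the simple prompt
samples = [329, 333, 324, 263, 293, 261, 240, 238, 329, 295]

mean = statistics.mean(samples)     # 290.5
stdev = statistics.pstdev(samples)  # population standard deviation, ~35.85

print(f"Mean: {mean}, Standard Deviation: {stdev}")
```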
complex prompt
prompt:
TASK
As an amazing natural language sentences classifier, given an input sentence,
you classify the associated intent from a short lists of preset intents.
Examples
- i: vorrei sapere qual è lo stato del mio ticket numero |12345=TicketNumber|
o: {"intent":"TicketStatus","entities":{"TicketNumber":"IN-345"}}
- i: Ho aperto ieri 2 ticket: il |IN-345=TicketNumber| ed il |ON-876=TicketNumber|. Sono stati chiusi?
o: {"intent":"TicketStatus","entities":{"TicketNumber":["IN-345","ON-876"]}}
- i: mi dai i miei ultimi ticket aperti
o: {"intent":"TicketStatus"}
- i: quali ticket ho aperto?
o: {"intent":"TicketStatus"}
- i: Ho un problema sul monitor. Non si accende. Il monitor è un |HP345=Product| ed il computer credo sia un |asus 33=Product|.
o: {"intent":"IssueReport","entities":{"Description":"Ho un problema sul monitor. Non si accende","Product":["HP345", "asus33"]}}
- i: |non trovo più il programma per accedere alla RILAT=Description|. Qual'è l'indirizzo?. Mi aiuti?
o: {"intent":"IssueReport","entities":{"Description":"non trovo più il programma per accedere alla RILAT"}}
- i: |Non accedo ad internet da sta mattina=Description|. Cosa devo fare?
o: {"intent":"IssueReport","entities":{"Description":"Non accedo ad internet da sta mattina"}}
- i: come faccio ad andare al lavoro a piedi?
o: {"intent":"OutOfScope"}
- i: come faccio gli sapghetti alla carbonara?
o: {"intent":"OutOfScope"}
- i: come faccio ad aprire una segnalazione?
o: {"intent":"GeneralHelp"}
- i: il computer dei problemiche prò poi si sono risolti. Devo procedere? Che faccio?
o: {"NotUnderstand"}
Input
Dammi lo stato del mio ultimo ticket IN00984
Output
completion:
{"intent":"TicketStatus","entities":{"TicketNumber":"IN00984"}}
If I run it ten times I get these latencies in milliseconds: [672, 800, 474, 411, 425, 440, 525, 1064, 353, 1494] (mean: 665.8, standard deviation: 345.09). Broadly speaking, the latency is around 666 msecs (on average) with a standard deviation of about 50% of the mean (345 msecs).
These simple examples (run, by the way, on an Azure OpenAI deployment) suggest a relation between the latency and the "complexity" (~= token length?) of a given prompt completion. All in all, latency measures the LLM computation time and becomes critical in interactive applications, especially those using composite LLM calls, where the overall latency becomes the sum of all the individual latencies.
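To illustrate that last point: in a pipeline of chained completions, the user-perceived latency is simply the sum of the per-call latencies. A trivial sketch, with hypothetical numbers for a three-step chain:

```python
# Hypothetical latencies (msecs) of three chained LLM calls,
# e.g. classify -> retrieve -> answer
step_latencies = [290, 666, 450]

# The overall pipeline latency is the sum of the individual latencies
total_latency = sum(step_latencies)
print(f"Overall pipeline latency: {total_latency} msecs")  # 1406 msecs
```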
@solyarisoftware if I understand your problem, you're trying to estimate how much time a similar completion run might take.
Why not just do:
import time
start_time = time.start()
completion(..)
end_time = time.end()
latency = end_time - start_time
Of course, a completion decorator function helps track latency, but at the application-usage level. That's bad, in my opinion, in terms of readability of the final application (which may involve many completions).
BTW, in your pseudocode, it's time.time()
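For reference, a corrected version of that sketch. It uses time.perf_counter(), which is preferable to time.time() for measuring elapsed intervals; completion() here is just a stand-in for the real API call:

```python
import time

def completion():
    # Stand-in for a real LLM API call
    time.sleep(0.05)
    return {"choices": []}

start_time = time.perf_counter()
result = completion()
end_time = time.perf_counter()

# Elapsed time of the call, in integer milliseconds
latency_ms = int((end_time - start_time) * 1000)
print(f"latency: {latency_ms} msecs")
```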
Here is a possible implementation with a Python decorator:
import time
import random

def latency(func):
    ''' Decorator for latency calculation. A latency attribute is added to the
        function's dictionary result. Time is calculated in msecs. '''
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        # Calculate the latency in milliseconds
        latency_ms = int((end_time - start_time) * 1000)
        # Add the latency attribute to the result dictionary
        result['latency'] = latency_ms
        return result
    return wrapper

# Apply the @latency decorator to the completion function
@latency
def completion():
    ''' fake an LLM completion '''
    # Sleep for a random number of seconds between 0.1 and 1.9
    sleep_duration = random.uniform(0.1, 1.9)
    time.sleep(sleep_duration)
    # Create and return the dictionary
    result = {
        'choices': [
            {
                'finish_reason': 'stop',
                'index': 0,
                'message': {
                    'role': 'assistant',
                    'content': "I'm doing well, thank you for asking. I am Claude, an AI assistant created by Anthropic."
                }
            }
        ],
        'created': 1691429984.3852863,
        'model': 'claude-instant-1',
        'usage': {
            'prompt_tokens': 18,
            'completion_tokens': 23,
            'total_tokens': 41
        }
    }
    return result

if __name__ == "__main__":
    # Test the decorated function
    result = completion()
    print(result)
{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': "I'm doing well, thank you for asking. I am Claude, an AI assistant created by Anthropic."}}], 'created': 1691429984.3852863, 'model': 'claude-instant-1', 'usage': {'prompt_tokens': 18, 'completion_tokens': 23, 'total_tokens': 41}, 'latency': 402}
@solyarisoftware this looks awesome! Are you planning on using your decorator function?
Well, the idea is to integrate LiteLLM into my prompter.vim Vim plugin project.
As I shared here: https://github.com/BerriAI/litellm/issues/306, latency, throughput and more could be some of the metrics that complement the completion data.
My doubt is whether the decorator approach is the correct one, considering a possible chain of decorators. Not sure, to be honest (I'm not a Python expert).
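Regarding the chaining doubt: Python decorators compose naturally, each one wrapping the next. A minimal sketch under that assumption; the second decorator, add_timestamp, is hypothetical, introduced only to show the stacking:

```python
import time

def latency(func):
    ''' Add a 'latency' key (msecs) to the dict returned by func '''
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        result['latency'] = int((time.time() - start) * 1000)
        return result
    return wrapper

def add_timestamp(func):
    ''' Hypothetical second decorator: add a 'measured_at' key '''
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        result['measured_at'] = time.time()
        return result
    return wrapper

@add_timestamp
@latency
def completion():
    time.sleep(0.01)  # fake LLM call
    return {'choices': []}

result = completion()
# result now carries both 'latency' and 'measured_at' keys
print(result)
```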
@solyarisoftware what do you need from us to integrate LiteLLM into prompter.vim?
I just need my time :) As soon as I migrate to LiteLLM, of course I'll notify you.
Investigated this further.
Openai response objects return the latency via response.response_ms
I think we could do something similar.
Yes, that was my original point. However, I double-checked, and OpenAI (chat) completion responses do not include a response_ms attribute, at least according to a rereading of the completion object docs. Perhaps you are referring to another LLM provider's response format?
Closing this as it's now added.
def test_completion_ai21():
    model_name = "j2-light"
    try:
        response = completion(model=model_name, messages=messages)
        print(response["response_ms"])
    except Exception as e:
        pytest.fail(f"Error occurred: {e}")
Hi @krrishdholakia
I double-checked today and the completion object in LiteLLM does NOT include a response_ms attribute, at least when using Azure OpenAI models:
$ cat completion.py
from litellm import completion

user_message_content = "Hello, how are you?"

response = completion(
    model="azure/gpt-35-turbo",
    messages=[{"content": user_message_content, "role": "user"}]
)

print(response)
{
    "id": "chatcmpl-8AakyT0ONGJOE44jKn0KRcYmiZJdt",
    "object": "chat.completion",
    "created": 1697535264,
    "model": "gpt-35-turbo",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "As an AI language model, I don't have feelings, but I'm functioning well. How can I assist you today?"
            }
        }
    ],
    "usage": {
        "completion_tokens": 25,
        "prompt_tokens": 14,
        "total_tokens": 39
    }
}
BTW, the same happens with the text_completion() function: responses never include the response_ms attribute.
Thanks giorgio
@solyarisoftware please print response.response_ms. It's a private variable, like how OpenAI does it.
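The distinction matters here: if response_ms is an attribute on the response object rather than a key in the dict-style payload, then response["response_ms"] raises a KeyError while response.response_ms works. An illustrative mock of that behavior (this is not the real litellm response class, just a stand-in to show the access pattern):

```python
class MockResponse(dict):
    ''' Illustrative stand-in for a response object that exposes
        timing as an instance attribute rather than a dict key. '''
    def __init__(self, payload, response_ms):
        super().__init__(payload)
        self.response_ms = response_ms

response = MockResponse({"choices": []}, response_ms=402)

print(response.response_ms)       # attribute access works: 402
print("response_ms" in response)  # not part of the dict payload: False
```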
I agree - since we have to reformat the output for text_completion, this information is lost. we can do better here.
The Feature
My proposal is to add a latency attribute to the completion output response format, as described here: https://docs.litellm.ai/docs/completion/output
latency (or elapsed, or response time) is the time, in milliseconds (so an integer value), that the single completion API run takes.
So, for example, consider this completion JSON response:
Just adding the latency attribute, it could become:
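A minimal sketch of the proposed addition, applied to the response format shown earlier in this thread (the latency value here is illustrative):

```python
# Example completion response payload (shape as shown earlier in the thread)
response = {
    "object": "chat.completion",
    "model": "gpt-35-turbo",
    "choices": [],
    "usage": {"prompt_tokens": 14, "completion_tokens": 25, "total_tokens": 39},
}

# Proposed addition: measured elapsed time of the API call, in integer msecs
response["latency"] = 402  # illustrative value

print(response["latency"])
```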
Motivation, pitch
As discussed here: https://github.com/BerriAI/litellm/issues/306, latency is one of the fundamental metrics that "physically" measure any LLM completion, or any elaboration engine. Along with token consumption, it is a basic parameter to measure/compare any LLM generation.
This addition is minimal, non-intrusive, and backward-compatible. The implementation is trivial, as I did here: https://github.com/solyarisoftware/prompter.vim/blob/master/python/calculate_latency.py
To be picky, tracing latency introduces a minimal elaboration time (well under 2 msecs), which is negligible considering that latency times are at least some hundreds of milliseconds even on powerful cloud deployments, and above all considering the benefit of having this metric for successive elaboration/statistics.
BTW, having the latency of each completion run for a certain LLM setting could also be helpful to estimate in advance the latency of a similar run...
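For instance, with a history of measured latencies for a given prompt/model setting, the mean plus one standard deviation gives a rough upper estimate for the next similar run. A sketch using the complex-prompt samples from earlier in this thread:

```python
import statistics

# Measured latencies (msecs) from the complex-prompt runs above
history = [672, 800, 474, 411, 425, 440, 525, 1064, 353, 1494]

mean = statistics.mean(history)
stdev = statistics.pstdev(history)

# Rough heuristic: a similar run will usually stay below mean + stdev
estimate_ms = mean + stdev
print(f"expected ~{mean:.0f} msecs, usually below {estimate_ms:.0f} msecs")
```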
Twitter / LinkedIn details
twitter: @solyarisoftare linkedin: www.linkedin.com/in/giorgiorobino