BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: Error calling completion on a deployed VertexAI Model Garden LLM endpoint #6480

Open suresiva opened 1 week ago

suresiva commented 1 week ago

What happened?

We have a Llama 3.1 8B model deployed from the Vertex AI Model Garden and exposed for inference through a model endpoint. It takes input in a specific format and generates output as shown below.

JSON request:

curl \
-X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" \
"https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/endpoints/${ENDPOINT_ID}:predict" \
-d '{ "instances": [{"prompt": "What is machine learning?", "max_tokens": 100}] }'

Response:

{
 "predictions": [
   "Prompt:\nWhat is machine learning?\nOutput:\n A broad introduction\nMachine learning is..."
 ],
 "deployedModelId": "xxxx",
 "model": "projects/xxxx/locations/us-central1/models/llama-3-1-8b-instruct-172858156xxxx",
 "modelDisplayName": "llama-3-1-8b-instruct-172858156xxxx",
 "modelVersionId": "1"
}
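
For reference, a rough Python equivalent of the raw curl call above, using the google-cloud-aiplatform SDK (a sketch; the project and endpoint IDs are placeholders):

# Query the deployed Model Garden endpoint directly, mirroring the curl request above.
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")  # numeric endpoint ID of the deployed Llama 3.1 model
response = endpoint.predict(
    instances=[{"prompt": "What is machine learning?", "max_tokens": 100}]
)
print(response.predictions[0])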

We are using LiteLLM v1.50.0-stable, and we configured the deployed Llama 3.1 model on LiteLLM as below:

{
  "model_name": "vertex_ai/meta/llama3-8b-instruct-deployed",
  "litellm_params": {
    "vertex_project": "xxxxxxxxxxxxx",
    "vertex_location": "us-central1",
    "model": "vertex_ai/320911490117586xxxx"
  },
  ...
}
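
For reference, the same entry expressed through the Python SDK's Router (a minimal sketch; the project and endpoint IDs mirror the placeholders above):

# Register the self-deployed Model Garden endpoint with a litellm Router,
# mirroring the proxy config entry shown above.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "vertex_ai/meta/llama3-8b-instruct-deployed",
            "litellm_params": {
                "model": "vertex_ai/320911490117586xxxx",  # numeric Model Garden endpoint ID
                "vertex_project": "xxxxxxxxxxxxx",
                "vertex_location": "us-central1",
            },
        }
    ]
)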

We then make a completion call with a typical payload, as given below:

curl -X POST 'https://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-xxxxxxxx' \
-d '{ "model": "vertex_ai/meta/llama3-8b-instruct-deployed", "messages": [ { "role": "user",  "content": "What is the weather like in Boston today?"} ] }'

LiteLLM returns an HTTP 500 error response, as shown below:

Error occurred while generating model response. Please try again.
Error: Error: 500 litellm.APIConnectionError: 500 Internal Server Error

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/google/api_core/grpc_helpers_async.py", line 85, in __await__
    response = yield from self._call.__await__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "Internal Server Error"
    debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.191.234:443 {grpc_message:"Internal Server Error", grpc_status:13, created_time:"2024-10-28T22:42:25.45051005+00:00"}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/litellm/main.py", line 455, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/litellm/llms/vertex_ai_and_google_ai_studio/vertex_ai_non_gemini.py", line 1125, in async_streaming
    response_obj = await llm_model.predict(
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/aiplatform_v1/services/prediction_service/async_client.py", line 404, in predict
    response = await rpc(
               ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/api_core/grpc_helpers_async.py", line 88, in __await__
    raise exceptions.from_grpc_error(rpc_error) from rpc_error
google.api_core.exceptions.InternalServerError: 500 Internal Server Error

Received Model Group=vertex_ai/meta/llama3-8b-instruct-deployed
Available Model Group Fallbacks=None

While analyzing the Vertex AI model endpoint logs, we found the error trace below:

TypeError: SamplingParams.__init__() got an unexpected keyword argument 'max_retries'

ERROR 2024-10-28T22:49:37.335386991Z ERROR: Exception in ASGI application
ERROR 2024-10-28T22:49:37.335420608Z Traceback (most recent call last):
ERROR 2024-10-28T22:49:37.335427761Z File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
ERROR 2024-10-28T22:49:37.335432529Z result = await app( # type: ignore[func-returns-value]
ERROR 2024-10-28T22:49:37.335437297Z File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
ERROR 2024-10-28T22:49:37.335441827Z return await self.app(scope, receive, send)
ERROR 2024-10-28T22:49:37.335447311Z File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
ERROR 2024-10-28T22:49:37.335451364Z await super().__call__(scope, receive, send)
ERROR 2024-10-28T22:49:37.335455417Z File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
ERROR 2024-10-28T22:49:37.335459470Z await self.middleware_stack(scope, receive, send)
ERROR 2024-10-28T22:49:37.335464Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
ERROR 2024-10-28T22:49:37.335468053Z raise exc
ERROR 2024-10-28T22:49:37.335471868Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
ERROR 2024-10-28T22:49:37.335475683Z await self.app(scope, receive, _send)
ERROR 2024-10-28T22:49:37.335479497Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
ERROR 2024-10-28T22:49:37.335483789Z await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
ERROR 2024-10-28T22:49:37.335487604Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR 2024-10-28T22:49:37.335491418Z raise exc
ERROR 2024-10-28T22:49:37.335495471Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR 2024-10-28T22:49:37.335499286Z await app(scope, receive, sender)
ERROR 2024-10-28T22:49:37.335503339Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
ERROR 2024-10-28T22:49:37.335507154Z await self.middleware_stack(scope, receive, send)
ERROR 2024-10-28T22:49:37.335510969Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
ERROR 2024-10-28T22:49:37.335515022Z await route.handle(scope, receive, send)
ERROR 2024-10-28T22:49:37.335518836Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
ERROR 2024-10-28T22:49:37.335522651Z await self.app(scope, receive, send)
ERROR 2024-10-28T22:49:37.335526943Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
ERROR 2024-10-28T22:49:37.335531234Z await wrap_app_handling_exceptions(app, request)(scope, receive, send)
ERROR 2024-10-28T22:49:37.335549354Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR 2024-10-28T22:49:37.335553407Z raise exc
ERROR 2024-10-28T22:49:37.335557222Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR 2024-10-28T22:49:37.335561037Z await app(scope, receive, sender)
ERROR 2024-10-28T22:49:37.335565090Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
ERROR 2024-10-28T22:49:37.335569620Z response = await f(request)
ERROR 2024-10-28T22:49:37.335573673Z File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
ERROR 2024-10-28T22:49:37.335577726Z raw_response = await run_endpoint_function(
ERROR 2024-10-28T22:49:37.335581541Z File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
ERROR 2024-10-28T22:49:37.335585832Z return await dependant.call(**values)
ERROR 2024-10-28T22:49:37.335590124Z File "/workspace/vllm/vllm/entrypoints/api_server.py", line 176, in generate
ERROR 2024-10-28T22:49:37.335594177Z sampling_params = SamplingParams(**request_dict)
ERROR 2024-10-28T22:49:37.335597991Z TypeError: SamplingParams.__init__() got an unexpected keyword argument 'max_retries'


krrishdholakia commented 1 week ago

@suresiva we already support Vertex AI Llama on Model Garden. Please look at the relevant docs - https://docs.litellm.ai/docs/providers/vertex#llama-3-api
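
For reference, the linked docs cover the fully managed Llama 3 API; a minimal sketch of that usage (the model name and parameters here are taken from the linked docs and are shown only as an illustration):

# Call the fully managed Vertex AI Llama 3 API (Model-as-a-Service) via litellm.
from litellm import completion

response = completion(
    model="vertex_ai/meta/llama3-405b-instruct-maas",  # managed Llama 3 API model name from the docs
    messages=[{"role": "user", "content": "What is machine learning?"}],
    vertex_project="your-project-id",
    vertex_location="us-central1",
)
print(response.choices[0].message.content)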

suresiva commented 1 week ago

@krrishdholakia, there are two ways we can deploy Llama 3.1 on Vertex AI:

  1. Fully Managed API service
  2. Self Deployed LLM endpoint from Model Garden

We are currently facing the error posted in this thread while using the second option (a self-deployed LLM endpoint from Model Garden). Please let us know how we can resolve it.

krrishdholakia commented 1 week ago

if you self-deploy, is it the same api spec? @suresiva

if so, it seems like we just need to let you specify this distinction - i.e. "hey, this model follows the vertex/meta spec"

suresiva commented 1 week ago

@krrishdholakia, the self-deployed Llama 3.1 model follows a different request/response spec.

Request:

{ "instances": [{"prompt": "What is machine learning?", "max_tokens": 100}] }

Response:

{
 "predictions": [
   "Prompt:\nWhat is machine learning?\nOutput:\n A broad introduction\nMachine learning is..."
 ],
 "deployedModelId": "xxxx",
 "model": "projects/xxxx/locations/us-central1/models/llama-3-1-8b-instruct-172858156xxxx",
 "modelDisplayName": "llama-3-1-8b-instruct-172858156xxxx",
 "modelVersionId": "1"
}

Behind the scenes, this self-deployed Llama 3.1 model is actually served through the vllm.entrypoints.api_server entrypoint, which does not follow the OpenAI spec.
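
A minimal sketch of the failing path (matching the traceback above): vllm.entrypoints.api_server pops prompt and stream from the JSON body and passes every remaining key to SamplingParams, so an extra field such as max_retries in the forwarded payload triggers the TypeError.

# Rough illustration of the generate handler in vllm/entrypoints/api_server.py; the
# "max_retries" key stands in for the extra field seen in the endpoint logs.
from vllm import SamplingParams

request_dict = {"prompt": "What is machine learning?", "max_tokens": 100, "max_retries": 2}
prompt = request_dict.pop("prompt")
stream = request_dict.pop("stream", False)
sampling_params = SamplingParams(**request_dict)
# -> TypeError: SamplingParams.__init__() got an unexpected keyword argument 'max_retries'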
