BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: Error calling completion on a deployed VertexAI Model Garden LLM endpoint #6480

Open suresiva opened 1 week ago

suresiva commented 1 week ago

What happened?

We have a Llama 3.1 8B model deployed from the Vertex AI Model Garden and exposed for inference through a model endpoint. It takes input in a specific format and generates output as shown below.

JSON request:

curl \
-X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" \
"https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/endpoints/${ENDPOINT_ID}:predict" \
-d '{ "instances": [{"prompt": "What is machine learning?", "max_tokens": 100}] }'

Response:

{
 "predictions": [
   "Prompt:\nWhat is machine learning?\nOutput:\n A broad introduction\nMachine learning is..."
 ],
 "deployedModelId": "xxxx",
 "model": "projects/xxxx/locations/us-central1/models/llama-3-1-8b-instruct-172858156xxxx",
 "modelDisplayName": "llama-3-1-8b-instruct-172858156xxxx",
 "modelVersionId": "1"
}
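
For reference, a rough Python equivalent of the raw curl call above, using the google-cloud-aiplatform SDK (a sketch; the project and endpoint IDs are placeholders):

# Query the deployed Model Garden endpoint directly, mirroring the curl request above.
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")  # numeric endpoint ID of the deployed Llama 3.1 model
response = endpoint.predict(
    instances=[{"prompt": "What is machine learning?", "max_tokens": 100}]
)
print(response.predictions[0])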

We are using LiteLLM v1.50.0-stable, and we configured the deployed Llama 3.1 model on LiteLLM as below:

{
  "model_name": "vertex_ai/meta/llama3-8b-instruct-deployed",
  "litellm_params": {
    "vertex_project": "xxxxxxxxxxxxx",
    "vertex_location": "us-central1",
    "model": "vertex_ai/320911490117586xxxx"
  },
  ...
}
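
For reference, the same entry expressed through the Python SDK's Router (a minimal sketch; the project and endpoint IDs mirror the placeholders above):

# Register the self-deployed Model Garden endpoint with a litellm Router,
# mirroring the proxy config entry shown above.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "vertex_ai/meta/llama3-8b-instruct-deployed",
            "litellm_params": {
                "model": "vertex_ai/320911490117586xxxx",  # numeric Model Garden endpoint ID
                "vertex_project": "xxxxxxxxxxxxx",
                "vertex_location": "us-central1",
            },
        }
    ]
)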

We then make a completion call with a typical payload, as given below:

curl -X POST 'https://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-xxxxxxxx' \
-d '{ "model": "vertex_ai/meta/llama3-8b-instruct-deployed", "messages": [ { "role": "user",  "content": "What is the weather like in Boston today?"} ] }'

LiteLLM returns an HTTP 500 error response, as shown below:

Error occurred while generating model response. Please try again.
Error: Error: 500 litellm.APIConnectionError: 500 Internal Server Error

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/google/api_core/grpc_helpers_async.py", line 85, in __await__
    response = yield from self._call.__await__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "Internal Server Error"
    debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.191.234:443 {grpc_message:"Internal Server Error", grpc_status:13, created_time:"2024-10-28T22:42:25.45051005+00:00"}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/litellm/main.py", line 455, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/litellm/llms/vertex_ai_and_google_ai_studio/vertex_ai_non_gemini.py", line 1125, in async_streaming
    response_obj = await llm_model.predict(
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/aiplatform_v1/services/prediction_service/async_client.py", line 404, in predict
    response = await rpc(
               ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/api_core/grpc_helpers_async.py", line 88, in __await__
    raise exceptions.from_grpc_error(rpc_error) from rpc_error
google.api_core.exceptions.InternalServerError: 500 Internal Server Error

Received Model Group=vertex_ai/meta/llama3-8b-instruct-deployed
Available Model Group Fallbacks=None

While analyzing the Vertex AI model endpoint logs, we found the error trace below:

TypeError: SamplingParams.__init__() got an unexpected keyword argument 'max_retries'

ERROR 2024-10-28T22:49:37.335386991Z ERROR: Exception in ASGI application
ERROR 2024-10-28T22:49:37.335420608Z Traceback (most recent call last):
ERROR 2024-10-28T22:49:37.335427761Z File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
ERROR 2024-10-28T22:49:37.335432529Z result = await app( # type: ignore[func-returns-value]
ERROR 2024-10-28T22:49:37.335437297Z File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
ERROR 2024-10-28T22:49:37.335441827Z return await self.app(scope, receive, send)
ERROR 2024-10-28T22:49:37.335447311Z File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
ERROR 2024-10-28T22:49:37.335451364Z await super().__call__(scope, receive, send)
ERROR 2024-10-28T22:49:37.335455417Z File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
ERROR 2024-10-28T22:49:37.335459470Z await self.middleware_stack(scope, receive, send)
ERROR 2024-10-28T22:49:37.335464Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
ERROR 2024-10-28T22:49:37.335468053Z raise exc
ERROR 2024-10-28T22:49:37.335471868Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
ERROR 2024-10-28T22:49:37.335475683Z await self.app(scope, receive, _send)
ERROR 2024-10-28T22:49:37.335479497Z File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
ERROR 2024-10-28T22:49:37.335483789Z await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
ERROR 2024-10-28T22:49:37.335487604Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR 2024-10-28T22:49:37.335491418Z raise exc
ERROR 2024-10-28T22:49:37.335495471Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR 2024-10-28T22:49:37.335499286Z await app(scope, receive, sender)
ERROR 2024-10-28T22:49:37.335503339Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
ERROR 2024-10-28T22:49:37.335507154Z await self.middleware_stack(scope, receive, send)
ERROR 2024-10-28T22:49:37.335510969Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
ERROR 2024-10-28T22:49:37.335515022Z await route.handle(scope, receive, send)
ERROR 2024-10-28T22:49:37.335518836Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
ERROR 2024-10-28T22:49:37.335522651Z await self.app(scope, receive, send)
ERROR 2024-10-28T22:49:37.335526943Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
ERROR 2024-10-28T22:49:37.335531234Z await wrap_app_handling_exceptions(app, request)(scope, receive, send)
ERROR 2024-10-28T22:49:37.335549354Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
ERROR 2024-10-28T22:49:37.335553407Z raise exc
ERROR 2024-10-28T22:49:37.335557222Z File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
ERROR 2024-10-28T22:49:37.335561037Z await app(scope, receive, sender)
ERROR 2024-10-28T22:49:37.335565090Z File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
ERROR 2024-10-28T22:49:37.335569620Z response = await f(request)
ERROR 2024-10-28T22:49:37.335573673Z File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
ERROR 2024-10-28T22:49:37.335577726Z raw_response = await run_endpoint_function(
ERROR 2024-10-28T22:49:37.335581541Z File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
ERROR 2024-10-28T22:49:37.335585832Z return await dependant.call(**values)
ERROR 2024-10-28T22:49:37.335590124Z File "/workspace/vllm/vllm/entrypoints/api_server.py", line 176, in generate
ERROR 2024-10-28T22:49:37.335594177Z sampling_params = SamplingParams(**request_dict)
ERROR 2024-10-28T22:49:37.335597991Z TypeError: SamplingParams.__init__() got an unexpected keyword argument 'max_retries'


krrishdholakia commented 1 week ago

@suresiva we already support Vertex AI Llama on Model Garden. Please look at the relevant docs - https://docs.litellm.ai/docs/providers/vertex#llama-3-api
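
For reference, the linked docs cover the fully managed Llama 3 API; a minimal sketch of that usage (the model name and parameters here are taken from the linked docs and are shown only as an illustration):

# Call the fully managed Vertex AI Llama 3 API (Model-as-a-Service) via litellm.
from litellm import completion

response = completion(
    model="vertex_ai/meta/llama3-405b-instruct-maas",  # managed Llama 3 API model name from the docs
    messages=[{"role": "user", "content": "What is machine learning?"}],
    vertex_project="your-project-id",
    vertex_location="us-central1",
)
print(response.choices[0].message.content)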

suresiva commented 1 week ago

@krrishdholakia, there are two ways we can deploy Llama 3.1 on Vertex AI:

  1. Fully Managed API service
  2. Self Deployed LLM endpoint from Model Garden

We are currently facing the error posted in this thread while using the second option (a self-deployed LLM endpoint from Model Garden). Please let us know how we can resolve it.

krrishdholakia commented 1 week ago

if you self-deploy, is it the same api spec? @suresiva

if so, it seems like we just need to let you specify this distinction - i.e. "hey, this model follows the vertex/meta spec"

suresiva commented 1 week ago

@krrishdholakia, the self-deployed Llama 3.1 model follows a different request/response spec.

Request:

{ "instances": [{"prompt": "What is machine learning?", "max_tokens": 100}] }

Response:

{
 "predictions": [
   "Prompt:\nWhat is machine learning?\nOutput:\n A broad introduction\nMachine learning is..."
 ],
 "deployedModelId": "xxxx",
 "model": "projects/xxxx/locations/us-central1/models/llama-3-1-8b-instruct-172858156xxxx",
 "modelDisplayName": "llama-3-1-8b-instruct-172858156xxxx",
 "modelVersionId": "1"
}

Behind the scenes, this self-deployed Llama 3.1 model is actually served through the vllm.entrypoints.api_server entrypoint, which does not follow the OpenAI spec.
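
A minimal sketch of the failing path (matching the traceback above): vllm.entrypoints.api_server pops prompt and stream from the JSON body and passes every remaining key to SamplingParams, so an extra field such as max_retries in the forwarded payload triggers the TypeError.

# Rough illustration of the generate handler in vllm/entrypoints/api_server.py; the
# "max_retries" key stands in for the extra field seen in the endpoint logs.
from vllm import SamplingParams

request_dict = {"prompt": "What is machine learning?", "max_tokens": 100, "max_retries": 2}
prompt = request_dict.pop("prompt")
stream = request_dict.pop("stream", False)
sampling_params = SamplingParams(**request_dict)
# -> TypeError: SamplingParams.__init__() got an unexpected keyword argument 'max_retries'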
