velociraptor111 opened this issue 5 years ago
Sagemaker model hosting engineer here. Thanks for your interest in our product!
With respect to your question: currently, it is not possible to increase the 60-second timeout.
@wenzhaoAtAws Are there any plans in the future to allow customers to increase inference timeout?
Also, I noticed the following in the response object of

response = sagemaker_client.invoke_endpoint(EndpointName='pose-estimation', Body=request_body)
print(response)

This is the logged output:
{'ResponseMetadata': {'RequestId': 'a0343e2a-5390-4af0-a7fd-ef63d576ca45', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a0343e2a-5390-4af0-a7fd-ef63d576ca45', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 12 Nov 2019 01:30:32 GMT', 'content-type': 'application/json', 'content-length': '18'}, 'RetryAttempts': 2}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x10c60e358>}
Is it possible for me to set RetryAttempts to 0?
To disable retries, you should be able to do something like the following (please note I haven't tested this code myself; it serves only as a reference):
import boto3
from botocore.config import Config
from sagemaker.session import Session

# Increase the client-side read timeout and disable automatic retries
config = Config(
    read_timeout=80,
    retries={
        'max_attempts': 0
    }
)

# Use the configured runtime client for the SageMaker session
sagemaker_runtime_client = boto3.client('sagemaker-runtime', config=config)
sagemaker_client = Session(sagemaker_runtime_client=sagemaker_runtime_client)
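As a quick sanity check, here is an untested follow-up using the runtime client configured above (the payload and endpoint name are placeholders); with retries disabled, the response metadata should report RetryAttempts as 0:

import json

# Untested: invoke the endpoint through the runtime client configured above,
# so the 0-retry / 80 s read-timeout settings apply to this request.
request_body = json.dumps({"instances": []})           # placeholder payload
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName='pose-estimation',                    # placeholder endpoint name
    ContentType='application/json',
    Body=request_body,
)
print(response['ResponseMetadata']['RetryAttempts'])   # expect 0 with retries disabled
print(response['Body'].read())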
Regarding the feature request, one option is to use SageMaker's batch transform feature (https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html). It may not fit your use case, though...
@ajaykarpur Any update on this? Switching to batch transform doesn't seem attractive for video input.
I agree the 60 s limit is quite low, and I hope they will bump it up a bit.
If you need more than 1 minute (and less than 15), you might be interested in the newest SageMaker offering, Asynchronous Inference. The client sends the payload to the endpoint and the result eventually appears in a specified S3 bucket. You can set up SNS or a Lambda function to inform the client that the result is ready to consume (and, for example, generate an S3 presigned URL in the process).
More on that here: https://aws.amazon.com/about-aws/whats-new/2021/08/amazon-sagemaker-asynchronous-new-inference-option/
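A rough, untested boto3 sketch of what that setup can look like (all resource names below are placeholders, not taken from this thread):

import boto3

# Untested sketch: create an async endpoint config and invoke it.
sm = boto3.client('sagemaker')
sm.create_endpoint_config(
    EndpointConfigName='pose-estimation-async',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'pose-estimation',        # an existing SageMaker model
        'InstanceType': 'ml.g4dn.xlarge',
        'InitialInstanceCount': 1,
    }],
    AsyncInferenceConfig={
        'OutputConfig': {
            'S3OutputPath': 's3://my-bucket/async-results/',
            # Optional SNS topics for success/error notifications:
            # 'NotificationConfig': {'SuccessTopic': '...', 'ErrorTopic': '...'},
        },
    },
)
sm.create_endpoint(EndpointName='pose-estimation-async',
                   EndpointConfigName='pose-estimation-async')

# Invocation: the payload is read from S3, and the result lands in S3 too.
runtime = boto3.client('sagemaker-runtime')
resp = runtime.invoke_endpoint_async(
    EndpointName='pose-estimation-async',
    InputLocation='s3://my-bucket/async-inputs/request.json',
    ContentType='application/json',
)
print(resp['OutputLocation'])   # poll this location, or subscribe to SNS, to know when it's ready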
FYI, even using the async endpoint I'm still getting timed out, and the env variable mentioned at the top of this thread doesn't exist any more. This issue really needs to be fixed.
If anyone wants to know how to work around this, I created my own Docker image from the default inference image. Then I slightly modified serve.py and nginx.conf.template so that they had higher timeout values for gunicorn, tensorflow-serving, and nginx, and copied them over the default files.
Hi @lminer, could you elaborate on how to modify the nginx.conf.template please? Many thanks.
@whh14 I changed it to this. Probably overkill!
load_module modules/ngx_http_js_module.so;

worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log /dev/stderr %NGINX_LOG_LEVEL%;
worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/json;
  access_log /dev/stdout combined;
  js_include tensorflow-serving.js;

  fastcgi_read_timeout %MODEL_TIMEOUT%;
  proxy_read_timeout %MODEL_TIMEOUT%;

  upstream tfs_upstream {
    %TFS_UPSTREAM%;
  }

  upstream gunicorn_upstream {
    server unix:/tmp/gunicorn.sock fail_timeout=1;
  }

  server {
    listen %NGINX_HTTP_PORT% deferred;
    client_max_body_size 0;
    client_body_buffer_size 100m;
    subrequest_output_buffer_size 100m;

    set $tfs_version %TFS_VERSION%;
    set $default_tfs_model %TFS_DEFAULT_MODEL_NAME%;

    location /tfs {
      rewrite ^/tfs/(.*) /$1 break;
      proxy_redirect off;
      proxy_pass_request_headers off;
      proxy_set_header Content-Type 'application/json';
      proxy_set_header Accept 'application/json';
      proxy_pass http://tfs_upstream;
      proxy_read_timeout %MODEL_TIMEOUT%;
      proxy_connect_timeout %MODEL_TIMEOUT%;
      proxy_send_timeout %MODEL_TIMEOUT%;
      send_timeout %MODEL_TIMEOUT%;
    }

    location /ping {
      %FORWARD_PING_REQUESTS%;
    }

    location /invocations {
      %FORWARD_INVOCATION_REQUESTS%;
    }

    location /models {
      proxy_pass http://gunicorn_upstream/models;
      proxy_read_timeout %MODEL_TIMEOUT%;
      proxy_connect_timeout %MODEL_TIMEOUT%;
      proxy_send_timeout %MODEL_TIMEOUT%;
      send_timeout %MODEL_TIMEOUT%;
    }

    location / {
      return 404 '{"error": "Not Found"}';
    }

    keepalive_timeout 3;
  }
}
Hi @lminer, thanks a lot for your help. I have made those changes, and I also made some changes to serve.py, but I am still having the timeout issue (it timed out after 30 seconds while uploading data from inference.py to the model). Could you show me what modifications you made to the serve.py file? Many thanks.
@whh14 I passed the --timeout flag to gunicorn and the --rest_api_timeout_in_ms flag to tensorflow-serving (see the sketch below). That being said, if it's timing out while uploading data to the model, then it seems like you have other problems you need to solve!
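For reference, a rough illustration of where such flags could go when a customized serve.py assembles the server commands; the command strings and variable names below are illustrative, not the container's actual code:

# Illustrative only: names like model_timeout_seconds, tfs_command, and
# gunicorn_command are hypothetical placeholders.
model_timeout_seconds = 600

tfs_command = (
    "tensorflow_model_server "
    "--rest_api_port=8501 "
    f"--rest_api_timeout_in_ms={model_timeout_seconds * 1000} "  # TFS REST API timeout
    "--model_config_file=/sagemaker/model-config.cfg"
)

gunicorn_command = (
    "gunicorn "
    f"--timeout {model_timeout_seconds} "  # worker timeout for the Python pre/post-processing layer
    "-b unix:/tmp/gunicorn.sock "
    "python_service:app"
)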
any plan to work on this in your 2022 roadmap?
Please add configuration for this!
Any updates on this?
Any updates on this?
Any update ?
Any updates on this?
any update?
any update lol?
Update?
any update?
SageMaker supports multiple timeout parameters to control model download, container startup, and the invoke call.
ContainerStartupHealthCheckTimeoutInSeconds - controls the container startup health check timeout.
ModelDataDownloadTimeoutInSeconds - controls the model data download timeout.
(More details can be found in: https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-hosting.html)
And the model servers inside the containers provide control over the invoke call. For example, the SAGEMAKER_TS_RESPONSE_TIMEOUT and SAGEMAKER_MODEL_SERVER_TIMEOUT environment variables can help in controlling the invoke timeout (see the sketch below).
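For illustration, an untested boto3 sketch of where those knobs live (all resource names and values are placeholders; whether the environment variables are honored depends on the serving container):

import boto3

sm = boto3.client('sagemaker')

# The server-side invoke timeout is typically an environment variable on the model;
# which variable applies depends on the serving stack (TorchServe, MMS, ...).
sm.create_model(
    ModelName='my-model',                              # placeholder
    ExecutionRoleArn='arn:aws:iam::123456789012:role/SageMakerRole',
    PrimaryContainer={
        'Image': '<serving-container-image-uri>',      # placeholder
        'ModelDataUrl': 's3://my-bucket/model.tar.gz',
        'Environment': {
            'SAGEMAKER_TS_RESPONSE_TIMEOUT': '600',
            'SAGEMAKER_MODEL_SERVER_TIMEOUT': '600',
        },
    },
)

# The startup/download timeouts are ProductionVariant fields on the endpoint config.
sm.create_endpoint_config(
    EndpointConfigName='my-endpoint-config',           # placeholder
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InstanceType': 'ml.g5.xlarge',
        'InitialInstanceCount': 1,
        'ContainerStartupHealthCheckTimeoutInSeconds': 600,
        'ModelDataDownloadTimeoutInSeconds': 1200,
    }],
)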
Please reopen if you still have an issue with the timeout.
I've only gotten the invoke timeouts to work after contacting AWS Support and having them configure those. I'll note they initially didn't want to do it either, so it's not quite that simple, unfortunately.
No argument that you may want to close this ticket, but I'd kindly suggest this is still an improvement that would be really, really nice to have.
Yeah, I would re-open this. I feel like the person who closed this is misunderstanding the problem.
Reopening the issue as requested. Can you detail the improvement/expectation?
The issue is described pretty well in the original post. I do not have anything to add: https://github.com/aws/sagemaker-python-sdk/issues/1119#issue-521175736
I solved the problem using asynchronous inference: https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html. You don't need to change any part of the Docker image or code; you only have to configure the endpoint to work in async mode.
In any case, it would be interesting to be able to modify the timeout of real-time inference, even if it is limited to a maximum, but in the meantime asynchronous inference can be the solution.
Even async inference doesn't work when I try to stream the response from an LLM. As soon as the response time hits 60 s, it disconnects. I have tried every possible solution, but unfortunately the timeout error still exists.
Asynchronous inference doesn't work for me unfortunately; I'm still getting the timeout error.
@hz-nm I was also able to solve it by creating an async endpoint. I initially had trouble, and I think I first tried the same thing you did. There are actually two ways of doing this asynchronously, and only one worked for me.

The first is boto3.client("sagemaker-runtime").invoke_endpoint_with_response_stream. This didn't work for me: EventStreamError: An error occurred (ModelStreamError) when calling the InvokeEndpointWithResponseStream operation: Your model primary did not complete sending the inference response in the allotted time.

Notes on creating an asynchronous endpoint: pass async_config to the .deploy call params. I also had the following in the env param dictionary of the HuggingFaceModel() call. These were from an old attempt to increase the timeout; I have no idea if they were necessary at all, but I know they didn't work alone. One of them may be an LLM-generated solution and never have been valid.
"SAGEMAKER_TS_RESPONSE_TIMEOUT": json.dumps(599),
"SAGEMAKER_MODEL_SERVER_TIMEOUT": json.dumps(599),
"SAGEMAKER_MODEL_INFERENCE_TIMEOUT": json.dumps(599),
Tag me if I can help further!
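For anyone following along, a rough, untested sketch of that setup; model data, role, framework versions, and bucket paths are placeholders, and the deploy keyword is async_inference_config to the best of my knowledge:

import json
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference import AsyncInferenceConfig

model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",                       # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",            # placeholder
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={
        # The env vars quoted above; possibly unnecessary, kept for completeness.
        "SAGEMAKER_TS_RESPONSE_TIMEOUT": json.dumps(599),
        "SAGEMAKER_MODEL_SERVER_TIMEOUT": json.dumps(599),
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=AsyncInferenceConfig(output_path="s3://my-bucket/async-out/"),
)

resp = predictor.predict_async(input_path="s3://my-bucket/async-in/request.json")
print(resp.output_path)   # the result should eventually appear here in S3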
Hi @mohanasudhan @DocIsenberg @joelachance @lminer @ajaykarpur, I have deployed my Docker image to ECR, created a model, and configured an endpoint with 6 GB of RAM for serverless inference in SageMaker. The endpoint works fine with shorter video inputs (e.g., 10 seconds), but when I send a 3-minute-long video, I encounter a 500 error.
Initially, I suspected it was a memory issue, but upon checking, the memory utilization is only around 25%. I'm wondering if the error is related to a timeout issue.
Is this likely due to a timeout limit for serverless inference? If yes, is there a way to increase the timeout limit for serverless inference? Would increasing the timeout in my Docker container’s server file where I am using gunicorn help resolve this issue? Any guidance on resolving this would be appreciated!
Thank you.
Unbelievable; no wonder I resent using AWS every time I need to implement something new... I spent two days trying to figure out what was going on with my code, which, even using GPUs, takes like 10 min to run. It all runs fine in the background; it's only the client that shuts down after a minute. Absolutely ridiculous and frustrating! I will try async...
It does not make sense: why does AWS Lambda have a 900 s timeout while a SageMaker endpoint has only 60 s?
Async inference endpoints do not work with multi-model endpoints either; you cannot configure an endpoint to be both multi-model capable and asynchronous. I am trying to deploy a multi-model endpoint using the NVIDIA Triton Inference Server container, and when switching models the request disconnects after 60 seconds as well.
Same here, with multiple experiments using a SageMaker endpoint with LMI with the vLLM backend: framework="djl-lmi", version="0.30.0".
1. SageMaker real-time invocation: when I use invoke_endpoint(), I get NO RESULTS within 60 seconds, then: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Connection reset by peer for the endpoint-qwen2-vl-2024-11-18-01-08-03 endpoint. Please retry."
2. SageMaker streaming: when I changed to invoke_endpoint_with_response_stream(), I get streamed token output within 60 seconds, then right after 60 seconds: An error occurred (ModelStreamError) when calling the InvokeEndpointWithResponseStream operation: Your model primary did not complete sending the inference response in the allotted time.
3. SageMaker asynchronous inference: I also implemented async inference using LMI, but got no results in S3, and the log shows "failureReason":"ClientError: The response from container primary did not specify the required Content-Length header".
It seems the 60-second limit is impacting long responses.
The current timeout for InvokeEndpoint is 60 seconds as specified here: https://docs.aws.amazon.com/en_pv/sagemaker/latest/dg/API_runtime_InvokeEndpoint.html
Is there any way we can increase this limit, to say 120 seconds?
EDIT:
Just to be clear, I was able to keep the process on the server running by passing an environment variable in the Model definition, like so:
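(The original snippet is not preserved here; below is a minimal reconstruction of what such a Model definition might look like, assuming the SAGEMAKER_MODEL_SERVER_TIMEOUT variable mentioned elsewhere in this thread. Image URI, model data, role, and endpoint name are placeholders.)

# Assumed reconstruction, not the original snippet: raise the model server's own
# timeout via the Model env so the container keeps processing past 60 seconds.
from sagemaker.model import Model

model = Model(
    image_uri="<inference-image-uri>",               # placeholder
    model_data="s3://my-bucket/model.tar.gz",        # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    env={"SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600"},  # server-side timeout, in seconds
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    endpoint_name="pose-estimation",
)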
Through CloudWatch, I was able to confirm that the task is still running even after 60 seconds. (For my use case, I am processing a video frame by frame.) My question is, however: on the client side I am receiving this kind of error due to the timeout.