velociraptor111 opened this issue 5 years ago
Sagemaker model hosting engineer here. Thanks for your interest in our product!
With respect to your question: currently, it is not possible to increase the 60-second timeout.
@wenzhaoAtAws Are there any plans in the future to allow customers to increase inference timeout?
Also, I noticed the following in the response object of

response = sagemaker_client.invoke_endpoint(EndpointName='pose-estimation', Body=request_body)
print(response)

This is the logged output:
{'ResponseMetadata': {'RequestId': 'a0343e2a-5390-4af0-a7fd-ef63d576ca45', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a0343e2a-5390-4af0-a7fd-ef63d576ca45', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 12 Nov 2019 01:30:32 GMT', 'content-type': 'application/json', 'content-length': '18'}, 'RetryAttempts': 2}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x10c60e358>}
Is it possible for me to set RetryAttempts to 0?
To disable retries, you should be able to do something like the following (please note I haven't tested this code myself; it serves only as a reference):
import boto3
from botocore.config import Config
from sagemaker.session import Session

# Increase the client-side read timeout and disable automatic retries
config = Config(
    read_timeout=80,
    retries={
        'max_attempts': 0
    }
)

# Use the configured runtime client for the SageMaker session
sagemaker_runtime_client = boto3.client('sagemaker-runtime', config=config)
sagemaker_client = Session(sagemaker_runtime_client=sagemaker_runtime_client)
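As a quick sanity check, here is an untested follow-up using the runtime client configured above (the payload and endpoint name are placeholders); with retries disabled, the response metadata should report RetryAttempts as 0:

import json

# Untested: invoke the endpoint through the runtime client configured above,
# so the 0-retry / 80 s read-timeout settings apply to this request.
request_body = json.dumps({"instances": []})           # placeholder payload
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName='pose-estimation',                    # placeholder endpoint name
    ContentType='application/json',
    Body=request_body,
)
print(response['ResponseMetadata']['RetryAttempts'])   # expect 0 with retries disabled
print(response['Body'].read())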
Regarding the feature request, one option is to use SageMaker's batch transform feature (https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html). It may not fit your use case, though...
@ajaykarpur Any update on this? Switching to batch transform doesn't seem attractive for video input.
I agree the 60 s limit is quite low, and I hope they will bump it up a bit.
If you need more than 1 minute (and less than 15), you might be interested in the newest SageMaker offering, Asynchronous Inference. The client sends the payload to the endpoint and the result eventually appears in a specified S3 bucket. You can set up SNS or a Lambda function to inform the client that the result is ready to consume (and, for example, generate an S3 presigned URL in the process).
More on that here: https://aws.amazon.com/about-aws/whats-new/2021/08/amazon-sagemaker-asynchronous-new-inference-option/
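A rough, untested boto3 sketch of what that setup can look like (all resource names below are placeholders, not taken from this thread):

import boto3

# Untested sketch: create an async endpoint config and invoke it.
sm = boto3.client('sagemaker')
sm.create_endpoint_config(
    EndpointConfigName='pose-estimation-async',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'pose-estimation',        # an existing SageMaker model
        'InstanceType': 'ml.g4dn.xlarge',
        'InitialInstanceCount': 1,
    }],
    AsyncInferenceConfig={
        'OutputConfig': {
            'S3OutputPath': 's3://my-bucket/async-results/',
            # Optional SNS topics for success/error notifications:
            # 'NotificationConfig': {'SuccessTopic': '...', 'ErrorTopic': '...'},
        },
    },
)
sm.create_endpoint(EndpointName='pose-estimation-async',
                   EndpointConfigName='pose-estimation-async')

# Invocation: the payload is read from S3, and the result lands in S3 too.
runtime = boto3.client('sagemaker-runtime')
resp = runtime.invoke_endpoint_async(
    EndpointName='pose-estimation-async',
    InputLocation='s3://my-bucket/async-inputs/request.json',
    ContentType='application/json',
)
print(resp['OutputLocation'])   # poll this location, or subscribe to SNS, to know when it's ready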
FYI, even using the async endpoint I'm still getting timed out, and the env variable mentioned at the top of this thread doesn't exist any more. This issue really needs to be fixed.
If anyone wants to know how to work around this, I created my own Docker image from the default inference image. Then I slightly modified serve.py and nginx.conf.template so that they had higher timeout values for gunicorn, tensorflow-serving, and nginx, and copied them over the default files.
Hi @lminer, could you elaborate on how to modify the nginx.conf.template please? Many thanks.
@whh14 I changed it to this. Probably overkill!
load_module modules/ngx_http_js_module.so;

worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log /dev/stderr %NGINX_LOG_LEVEL%;
worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/json;
  access_log /dev/stdout combined;
  js_include tensorflow-serving.js;

  fastcgi_read_timeout %MODEL_TIMEOUT%;
  proxy_read_timeout %MODEL_TIMEOUT%;

  upstream tfs_upstream {
    %TFS_UPSTREAM%;
  }

  upstream gunicorn_upstream {
    server unix:/tmp/gunicorn.sock fail_timeout=1;
  }

  server {
    listen %NGINX_HTTP_PORT% deferred;
    client_max_body_size 0;
    client_body_buffer_size 100m;
    subrequest_output_buffer_size 100m;

    set $tfs_version %TFS_VERSION%;
    set $default_tfs_model %TFS_DEFAULT_MODEL_NAME%;

    location /tfs {
      rewrite ^/tfs/(.*) /$1 break;
      proxy_redirect off;
      proxy_pass_request_headers off;
      proxy_set_header Content-Type 'application/json';
      proxy_set_header Accept 'application/json';
      proxy_pass http://tfs_upstream;
      proxy_read_timeout %MODEL_TIMEOUT%;
      proxy_connect_timeout %MODEL_TIMEOUT%;
      proxy_send_timeout %MODEL_TIMEOUT%;
      send_timeout %MODEL_TIMEOUT%;
    }

    location /ping {
      %FORWARD_PING_REQUESTS%;
    }

    location /invocations {
      %FORWARD_INVOCATION_REQUESTS%;
    }

    location /models {
      proxy_pass http://gunicorn_upstream/models;
      proxy_read_timeout %MODEL_TIMEOUT%;
      proxy_connect_timeout %MODEL_TIMEOUT%;
      proxy_send_timeout %MODEL_TIMEOUT%;
      send_timeout %MODEL_TIMEOUT%;
    }

    location / {
      return 404 '{"error": "Not Found"}';
    }

    keepalive_timeout 3;
  }
}
Hi @lminer, thanks a lot for your help. I have made those changes, and I also made some changes to serve.py, but I am still having the timeout issue (it timed out after 30 seconds while uploading data from inference.py to the model). Could you show me what modifications you made to the serve.py file? Many thanks.
@whh14 I passed the --timeout flag to gunicorn and the --rest_api_timeout_in_ms flag to tensorflow-serving (see the sketch below). That being said, if it's timing out while uploading data to the model, then it seems like you have other problems you need to solve!
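For reference, a rough illustration of where such flags could go when a customized serve.py assembles the server commands; the command strings and variable names below are illustrative, not the container's actual code:

# Illustrative only: names like model_timeout_seconds, tfs_command, and
# gunicorn_command are hypothetical placeholders.
model_timeout_seconds = 600

tfs_command = (
    "tensorflow_model_server "
    "--rest_api_port=8501 "
    f"--rest_api_timeout_in_ms={model_timeout_seconds * 1000} "  # TFS REST API timeout
    "--model_config_file=/sagemaker/model-config.cfg"
)

gunicorn_command = (
    "gunicorn "
    f"--timeout {model_timeout_seconds} "  # worker timeout for the Python pre/post-processing layer
    "-b unix:/tmp/gunicorn.sock "
    "python_service:app"
)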
any plan to work on this in your 2022 roadmap?
Please add configuration for this!
Any updates on this?
Any updates on this?
Any update ?
Any updates on this?
any update?
any update lol?
Update?
any update?
SageMaker supports multiple timeout parameters to control model download, container startup, and the invoke call.
ContainerStartupHealthCheckTimeoutInSeconds - controls the container startup health check timeout.
ModelDataDownloadTimeoutInSeconds - controls the model data download timeout.
(More details can be found in: https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-hosting.html)
And the model servers inside the containers provide control over the invoke call. For example, the SAGEMAKER_TS_RESPONSE_TIMEOUT and SAGEMAKER_MODEL_SERVER_TIMEOUT environment variables can help in controlling the invoke timeout (see the sketch below).
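For illustration, an untested boto3 sketch of where those knobs live (all resource names and values are placeholders; whether the environment variables are honored depends on the serving container):

import boto3

sm = boto3.client('sagemaker')

# The server-side invoke timeout is typically an environment variable on the model;
# which variable applies depends on the serving stack (TorchServe, MMS, ...).
sm.create_model(
    ModelName='my-model',                              # placeholder
    ExecutionRoleArn='arn:aws:iam::123456789012:role/SageMakerRole',
    PrimaryContainer={
        'Image': '<serving-container-image-uri>',      # placeholder
        'ModelDataUrl': 's3://my-bucket/model.tar.gz',
        'Environment': {
            'SAGEMAKER_TS_RESPONSE_TIMEOUT': '600',
            'SAGEMAKER_MODEL_SERVER_TIMEOUT': '600',
        },
    },
)

# The startup/download timeouts are ProductionVariant fields on the endpoint config.
sm.create_endpoint_config(
    EndpointConfigName='my-endpoint-config',           # placeholder
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InstanceType': 'ml.g5.xlarge',
        'InitialInstanceCount': 1,
        'ContainerStartupHealthCheckTimeoutInSeconds': 600,
        'ModelDataDownloadTimeoutInSeconds': 1200,
    }],
)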
Please reopen if you still have an issue with the timeout.
I've only gotten the invoke timeouts to work after contacting AWS Support and having them configure those. I'll note they initially didn't want to do it either, so it's not quite that simple, unfortunately.
No argument that you may want to close this ticket, but I'd kindly suggest this is still an improvement that would be really, really nice to have.
Yeah, I would re-open this. I feel like the person who closed this is misunderstanding the problem.
Reopening the issue as requested. Can you detail the improvement/expectation?
The issue is described pretty well in the original post. I do not have anything to add: https://github.com/aws/sagemaker-python-sdk/issues/1119#issue-521175736
I solved the problem using asynchronous inference: https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html. You don't need to change any part of the Docker image or code; you only have to configure the endpoint to work in async mode.
In any case, it would be interesting to be able to modify the timeout of real-time inference, even if it is limited to a maximum, but in the meantime asynchronous inference can be the solution.
Even async inference doesn't work when I try to stream the response from an LLM. As soon as the response time hits 60 s, it disconnects. I have tried every possible solution, but unfortunately the timeout error still exists.
Asynchronous inference doesn't work for me unfortunately; I'm still getting the timeout error.
@hz-nm I was also able to solve it by creating an async endpoint. I initially had trouble, and I think I first tried the same thing you did. There are actually two ways of doing this asynchronously, and only one worked for me.

The first is boto3.client("sagemaker-runtime").invoke_endpoint_with_response_stream. This didn't work for me: EventStreamError: An error occurred (ModelStreamError) when calling the InvokeEndpointWithResponseStream operation: Your model primary did not complete sending the inference response in the allotted time.

Notes on creating an asynchronous endpoint: pass async_config to the .deploy call params. I also had the following in the env param dictionary of the HuggingFaceModel() call. These were from an old attempt to increase the timeout; I have no idea if they were necessary at all, but I know they didn't work alone. One of them may be an LLM-generated solution and never have been valid.
"SAGEMAKER_TS_RESPONSE_TIMEOUT": json.dumps(599),
"SAGEMAKER_MODEL_SERVER_TIMEOUT": json.dumps(599),
"SAGEMAKER_MODEL_INFERENCE_TIMEOUT": json.dumps(599),
Tag me if I can help further!
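For anyone following along, a rough, untested sketch of that setup; model data, role, framework versions, and bucket paths are placeholders, and the deploy keyword is async_inference_config to the best of my knowledge:

import json
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference import AsyncInferenceConfig

model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",                       # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",            # placeholder
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={
        # The env vars quoted above; possibly unnecessary, kept for completeness.
        "SAGEMAKER_TS_RESPONSE_TIMEOUT": json.dumps(599),
        "SAGEMAKER_MODEL_SERVER_TIMEOUT": json.dumps(599),
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=AsyncInferenceConfig(output_path="s3://my-bucket/async-out/"),
)

resp = predictor.predict_async(input_path="s3://my-bucket/async-in/request.json")
print(resp.output_path)   # the result should eventually appear here in S3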
Hi @mohanasudhan @DocIsenberg @joelachance @lminer @ajaykarpur, I have deployed my Docker image to ECR, created a model, and configured an endpoint with 6 GB of RAM for serverless inference in SageMaker. The endpoint works fine with shorter video inputs (e.g., 10 seconds), but when I send a 3-minute-long video, I encounter a 500 error.
Initially, I suspected it was a memory issue, but upon checking, the memory utilization is only around 25%. I'm wondering if the error is related to a timeout issue.
Is this likely due to a timeout limit for serverless inference? If yes, is there a way to increase the timeout limit for serverless inference? Would increasing the timeout in my Docker container’s server file where I am using gunicorn help resolve this issue? Any guidance on resolving this would be appreciated!
Thank you.
Unbelievable; no wonder I resent using AWS every time I need to implement something new... I spent two days trying to figure out what was going on with my code, which, even using GPUs, takes like 10 min to run. It all runs fine in the background; it's only the client that shuts down after a minute. Absolutely ridiculous and frustrating! I will try async...
It does not make sense: why does AWS Lambda have a 900 s timeout while a SageMaker endpoint has only 60 s?
Async inference endpoints do not work with multi-model endpoints either; you cannot configure an endpoint to be both multi-model capable and asynchronous. I am trying to deploy a multi-model endpoint using the NVIDIA Triton Inference Server container, and when switching models the request disconnects after 60 seconds as well.
Same here, with multiple experiments using a SageMaker endpoint with LMI with the vLLM backend: framework="djl-lmi", version="0.30.0".
1. SageMaker real-time invocation: when I use invoke_endpoint(), I get NO RESULTS within 60 seconds, then: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Connection reset by peer for the endpoint-qwen2-vl-2024-11-18-01-08-03 endpoint. Please retry."
2. SageMaker streaming: when I changed to invoke_endpoint_with_response_stream(), I get streamed token output within 60 seconds, then right after 60 seconds: An error occurred (ModelStreamError) when calling the InvokeEndpointWithResponseStream operation: Your model primary did not complete sending the inference response in the allotted time.
3. SageMaker asynchronous inference: I also implemented async inference using LMI, but got no results in S3, and the log shows "failureReason":"ClientError: The response from container primary did not specify the required Content-Length header".
It seems the 60-second limit is impacting long responses.
The current timeout for InvokeEndpoint is 60 seconds as specified here: https://docs.aws.amazon.com/en_pv/sagemaker/latest/dg/API_runtime_InvokeEndpoint.html
Is there any way we can increase this limit, to say 120 seconds?
EDIT:
Just to be clear, I was able to keep the process on the server running by passing an environment variable in the Model definition, like so:
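(The original snippet is not preserved here; below is a minimal reconstruction of what such a Model definition might look like, assuming the SAGEMAKER_MODEL_SERVER_TIMEOUT variable mentioned elsewhere in this thread. Image URI, model data, role, and endpoint name are placeholders.)

# Assumed reconstruction, not the original snippet: raise the model server's own
# timeout via the Model env so the container keeps processing past 60 seconds.
from sagemaker.model import Model

model = Model(
    image_uri="<inference-image-uri>",               # placeholder
    model_data="s3://my-bucket/model.tar.gz",        # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    env={"SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600"},  # server-side timeout, in seconds
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    endpoint_name="pose-estimation",
)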
Through CloudWatch, I was able to confirm that the task is still running even after 60 seconds. (For my use case, I am processing a video frame by frame.) My question is, however: on the client side I am receiving this kind of error due to the timeout.