aws-samples / host-yolov8-on-sagemaker-endpoint

MIT No Attribution
35 stars · 24 forks

Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again. #21

Open arjunanand13 opened 1 month ago

arjunanand13 commented 1 month ago


```
---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
Cell In[5], line 12
     10 resized_image = cv2.resize(orig_image, (model_height, model_width))
     11 payload = cv2.imencode('.jpg', resized_image)[1].tobytes()
---> 12 result = predictor.predict(payload)
     14 infer_end_time = time.time()
     16 print(f"Inference Time = {infer_end_time - infer_start_time:0.4f} seconds")

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/base_predictor.py:212, in Predictor.predict(self, data, initial_args, target_model, target_variant, inference_id, custom_attributes, component_name)
    209 if inference_component_name:
    210     request_args["InferenceComponentName"] = inference_component_name
--> 212 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    213 return self._handle_response(response)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/botocore/client.py:565, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    561     raise TypeError(
    562         f"{py_operation_name}() only accepts keyword arguments."
    563     )
    564 # The "self" in this scope is referring to the BaseClient.
--> 565 return self._make_api_call(operation_name, kwargs)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/botocore/client.py:1021, in BaseClient._make_api_call(self, operation_name, api_params)
   1017     error_code = error_info.get("QueryErrorCode") or error_info.get(
   1018         "Code"
   1019     )
   1020     error_class = self.exceptions.from_code(error_code)
-> 1021     raise error_class(parsed_response, operation_name)
   1022 else:
   1023     return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://eu-north-1.console.aws.amazon.com/cloudwatch/home?region=eu-north-1#logEventViewer:group=/aws/sagemaker/Endpoints/yolov8-pytorch-2024-07-18-05-43-42-493469 in account 0819869784 for more information.
```

ArtemChemist commented 1 month ago

Same here. I fixed it by changing the structure of the tar archive. In the file `1_DeployEndpoint.ipynb`:

1. Comment out this line: `os.system(f'mv {model_name} code/.')`
2. Modify the following line to: `bashCommand = f"tar -cpzf model.tar.gz {model_name} code/"`

That solved the issue for me.
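The two steps above amount to keeping the model weights at the root of `model.tar.gz` instead of under `code/`. A minimal, self-contained sketch of the fixed packaging (file names are illustrative; in the notebook, `model_name` is set in an earlier cell):

```python
import os
import subprocess

model_name = "yolov8l.pt"  # illustrative; defined earlier in the notebook

# Create a demo layout so this sketch runs standalone.
os.makedirs("code", exist_ok=True)
for path in (model_name, "code/inference.py", "code/requirements.txt"):
    if not os.path.exists(path):
        open(path, "a").close()

# Step 1: do NOT move the weights into code/ (the original line, commented out):
# os.system(f'mv {model_name} code/.')

# Step 2: package the weights at the archive root alongside code/:
bashCommand = f"tar -cpzf model.tar.gz {model_name} code/"
subprocess.run(bashCommand.split(), check=True)
```

The resulting archive then contains `yolov8l.pt` at the top level and the inference handler under `code/`, which is the layout the PyTorch serving container expects.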

ArtemChemist commented 1 month ago

Strangely enough, there was already a discussion of this issue and there is a pull request that should have addressed this...

arjunanand13 commented 1 week ago

Thank you very much, this resolved the issue.