aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0
10.14k stars 6.78k forks

connect() to unix:/tmp/gunicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream #2960

Open Julia90 opened 3 years ago

Julia90 commented 3 years ago

**Link to the notebook** https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-inference.ipynb

**Describe the bug** My goal is to follow the SageMaker example notebooks to train a Mask R-CNN model, deploy it, and test it on an image.

I trained a Mask R-CNN model following the training notebook (mask-rcnn-s3.ipynb). To save training time, I reduced the COCO training set in the S3 bucket to a single category (bear), which is about 1000 images. The training job reached Completed status in the AWS console (Training > Training jobs).

I then ran the notebook mask-rcnn-inference.ipynb with s3_model_url set to the trained model above. At the call `ep = sagemaker_session.create_endpoint(endpoint_name=endpoint_name, config_name=endpoint_config_name, wait=True)` I got this error:

    UnexpectedStatusException                 Traceback (most recent call last)
    in
          1 ep = sagemaker_session.create_endpoint(
    ----> 2     endpoint_name=endpoint_name, config_name=endpoint_config_name, wait=True
          3 )
          4 print(ep)

    ~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
       2428             )
       2429         if wait:
    -> 2430             self.wait_for_endpoint(endpoint_name)
       2431         return endpoint_name
       2432

    ~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
       2697                 ),
       2698                 allowed_statuses=["InService"],
    -> 2699                 actual_status=status,
       2700             )
       2701         return desc

    UnexpectedStatusException: Error hosting endpoint mask-rcnn-model-COCObear-endpoint: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

**To reproduce**

1. Train a Mask R-CNN model following this [link](https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-s3.ipynb).
2. Deploy the model for inference following this [link](https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-inference.ipynb).

**Logs**

Following is the log from CloudWatch:

    ...
    2021-10-01T14:56:09.694-05:00 [1001 19:56:09 @registry.py:135] maskrcnn output: [None, 80, 28, 28]
    2021-10-01T14:56:09.694-05:00 [1001 19:56:09 @collection.py:147] New collections created in tower : tf.GraphKeys.MODEL_VARIABLES
    2021-10-01T14:56:10.694-05:00 WARNING:tensorflow:From /mask-rcnn-tensorflow/tensorpack/tfutils/sessinit.py:122: The name tf.train.NewCheckpointReader is deprecated. Please use tf.compat.v1.train.NewCheckpointReader instead.

**(I think the error log starts from here)**

    2021-10-01T14:56:10.694-05:00 2021/10/01 19:56:10 [error] 10#10: *259 connect() to unix:/tmp/gunicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 169.254.178.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "169.254.180.2:8080"
    2021-10-01T14:56:15.697-05:00 169.254.178.2 - - [01/Oct/2021:19:56:10 +0000] "GET /ping HTTP/1.1" 502 182 "-" "AHC/2.0"
    2021-10-01T14:56:15.697-05:00 2021/10/01 19:56:15 [error] 10#10: *259 connect() to unix:/tmp/gunicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 169.254.178.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "169.254.180.2:8080"
    2021-10-01T14:56:20.699-05:00 169.254.178.2 - - [01/Oct/2021:19:56:15 +0000] "GET /ping HTTP/1.1" 502 182 "-" "AHC/2.0"
    ...
(The omitted log lines repeat the errors above.)

**My research on this issue:** According to this [link](https://forums.aws.amazon.com/thread.jspa?messageID=901674), this error appears to be caused by a limited number of available sockets, but I don't know how to change the number of sockets. Also, since I'm still following the example notebook with only some edits to the dataset, I didn't expect to have to dig into such a low-level detail. Maybe this is the solution; I don't know. I've been stuck on this for two weeks. Any help would be great! Many thanks!
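
On the "limited available sockets" point: errno 11 is EAGAIN, which on Linux a non-blocking connect() to a Unix domain socket returns when the listener's accept queue is already full. Below is a minimal sketch of that mechanism, assuming Linux. The socket path and the function name `connect_until_backlog_full` are hypothetical stand-ins, not anything from the notebook or the container, and this is not confirmed to be the cause of the failure reported here.

```python
import errno
import os
import socket
import tempfile


def connect_until_backlog_full(backlog=1, max_clients=64):
    """Connect non-blocking clients to a listener that never calls accept().

    Returns (number_of_successful_connects, errno_of_first_failure_or_None).
    """
    tmpdir = tempfile.mkdtemp()
    # Hypothetical path; it only stands in for /tmp/gunicorn.sock.
    path = os.path.join(tmpdir, "demo.sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    clients = []
    try:
        server.bind(path)
        # The server listens but never accepts, like a worker that is too
        # busy to pick up new connections.
        server.listen(backlog)
        for _ in range(max_clients):
            client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            client.setblocking(False)  # assumption: the proxy connects non-blocking
            clients.append(client)
            try:
                client.connect(path)
            except OSError as exc:
                # The accept queue is full; report how far we got and why.
                return len(clients) - 1, exc.errno
        return len(clients), None
    finally:
        for c in clients:
            c.close()
        server.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(tmpdir)


if __name__ == "__main__":
    connected, err = connect_until_backlog_full(backlog=1)
    print(connected, errno.errorcode.get(err), os.strerror(err) if err else None)
```

If the queue size really is the bottleneck, gunicorn's `backlog` setting controls it. A worker that never gets around to accepting connections (for example, while it is still loading the model) would produce the same symptom, so which of the two applies to this endpoint remains unconfirmed.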
Tilakraj0308 commented 8 months ago

Any updates on how to fix this issue?? I am also facing the same issue. Any help would be much appreciated.
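
For anyone comparing CloudWatch logs on the same error, here is a small sketch that tallies the HTTP status codes of the `/ping` health checks in the nginx access-log lines. The helper name `ping_status_counts` and the regular expression are illustrative assumptions; the sample lines are copied from the log excerpt in the original report.

```python
import re
from collections import Counter

# Matches the request and status fields of an nginx access-log line, e.g.
# "GET /ping HTTP/1.1" 502 182
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) '
)


def ping_status_counts(lines):
    """Count HTTP status codes of /ping requests found in access-log lines."""
    counts = Counter()
    for line in lines:
        match = ACCESS_RE.search(line)
        if match and match.group("path") == "/ping":
            counts[int(match.group("status"))] += 1
    return counts


# Lines taken from the CloudWatch excerpt in this issue.
SAMPLE = [
    '2021/10/01 19:56:10 [error] 10#10: *259 connect() to unix:/tmp/gunicorn.sock '
    'failed (11: Resource temporarily unavailable) while connecting to upstream, '
    'client: 169.254.178.2, server: , request: "GET /ping HTTP/1.1", '
    'upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "169.254.180.2:8080"',
    '169.254.178.2 - - [01/Oct/2021:19:56:10 +0000] "GET /ping HTTP/1.1" 502 182 "-" "AHC/2.0"',
    '169.254.178.2 - - [01/Oct/2021:19:56:15 +0000] "GET /ping HTTP/1.1" 502 182 "-" "AHC/2.0"',
]

if __name__ == "__main__":
    print(dict(ping_status_counts(SAMPLE)))
```

On the excerpt above, every health check comes back as 502 and none as 200, which matches the "did not pass the ping health check" failure reason. A log where the pings eventually turn into 200s would instead point to a slow start rather than a server that never comes up.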