**Describe the bug**
My goal is to follow the SageMaker example notebooks to train a Mask R-CNN model, deploy it, and test it on an image.
I trained a Mask R-CNN model following the link below.
(I reduced the COCO training set in the S3 bucket to a single category (bear), about 1,000 images, to save training time.)
The training job reached Completed status in the AWS console under Training - Training jobs.
I then ran the notebook mask-rcnn-inference.ipynb with the s3_model_url of the trained model above.
And then at

```python
ep = sagemaker_session.create_endpoint(
    endpoint_name=endpoint_name, config_name=endpoint_config_name, wait=True
)
```

I got this error:
```
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-...> in <module>
      1 ep = sagemaker_session.create_endpoint(
----> 2     endpoint_name=endpoint_name, config_name=endpoint_config_name, wait=True
      3 )
      4 print(ep)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
   2428         )
   2429         if wait:
-> 2430             self.wait_for_endpoint(endpoint_name)
   2431         return endpoint_name
   2432

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
   2697             ),
   2698             allowed_statuses=["InService"],
-> 2699             actual_status=status,
   2700         )
   2701         return desc

UnexpectedStatusException: Error hosting endpoint mask-rcnn-model-COCObear-endpoint: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
```
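For context on what the health check actually does: SageMaker hosting probes the container with `GET /ping` on port 8080 and expects HTTP 200 within the startup grace period. In this notebook's image, nginx answers on 8080 and proxies to gunicorn over `/tmp/gunicorn.sock`, so the 502 responses in the CloudWatch logs further down mean nginx was up but no gunicorn worker was accepting connections. A toy stdlib sketch of just the ping contract (a hypothetical handler for illustration, not the notebook's actual serving stack):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class PingHandler(BaseHTTPRequestHandler):
    # SageMaker treats any 200 on /ping as "healthy"; a non-200 response,
    # or a refused/failed connection (as in the logs), fails the check.
    def do_GET(self):
        if self.path == "/ping":
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), PingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status = urllib.request.urlopen(
    f"http://127.0.0.1:{server.server_port}/ping"
).status
print(status)  # 200
server.shutdown()
```

If the model takes longer to load than the grace period, or the worker process dies during load, this probe never sees a 200 and the endpoint is marked Failed.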
**To reproduce**
1. Train a Mask R-CNN model following this [link](https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-s3.ipynb)
2. Deploy the model for inference following this [link](https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-inference.ipynb)
**Logs**
Following is the log from CloudWatch:
...
```
2021-10-01T14:56:09.694-05:00 [1001 19:56:09 @registry.py:135] maskrcnn output: [None, 80, 28, 28]
2021-10-01T14:56:09.694-05:00 [1001 19:56:09 @collection.py:147] New collections created in tower : tf.GraphKeys.MODEL_VARIABLES
2021-10-01T14:56:10.694-05:00 WARNING:tensorflow:From /mask-rcnn-tensorflow/tensorpack/tfutils/sessinit.py:122: The name tf.train.NewCheckpointReader is deprecated. Please use tf.compat.v1.train.NewCheckpointReader instead.
```
**(I think the error log starts here)**
```
2021-10-01T14:56:10.694-05:00 2021/10/01 19:56:10 [error] 10#10: *259 connect() to unix:/tmp/gunicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 169.254.178.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "169.254.180.2:8080"
2021-10-01T14:56:15.697-05:00 169.254.178.2 - - [01/Oct/2021:19:56:10 +0000] "GET /ping HTTP/1.1" 502 182 "-" "AHC/2.0"
2021-10-01T14:56:15.697-05:00 2021/10/01 19:56:15 [error] 10#10: *259 connect() to unix:/tmp/gunicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 169.254.178.2, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "169.254.180.2:8080"
2021-10-01T14:56:20.699-05:00 169.254.178.2 - - [01/Oct/2021:19:56:15 +0000] "GET /ping HTTP/1.1" 502 182 "-" "AHC/2.0"
```
...
(the omitted log lines repeat the errors above)
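The key detail in those lines is `(11: Resource temporarily unavailable)`: errno 11 is EAGAIN on Linux, which nginx reports when its `connect()` to the Unix socket finds the accept queue full or no process listening yet. A small stdlib check of that reading (the log line below is abbreviated from the logs above):

```python
import errno
import re

log_line = (
    '2021/10/01 19:56:10 [error] 10#10: *259 connect() to unix:/tmp/gunicorn.sock '
    'failed (11: Resource temporarily unavailable) while connecting to upstream'
)

# Pull the errno out of nginx's "(NN: message)" failure annotation.
match = re.search(r'failed \((\d+): ([^)]+)\)', log_line)
code, message = int(match.group(1)), match.group(2)

print(code, errno.errorcode[code])  # 11 EAGAIN
```

So the question is why gunicorn never (or only intermittently) accepted connections on `/tmp/gunicorn.sock` during startup.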
**My research on this issue:**
According to this [link](https://forums.aws.amazon.com/thread.jspa?messageID=901674), this error can occur when the number of available sockets is limited, but I don't know how to change the socket count. Also, since I'm still following the example notebook with only some edits to the dataset, I wasn't expecting to have to dig into a low-level detail like this. But maybe that is the solution. I don't know.
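If the forum thread is right and the problem is gunicorn's capacity to accept connections on `/tmp/gunicorn.sock`, the relevant knobs are standard gunicorn settings (`bind`, `workers`, `backlog`, `timeout`). A hypothetical `gunicorn.conf.py` sketch follows; note that both the values and the existence of such a config file in the mask-rcnn-tensorflow serving image are assumptions — the image may hard-code its gunicorn invocation in its serve script, in which case these flags would have to be edited there and the container rebuilt:

```python
# gunicorn.conf.py -- illustrative values only, not the container's actual config
bind = "unix:/tmp/gunicorn.sock"  # the socket nginx proxies /ping and /invocations to
workers = 1      # a large Mask R-CNN model typically supports one worker per GPU
backlog = 2048   # pending-connection queue; too small a backlog can surface as EAGAIN (11)
timeout = 300    # give workers time to load the model before being killed and restarted
```

A short worker `timeout` is a common cause of this exact symptom: the worker is killed mid model-load, restarts, and nginx keeps getting EAGAIN while no worker is listening.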
I've been stuck on this for two weeks. Any help would be appreciated. Many thanks!
Link to the notebook: https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-inference.ipynb