aws / sagemaker-pytorch-inference-toolkit

Toolkit for allowing inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Fixed handler service to allow running custom user modules in multi-model mode #73

Closed giuseppeporcelli closed 4 years ago

giuseppeporcelli commented 4 years ago

Issue #, if available:

Description of changes: I have fixed the handler service to add the 'code' dir (where user modules are stored) to the Python path. This is required for importing custom user modules when the container is used in multi-model mode.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

ajaykarpur commented 4 years ago

@giuseppeporcelli Retried the sagemaker-pytorch-inference build and it looks like the same test timed out again:

=================================== FAILURES ===================================
________________________________ test_mnist_cpu ________________________________

sagemaker_session = <sagemaker.session.Session object at 0x7f4b80346320>
image_uri = '142577830533.dkr.ecr.us-west-2.amazonaws.com/sagemaker-test:1.4.0-pytorch-sagemaker-pytorch-inference-04131d15-2e47-4fe3-83da-1bf0e5551b62'
instance_type = 'ml.c4.xlarge'

    @pytest.mark.cpu_test
    def test_mnist_cpu(sagemaker_session, image_uri, instance_type):
        instance_type = instance_type or 'ml.c4.xlarge'
>       _test_mnist_distributed(sagemaker_session, image_uri, instance_type, model_cpu_tar, mnist_cpu_script)

test-toolkit/integration/sagemaker/test_mnist.py:28: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test-toolkit/integration/sagemaker/test_mnist.py:65: in _test_mnist_distributed
    endpoint_name=endpoint_name)
.tox/py36/lib/python3.6/site-packages/sagemaker/model.py:515: in deploy
    data_capture_config_dict=data_capture_config_dict,
.tox/py36/lib/python3.6/site-packages/sagemaker/session.py:2872: in endpoint_from_production_variants
    return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
.tox/py36/lib/python3.6/site-packages/sagemaker/session.py:2404: in create_endpoint
    self.wait_for_endpoint(endpoint_name)
.tox/py36/lib/python3.6/site-packages/sagemaker/session.py:2651: in wait_for_endpoint
    desc = _wait_until(lambda: _deploy_done(self.sagemaker_client, endpoint), poll)
.tox/py36/lib/python3.6/site-packages/sagemaker/session.py:3602: in _wait_until
    time.sleep(poll)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

signum = 14, frame = <frame object at 0x7f4b7b551238>

    def handler(signum, frame):
>       raise TimeoutError('timed out after {} seconds'.format(limit))
E       integration.sagemaker.timeout.TimeoutError: timed out after 1800 seconds

test-toolkit/integration/sagemaker/timeout.py:44: TimeoutError
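The TimeoutError at the bottom of the trace comes from a signal-based watchdog in the test harness; a minimal sketch of that pattern (an illustration, not the toolkit's actual timeout.py) looks like:

```python
import signal
from contextlib import contextmanager

@contextmanager
def timeout(seconds):
    # Arm SIGALRM so a hung deployment raises instead of blocking
    # forever; the failing integration test uses a 1800-second limit.
    def handler(signum, frame):
        raise TimeoutError('timed out after {} seconds'.format(seconds))
    old = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)
```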
giuseppeporcelli commented 4 years ago

I'm not able to reproduce the issue locally. Could I get access to the logs of the endpoint being created, to see why the deployment is failing? Thanks.
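For reference, the endpoint's container logs land in CloudWatch; a hedged sketch of pulling them with boto3 follows (function names here are illustrative, and the calls need AWS credentials with CloudWatch Logs access):

```python
def endpoint_log_group(endpoint_name):
    # SageMaker containers write endpoint logs to this CloudWatch log group.
    return "/aws/sagemaker/Endpoints/{}".format(endpoint_name)

def print_endpoint_logs(endpoint_name, region_name="us-west-2"):
    import boto3  # requires AWS credentials with CloudWatch Logs access
    logs = boto3.client("logs", region_name=region_name)
    group = endpoint_log_group(endpoint_name)
    for stream in logs.describe_log_streams(logGroupName=group)["logStreams"]:
        events = logs.get_log_events(
            logGroupName=group, logStreamName=stream["logStreamName"]
        )
        for event in events["events"]:
            print(event["message"])
```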