aws / sagemaker-pytorch-inference-toolkit

Toolkit for allowing inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

ModuleNotFoundError: SageMaker only copies the entry_point file to /opt/ml/code/ instead of the whole cloned source code #156

Open celsofranssa opened 1 year ago

celsofranssa commented 1 year ago

I am using the SageMaker PyTorch Estimator with a custom Docker image stored in AWS ECR.

from sagemaker.pytorch.estimator import PyTorch

role = "arn:..."

estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="local", # ml.g4dn.2xlarge
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams
)
estimator.fit()

SageMaker correctly clones the sources from GitHub and checks out the specified branch.

The Bug: However, it copies only main.py to /opt/ml/code inside the container instead of the whole cloned source code, which causes ModuleNotFoundError: No module named 'source':

Traceback (most recent call last):
2y9byzwyxr-algo-1-reuoy  |   File "/opt/ml/code/main.py", line 15, in <module>
2y9byzwyxr-algo-1-reuoy  |     from source.helper.EvalHelper import EvalHelper
2y9byzwyxr-algo-1-reuoy  | ModuleNotFoundError: No module named 'source'

Logging the contents of /opt/ml/code shows only main.py:

print(f"Content: {os.listdir(os.getcwd())}")
['main.py']
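The check above can be made a little more explicit. The following is only a minimal sketch: the helper name describe_code_dir and the temporary directory standing in for /opt/ml/code are hypothetical and not part of SageMaker. It lists a directory and reports whether the `source` package that main.py imports is actually present.

```python
import os
import tempfile


def describe_code_dir(code_dir, package="source"):
    """Return (sorted directory entries, whether `package` exists as a subdirectory).

    Hypothetical helper for illustration; it is not part of any SageMaker API.
    """
    entries = sorted(os.listdir(code_dir))
    has_package = os.path.isdir(os.path.join(code_dir, package))
    return entries, has_package


if __name__ == "__main__":
    # A temporary directory stands in for /opt/ml/code (an assumption made so
    # the sketch can run outside the container). It holds only main.py,
    # mirroring the situation reported above.
    with tempfile.TemporaryDirectory() as code_dir:
        open(os.path.join(code_dir, "main.py"), "w").close()
        entries, has_package = describe_code_dir(code_dir)
        print(f"Content: {entries}")
        print(f"'source' package present: {has_package}")
```

Inside the container, the same helper could be called as describe_code_dir(os.getcwd()) at the top of main.py, before the failing import, to confirm which files were actually copied.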