aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.09k stars 1.14k forks

SageMaker Local Mode: Error for network sagemaker-local #4659

Open bzheng06 opened 4 months ago

bzheng06 commented 4 months ago

Describe the bug
Running the ScriptProcessor in local mode on either a desktop or a notebook instance results in a RuntimeError when it reaches the 'docker compose up' command.

To reproduce
First run these commands:
docker network rm sagemaker-local
docker network create sagemaker-local

Then run the following code in local mode:


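import boto3
from sagemaker.local import LocalSession
from sagemaker.network import NetworkConfig
from sagemaker.processing import ScriptProcessor

sess = boto3.Session()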
sm = sess.client("sagemaker")
s3_resource = boto3.resource("s3")
sagemaker_session = LocalSession()
config = {
    "local": {
        "local_code": True,
    }
}
sagemaker_session.config = config
sagemaker_session.s3_resource = s3_resource
sagemaker_session.s3_client = sm

network_config = NetworkConfig(
    enable_network_isolation=False,
    security_group_ids=security_group_ids, # security group ids
    subnets=subnets, # subnets
    encrypt_inter_container_traffic=True,
)

script_processor = ScriptProcessor(
    command=["bash", "process.sh"],
    image_uri=image, # custom Docker image
    role=role,
    instance_count=1,
    instance_type="local",
    sagemaker_session=sagemaker_session,
    network_config=network_config,
).run(...)

As of now, the workaround is to run only 'docker network rm sagemaker-local' (without recreating the network) and then run the code in local mode.
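The workaround can be sketched in Python as a pre-step, assuming the intended command is `docker network rm sagemaker-local` (the network-removal form used in the reproduction steps). The helper name `remove_local_network` and its injectable `run` parameter are hypothetical and not part of the SageMaker SDK; this is a minimal sketch, not the fix proposed for the library.

```python
import subprocess
from typing import Callable

NETWORK_NAME = "sagemaker-local"  # network name taken from the issue title


def remove_local_network(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    name: str = NETWORK_NAME,
) -> bool:
    """Remove the Docker network so Docker Compose can create it itself.

    Returns True if `docker network rm` exited with status 0, and False
    otherwise (for example, because the network did not exist).
    """
    result = run(
        ["docker", "network", "rm", name],
        capture_output=True,
        text=True,
        check=False,  # a missing network is not treated as an error here
    )
    return result.returncode == 0
```

Calling `remove_local_network()` once before `ScriptProcessor(...).run(...)` mirrors the manual workaround; the `run` parameter exists only so the helper can be exercised without a Docker daemon.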

Expected behavior
The script processing job should run and produce its output properly.

Screenshots or logs
[two images attached in the original issue; not reproduced here]

mufaddal-rohawala commented 4 months ago

@bzheng06 Thanks for reaching out to SageMaker! Ideally, users should not need to delete or create the network for local mode. The error you are encountering appears to come from Docker Compose, and it seems to occur because the network was not created by Docker Compose itself.

The workaround you mentioned appears to be the expected path forward. Is there a use case that required you to delete and recreate the "sagemaker-local" network as a pre-step here?
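One way to check whether an existing network was created by Docker Compose is to look at its labels. The sketch below assumes that Docker Compose marks the networks it creates with the label `com.docker.compose.network`, and that the labels were obtained with `docker network inspect sagemaker-local --format '{{json .Labels}}'`. The function name `has_compose_label` is hypothetical, and the label check is an assumption about Compose behavior rather than something confirmed in this thread.

```python
import json

# Assumption: Docker Compose sets this label on networks it creates.
COMPOSE_NETWORK_LABEL = "com.docker.compose.network"


def has_compose_label(labels_json: str) -> bool:
    """Return True if the inspected network carries a non-empty Compose label.

    `labels_json` is the JSON text printed by
    `docker network inspect <name> --format '{{json .Labels}}'`.
    """
    # A network without labels may be printed as `{}` or `null`.
    labels = json.loads(labels_json) or {}
    return bool(labels.get(COMPOSE_NETWORK_LABEL))
```

A network created manually with `docker network create sagemaker-local` would be expected to print empty labels, so this check would return False for it under the stated assumption.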

bzheng06 commented 4 months ago

@mufaddal-rohawala I opened up a PR with the bug fix right here if you'd like to take a look: https://github.com/aws/sagemaker-python-sdk/pull/4699