Closed shakedel closed 1 year ago
Some things I forgot to mention: I am using a BYOC (bring-your-own-container) image, with Docker Engine version 24.0.2 and docker-compose version 1.25.0.
Hi @shakedel , which sample in this GitHub repo are you trying to run?
@eitansela now that you ask, I realize it is not from this repo but from another one owned by AWS. Here is the notebook I copied: https://github.com/aws/amazon-sagemaker-examples/blob/09f6fad6de75a4520f6f71d661f4b7a8139ce736/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb (scroll down to the section named "SageMaker Python SDK Local Training").
But I don't think it matters much as my code is fairly simple and straightforward:
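    from sagemaker.estimator import Estimator

    estimator = Estimator(
        role='my_role',
        instance_count=1,
        instance_type="local_gpu",  # local mode with GPU support
        image_uri="xxxxxxxxxxxx.dkr.ecr.eu-central-1.amazonaws.com/my_ecr:latest",
    )
    # Each channel points at a local directory via a file:// URI.
    estimator.fit({
        'channel1': "file:///path/to/channel1/data",
        'channel2': "file:///path/to/channel2/data",
    })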
The contents of the Docker image are irrelevant, since the failure occurs while the generated docker-compose.yaml is being validated; execution never reaches the stage where the image is run.
Hi @shakedel, please open an issue in the amazon-sagemaker-examples repo. Closing the issue here.
For future reference, I think I found the cause of this error. In my Dockerfile I used the SAGEMAKER_PROGRAM env var rather than providing train and serve scripts. It seems that local mode uses the train script.
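For anyone hitting the same thing, here is a minimal sketch of what a train entry point could look like if you want to keep SAGEMAKER_PROGRAM as the single source of truth. It is only an illustration, not what the SDK or the AWS containers do: the /opt/ml/code default, the CODE_DIR override, and the function names are my own assumptions.

```python
"""Hypothetical `train` entry point for a BYOC image (illustrative sketch only)."""
import os
import posixpath
import subprocess
import sys

# Assumption: the training code was copied to this directory inside the image.
DEFAULT_CODE_DIR = "/opt/ml/code"


def build_command(env):
    """Build the command that runs the script named by SAGEMAKER_PROGRAM."""
    program = env.get("SAGEMAKER_PROGRAM")
    if not program:
        raise RuntimeError("SAGEMAKER_PROGRAM is not set")
    # CODE_DIR is a made-up override for this sketch, not a SageMaker variable.
    code_dir = env.get("CODE_DIR", DEFAULT_CODE_DIR)
    # Container paths are POSIX paths, so join them with posixpath.
    return [sys.executable, posixpath.join(code_dir, program)]


def main():
    # Exit with the training script's own return code so failures propagate.
    sys.exit(subprocess.call(build_command(os.environ)))
```

To use something like this, the file would be saved as an executable named train on the image's PATH, with a call to main() at the bottom.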
When I try to call estimator.fit() with a local_gpu instance type, I get the following error:
The generated YAML is:
I am new to Docker Compose, but after reading a bit I think the problem is that deploy was introduced in v3 of the Compose file format, whereas the generated YAML declares version 2.3. I also don't know how this interacts with the fact that I had to install v1 (docker-compose rather than docker compose) for the execution to reach this failure. I saw no environment variable I could set to change either:
docker-compose command: https://github.com/aws/sagemaker-python-sdk/blob/052d7e00f236b19260b584ee7d63e2141e5053fb/src/sagemaker/local/image.py#L718C15-L718C15
P.S. Here is where the deploy section comes from: https://github.com/aws/sagemaker-python-sdk/blob/052d7e00f236b19260b584ee7d63e2141e5053fb/src/sagemaker/local/image.py#L769-L772
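To make the suspected mismatch concrete, here is a rough sketch that flags services carrying a deploy section in a file declaring a 2.x version. It assumes my reading above is right that a 2.x file cannot carry deploy, which I have not confirmed. The example dict is a hand-written stand-in, not the YAML the SDK actually generated, and the service name and function names are made up.

```python
def services_using_deploy(compose):
    """Return the names of services that define a `deploy` section."""
    services = compose.get("services") or {}
    return sorted(
        name for name, spec in services.items() if "deploy" in (spec or {})
    )


def deploy_conflicts(compose):
    """Return services with `deploy` when the file declares a 2.x version."""
    version = str(compose.get("version", ""))
    major = version.split(".")[0]
    if major != "2":
        return []
    return services_using_deploy(compose)


# Hypothetical stand-in for the generated file; the service name is made up.
example = {
    "version": "2.3",
    "services": {
        "algo-1": {
            "image": "my_ecr:latest",
            "command": "train",
            "deploy": {
                "resources": {
                    "reservations": {"devices": [{"capabilities": ["gpu"]}]}
                }
            },
        }
    },
}

print(deploy_conflicts(example))  # prints ['algo-1']
```

The check works on a plain dict, so the generated docker-compose.yaml would first need to be loaded with whatever YAML parser you have available.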