aws-samples / amazon-sagemaker-local-mode

Amazon SageMaker Local Mode Examples

local-gpu docker-compose version incompatibility #36

Closed. shakedel closed this issue 1 year ago.

shakedel commented 1 year ago

When I try to estimator.fit() with a local_gpu instance type, I get the following error:

The Compose file '/tmp/tmpemtqjw1x/docker-compose.yaml' is invalid because: Unsupported config option for services.algo-1-x260e: 'deploy'

The generated YAML is:

networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-lrlsg:
    command: train
    container_name: anj25ec7e5-algo-1-lrlsg
    deploy:
      resources:
        reservations:
          devices:
          - capabilities:
            - gpu
    environment:
    - AWS_REGION=eu-central-1
    - TRAINING_JOB_NAME=my_ecr-2023-07-03-21-51-01-644
    image: xxxxxxxxxxxx.dkr.ecr.eu-central-1.amazonaws.com/my_ecr:latest
    networks:
      sagemaker-local:
        aliases:
        - algo-1-lrlsg
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmp56de_vqt/algo-1-lrlsg/output:/opt/ml/output
    - /tmp/tmp56de_vqt/algo-1-lrlsg/output/data:/opt/ml/output/data
    - /tmp/tmp56de_vqt/algo-1-lrlsg/input:/opt/ml/input
    - /tmp/tmp56de_vqt/model:/opt/ml/model
    - /home/ubuntu/work/feuerkrieg_division:/opt/ml/input/data/instance
version: '2.3'

I am new to Docker Compose, but after reading a bit I think the problem is that deploy was introduced in Compose file format v3, while the generated YAML declares version 2.3. I also don't know how this interacts with the fact that I had to install Compose v1 (docker-compose rather than docker compose) for the execution to even reach this failure. I saw no environment variable I could set to change either.

P.S. Here is where the deploy key comes from: https://github.com/aws/sagemaker-python-sdk/blob/052d7e00f236b19260b584ee7d63e2141e5053fb/src/sagemaker/local/image.py#L769-L772
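For context, here is a paraphrased Python sketch of what those lines appear to do, judging by the deploy section in the generated YAML above; it is not copied from the SDK, and the helper name add_gpu_reservation is made up for illustration:

def add_gpu_reservation(host_config: dict, instance_type: str) -> dict:
    # Illustrative paraphrase: for "local_gpu", inject a Compose v3-style
    # "deploy" block that reserves GPU devices for the service, mirroring
    # the "deploy" section in the generated compose file above.
    if instance_type == "local_gpu":
        host_config["deploy"] = {
            "resources": {
                "reservations": {"devices": [{"capabilities": ["gpu"]}]}
            }
        }
    return host_config

# Reproduces the "deploy" section seen in the compose file above.
print(add_gpu_reservation({}, "local_gpu"))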

shakedel commented 1 year ago

Some things I forgot to mention: I am using a BYOC, with Docker Engine 24.0.2 and docker-compose 1.25.0.

eitansela commented 1 year ago

Hi @shakedel , which sample in this GitHub repo are you trying to run?

shakedel commented 1 year ago

@eitansela Now that you ask, I realize it is not from this repo but from another one owned by AWS. Here is the notebook I copied: https://github.com/aws/amazon-sagemaker-examples/blob/09f6fad6de75a4520f6f71d661f4b7a8139ce736/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb (scroll down to the section named "SageMaker Python SDK Local Training").

But I don't think it matters much as my code is fairly simple and straightforward:

from sagemaker.estimator import Estimator

Estimator(
    role='my_role',
    instance_count=1,
    instance_type="local_gpu",
    image_uri="xxxxxxxxxxxx.dkr.ecr.eu-central-1.amazonaws.com/my_ecr:latest",
).fit({
    'channel1': "file:///path/to/channel1/data",
    'channel2': "file:///path/to/channel2/data"
})

The contents of the Docker image are irrelevant, since the failure occurs while validating the generated docker-compose.yaml and execution never reaches the container.

eitansela commented 1 year ago

Hi @shakedel , please open an issue in the amazon-sagemaker-examples repo. Closing the issue here.


shakedel commented 1 year ago

For future reference, I think I found the cause of this error. In my Dockerfile I used the SAGEMAKER_PROGRAM env var rather than providing train and serve scripts. It seems that local mode uses the train script.
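For anyone hitting the same thing, here is a minimal, illustrative sketch of an executable train entry point for a fully custom container (one without the SageMaker training toolkit, so SAGEMAKER_PROGRAM is not honored). The file name train, the paths, and the dummy artifact follow the standard SageMaker container layout and are shown only for illustration; they are not code from this issue:

#!/usr/bin/env python3
# Minimal illustrative "train" entry point. SageMaker (local mode included)
# starts a training container with the argument "train", so a fully custom
# image needs an executable named "train" on the PATH.
import json
import os

INPUT_DIR = "/opt/ml/input/data"                           # input channels are mounted here
MODEL_DIR = "/opt/ml/model"                                # artifacts written here are collected
HYPERPARAMS = "/opt/ml/input/config/hyperparameters.json"  # hyperparameters, if any


def main():
    params = {}
    if os.path.exists(HYPERPARAMS):
        with open(HYPERPARAMS) as f:
            params = json.load(f)
    channels = os.listdir(INPUT_DIR) if os.path.isdir(INPUT_DIR) else []
    print("channels:", channels, "hyperparameters:", params)
    # ... real training logic goes here ...
    os.makedirs(MODEL_DIR, exist_ok=True)
    with open(os.path.join(MODEL_DIR, "model.txt"), "w") as f:
        f.write("dummy artifact\n")


if __name__ == "__main__":
    main()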