amzn / amazon-ray

Staging area for ongoing enhancements to Ray focused on improving integration with AWS and other Amazon technologies.
Apache License 2.0

[autoscaler] Using `ami-0f92e9d2b63bc61a2` fails with error "ERROR: ray-1.2.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl is not a supported wheel on this platform." #7

Open jennakwon06 opened 3 years ago

jennakwon06 commented 3 years ago

Problem

I am using ami-0f92e9d2b63bc61a2, which is supposed to be the AMI for Linux - Python 3.7 - Ray 1.2.0.

I am using the YAML file below, where my Docker image 048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr is a custom image based on 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.3.1-cpu-py37-ubuntu18.04.

cluster_name: jkkwon_ray_test

min_workers: 10
max_workers: 100
upscaling_speed: 1.0

docker:
    image: "048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr"
    container_name: "miamiml_container"
    pull_before_run: True

idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d
    cache_stopped_nodes: False

auth:
    ssh_user: ubuntu
    ssh_private_key: miami_dev_dask_emr_key_pair.pem

head_node:
    InstanceType: r5n.24xlarge
    ImageId: ami-0f92e9d2b63bc61a2 # https://github.com/amzn/amazon-ray
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-02876545b671b57b0"
    ]
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100
    KeyName: "miami_dev_dask_emr_key_pair"

worker_nodes:
    InstanceType: r5n.24xlarge
    ImageId: ami-0f92e9d2b63bc61a2 # https://github.com/amzn/amazon-ray
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    InstanceMarketOptions:
        MarketType: spot
    KeyName: "miami_dev_dask_emr_key_pair"

file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"

initialization_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

The problem is that running ray up fails with the following message:


  [6/7] Running setup commands
    (0/2) echo 'export PATH="$HOME/anaco...
Shared connection to 10.0.0.34 closed.
    (1/2) pip install -U https://s3-us-w...
ERROR: ray-1.2.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl is not a supported wheel on this platform.
WARNING: You are using pip version 20.3.3; however, version 21.0.1 is available.
You should consider upgrading via the '/usr/local/bin/python3.7 -m pip install --upgrade pip' command.
Shared connection to 10.0.0.34 closed.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.
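
For context on the error above: pip rejects a wheel when the tags in its filename do not match the interpreter running pip. Here the wheel is tagged cp36 (CPython 3.6), while the pip warning in the log points at /usr/local/bin/python3.7. The sketch below is a simplified illustration of that mismatch, not pip's actual logic (pip compares the full set of Python, ABI, and platform tags). The function names are hypothetical and chosen only for this example.

```python
import sys


def wheel_python_tag(wheel_filename):
    """Return the Python tag (e.g. 'cp36') from a wheel filename."""
    if not wheel_filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % wheel_filename)
    # Wheel names follow: name-version[-build]-python-abi-platform.whl,
    # so the Python tag is always the third field from the end.
    stem = wheel_filename[: -len(".whl")]
    return stem.split("-")[-3]


def interpreter_tag(version_info=None):
    """Return the CPython tag for a (major, minor) version, e.g. 'cp37'."""
    if version_info is None:
        version_info = sys.version_info
    return "cp{}{}".format(version_info[0], version_info[1])


def wheel_matches_interpreter(wheel_filename, version_info=None):
    """Simplified check: does the wheel's Python tag equal the interpreter's?"""
    return wheel_python_tag(wheel_filename) == interpreter_tag(version_info)


wheel = "ray-1.2.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl"
print(wheel_python_tag(wheel))                   # cp36
print(wheel_matches_interpreter(wheel, (3, 7)))  # False
```

Running the check with (3, 6) instead of (3, 7) returns True, which is consistent with the same setup command succeeding on the host's Python 3.6 environment but failing inside a Python 3.7 container.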

When NOT using the Docker image, I am able to get the Ray cluster up and running. But when I log onto it with ray attach and open a Python console, I see the following:

Python 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
[1]+  Stopped                 python
ubuntu@ip-10-0-0-108:~$ python3
Python 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

I am wondering if the Ray wheel was mistakenly uploaded for Python 3.6 rather than Python 3.7?

Thanks!

pdames commented 3 years ago

@jennakwon06 - make sure to add the following line to your autoscaler config. It prevents the default setup_commands from default.yaml (which may differ depending on the version of Ray installed on the host running ray up) from being applied automatically and trying to install a Python 3.6 Ray wheel:

setup_commands: []
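
Since an omitted setup_commands key is what lets the defaults apply, it may help to check a config for the key before running ray up. The following is a minimal sketch under stated assumptions: it is a plain-text check rather than a real YAML parse, it only looks for a top-level (unindented) setup_commands key, and the function name is hypothetical.

```python
import re


def has_explicit_setup_commands(config_text):
    """Return True if the config text defines a top-level setup_commands key.

    Assumption: top-level keys start at column 0, as in the configs in this
    thread. Keys such as head_setup_commands do not count.
    """
    return re.search(r"^setup_commands\s*:", config_text, re.MULTILINE) is not None


# A config that only sets head/worker setup commands still inherits defaults.
missing = "head_setup_commands: []\nworker_setup_commands: []\n"
fixed = missing + "setup_commands: []\n"
print(has_explicit_setup_commands(missing))  # False
print(has_explicit_setup_commands(fixed))    # True
```
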

For example, I launched a cluster via ray up us-west-2-cp37-ray120-test.yaml from the same AMI using the following autoscaler config, and verified that the final result matched my expectations:

cluster_name: us-west-2-cp37-ray120-test 

max_workers: 1

provider:
  type: aws
  region: us-west-2
  availability_zone: us-west-2a

auth:
  ssh_user: ubuntu

head_node:
  InstanceType: r5n.xlarge
  ImageId: ami-0f92e9d2b63bc61a2
  SecurityGroupIds: 
    - sg-07f4b3353e442a2ce

worker_nodes:
  InstanceType: r5n.xlarge
  ImageId: ami-0f92e9d2b63bc61a2
  SecurityGroupIds: 
    - sg-07f4b3353e442a2ce

setup_commands: []

pdames$ ray attach us-west-2-cp37-ray120-test.yaml
ubuntu@ip-XXX-XX-XX-XXX:~$ pip show amzn-ray
Name: amzn-ray
Version: 1.2.0
Summary: Staging area for ongoing enhancements to Ray focused on improving its integration with AWS and other Amazon technologies.
Home-page: https://github.com/amzn/amazon-ray
Author: Amazon Ray Team
Author-email: amzn-ray-team@amazon.com
License: Apache 2.0
Location: /home/ubuntu/anaconda3/lib/python3.7/site-packages
Requires: numpy, jsonschema, aiohttp-cors, colorama, msgpack, redis, colorful, filelock, aiohttp, pyyaml, click, py-spy, grpcio, requests, opencensus, aioredis, prometheus-client, protobuf, gpustat
Required-by: 
ubuntu@ip-XXX-XX-XX-XXX:~$ python
Python 3.7.7 (default, Mar 26 2020, 15:48:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

jennakwon06 commented 3 years ago

I see. Sounds good. Thanks! It sounds like this could be a documentation improvement about the behavior of empty fields. I will leave this open until we improve that documentation.