allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.61k stars 651 forks

Trouble with aws_autoscaler.py #1098

Closed assiakhateeb closed 1 year ago

assiakhateeb commented 1 year ago

I'm having trouble with aws_autoscaler.py (https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py): the EC2 instances it creates should act as workers on a ClearML queue, but they don't. We have a self-hosted server running on Kubernetes.

Although aws_autoscaler.py successfully spins up the EC2 instances, the worker queue appears to be empty and I can't use the EC2 instances as workers.

I also ran the aws_autoscaler.py script against a ClearML server deployed on Ubuntu 22.04 and against the free hosted service at https://app.clear.ml/. I got similar behavior in both cases.

Any ideas on why this might be happening?

I have attached the logs and aws_autoscaler.yaml for reference.

configurations:
  extra_clearml_conf: '
    api {
        api_server: "http://x.x.x.x:8008"
        web_server: "http://x.x.x.x:8080"
        files_server: "http://x.x.x.x:8081"

        # Credentials are generated in the webapp, https://app.clear.ml/settings/workspace-configuration
        # Overridden with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
        credentials {"access_key": "AAAAAAAAAAAA", "secret_key": "AAAAAAAAAAAA"}

        # verify host ssl certificate, set to False only if you have a very good reason
        verify_certificate: "False"
    }  
  '
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    aws_autoscaler:
      - - m42xlarge
        - 1
  resource_configurations:
    m42xlarge:
      ami_id: ami-04e601abe3e1a910f
      availability_zone: eu-central-1a
      ebs_device_name: /dev/sda1
      ebs_volume_size: 100
      ebs_volume_type: gp3
      instance_type: m4.2xlarge
      is_spot: false
      key_name: assia-corractions-vm-key
      security_group_ids: null
hyper_params:
  cloud_credentials_key: AAAAAAAAAAAA
  cloud_credentials_region: eu-central-1
  cloud_credentials_secret: AAAAAAAAAAAA
  cloud_provider: ''
  default_docker_image: ubuntu:22.04
  git_pass: ''
  git_user: ''
  max_idle_time_min: 15
  max_spin_up_time_min: 30
  polling_interval_time_min: 5
  workers_prefix: dynamic_worker

task_ubuntu_22.04_55358c7831004d388678ff64a46daf79.log

jkhenning commented 1 year ago

Hi @assiakhateeb, first of all, make sure the AMIs you're using have Docker preinstalled.
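
If switching to an AMI that ships with Docker is not an option, one possible workaround is to install it at boot through the `extra_vm_bash_script` field, which is empty in the posted configuration. The sketch below is only an illustration under several assumptions: that the script runs as root on the new instance before the agent starts, that the AMI is Ubuntu-based (so `apt-get` and the `docker.io` package are available), and that the configuration has been loaded into a plain dict. The helper names `docker_bootstrap_script` and `with_docker_bootstrap` are hypothetical and not part of ClearML.

```python
def docker_bootstrap_script() -> str:
    """Return a bash snippet that installs Docker only if it is missing.

    Assumption: Ubuntu-based AMI, script executed as root at instance boot.
    """
    lines = [
        "# Hypothetical bootstrap: install Docker when the AMI lacks it",
        "if ! command -v docker >/dev/null 2>&1; then",
        "    apt-get update",
        "    apt-get install -y docker.io",
        "    systemctl enable --now docker",
        "fi",
        "docker --version",
    ]
    return "\n".join(lines)


def with_docker_bootstrap(configurations: dict) -> dict:
    """Return a copy of the 'configurations' section with the Docker
    bootstrap placed ahead of any existing extra_vm_bash_script content."""
    updated = dict(configurations)
    existing = (updated.get("extra_vm_bash_script") or "").strip()
    parts = [part for part in (docker_bootstrap_script(), existing) if part]
    updated["extra_vm_bash_script"] = "\n".join(parts)
    return updated
```

For example, `with_docker_bootstrap({"extra_vm_bash_script": ""})["extra_vm_bash_script"]` yields the bash snippet, which could then be pasted into aws_autoscaler.yaml. Baking Docker into the AMI is still likely to be the more reliable route, since it avoids a package install on every spin-up.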

assiakhateeb commented 1 year ago

@jkhenning Thanks for your suggestion. It really helped to solve the problem.