I'm having some trouble with aws_autoscaler.py (https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py) and its ability to create new EC2 instances that should function as workers in the ClearML queue. We have a self-hosted server running on Kubernetes.

Although aws_autoscaler.py successfully spins up the EC2 instances, the worker queue stays empty and I'm unable to use the EC2 instances as workers. I also ran the aws_autoscaler.py script against ClearML deployed on Ubuntu 22.04 and against the free hosted server at https://app.clear.ml/, and saw the same behavior in both cases.

Any ideas on why this might be happening?
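For reference, this is roughly how I double-check whether any workers registered at all (a minimal sketch using clearml's APIClient; it assumes the local clearml.conf points at the same server). The EC2 instances never appear in the worker list:

from clearml.backend_api.session.client import APIClient

client = APIClient()  # picks up credentials from clearml.conf / env vars

# Workers currently registered with the server; an EC2 instance that came
# up correctly should appear here with the queue(s) it listens to
for worker in client.workers.get_all():
    print("worker:", worker.id, "queues:", getattr(worker, "queues", None))

# Queues known to the server
for queue in client.queues.get_all():
    print("queue:", queue.name, queue.id)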
I have attached the logs and aws_autoscaler.yaml for reference.
configurations:
  extra_clearml_conf: |
    api {
        api_server: "http://x.x.x.x:8008"
        web_server: "http://x.x.x.x:8080"
        files_server: "http://x.x.x.x:8081"
        # Credentials are generated in the webapp, https://app.clear.ml/settings/workspace-configuration
        # Overridden with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
        credentials {"access_key": "AAAAAAAAAAAA", "secret_key": "AAAAAAAAAAAA"}
        # verify host ssl certificate, set to False only if you have a very good reason
        verify_certificate: "False"
    }
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    aws_autoscaler:
    - - m42xlarge
      - 1
  resource_configurations:
    m42xlarge:
      ami_id: ami-04e601abe3e1a910f
      availability_zone: eu-central-1a
      ebs_device_name: /dev/sda1
      ebs_volume_size: 100
      ebs_volume_type: gp3
      instance_type: m4.2xlarge
      is_spot: false
      key_name: assia-corractions-vm-key
      security_group_ids: null
hyper_params:
  cloud_credentials_key: AAAAAAAAAAAA
  cloud_credentials_region: eu-central-1
  cloud_credentials_secret: AAAAAAAAAAAA
  cloud_provider: ''
  default_docker_image: ubuntu:22.04
  git_pass: ''
  git_user: ''
  max_idle_time_min: 15
  max_spin_up_time_min: 30
  polling_interval_time_min: 5
  workers_prefix: dynamic_worker
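As a quick sanity check on the file itself, each queue entry should reference an existing resource configuration; this small sketch (assuming PyYAML, with the key layout as in the file above) passes on my config, so the queue/resource wiring at least looks consistent:

import yaml  # PyYAML

with open("aws_autoscaler.yaml") as f:
    conf = yaml.safe_load(f)

resources = conf["configurations"]["resource_configurations"]
queues = conf["configurations"]["queues"]

# each queue maps to a list of [resource_name, max_instances] pairs
for queue_name, entries in queues.items():
    for resource_name, max_instances in entries:
        assert resource_name in resources, (
            "queue %r references unknown resource %r" % (queue_name, resource_name))
        print(queue_name, "->", resource_name, "max:", max_instances)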
task_ubuntu_22.04_55358c7831004d388678ff64a46daf79.log
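One thing I still plan to rule out is basic connectivity from inside a spun-up instance to the API server, since the agent cannot register as a worker if it cannot reach it. A minimal reachability check (assuming requests is installed; as far as I know debug.ping is the ClearML API server's health endpoint):

import requests

API_SERVER = "http://x.x.x.x:8008"  # same address as in extra_clearml_conf above

# if this fails from inside the EC2 instance, clearml-agent can never
# register with the server, which would explain the empty worker queue
resp = requests.get(API_SERVER + "/debug.ping", timeout=5)
print(resp.status_code, resp.text)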