determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0
3.04k stars 356 forks

Agent instances starting / stopping 🐛[bug] #7977

Closed humbleearth closed 1 year ago

humbleearth commented 1 year ago

Describe the bug

Using the guide for aws ec2 based deployment for determined, I have deployed the infra. But when an experiment is submitted, it starts creating an instance, and then stops and starts another. Any guidance where the issue could be and how to debug it further?

Below is the log:

INFO[2023-09-23T07:34:12Z] trial changed from state PAUSED to ACTIVE actor-local-addr=7d9498f1-db51-4b6a-85c1-ccd59473d0dc actor-system=master experiment-id=2 go-type=trial job-id=886a0e7e-90af-45a3-b71a-332c3f6eb9ca task-id=2.7d9498f1-db51-4b6a-85c1-ccd59473d0dc task-type=TRIAL trial-id=1 trial-run-id=0
INFO[2023-09-23T07:34:12Z] decided to allocate trial actor-local-addr=7d9498f1-db51-4b6a-85c1-ccd59473d0dc actor-system=master experiment-id=2 go-type=trial job-id=886a0e7e-90af-45a3-b71a-332c3f6eb9ca task-id=2.7d9498f1-db51-4b6a-85c1-ccd59473d0dc task-type=TRIAL trial-id=1 trial-run-id=0
INFO[2023-09-23T07:34:12Z] resources are requested by Trial 1 (Experiment 2) (Allocation ID: 2.7d9498f1-db51-4b6a-85c1-ccd59473d0dc.1) actor-local-addr=aux-pool actor-system=master allocation-id=2.7d9498f1-db51-4b6a-85c1-ccd59473d0dc.1 go-type=resourcePool resource-pool=aux-pool restore=false restoring=false
INFO[2023-09-23T07:34:15Z] decided to launch 1 instances (type t2.xlarge) component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:34:16Z] launched 1/1 EC2 instances: i-0f37157814e71507b (Starting) aws-cluster=aux-pool
INFO[2023-09-23T07:34:21Z] found state changes in 1 instances: i-0f37157814e71507b (Starting) component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:34:31Z] found state changes in 0 instances: component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:34:31Z] decided to launch 1 instances (type t2.xlarge) component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:34:32Z] launched 1/1 EC2 instances: i-0ae6169fde8a210a7 (Starting) aws-cluster=aux-pool
INFO[2023-09-23T07:34:37Z] found state changes in 1 instances: i-0ae6169fde8a210a7 (Starting) component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:34:47Z] found state changes in 0 instances: component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:34:47Z] decided to launch 1 instances (type t2.xlarge) component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:34:48Z] launched 1/1 EC2 instances: i-08cfb027a7670922a (Starting) aws-cluster=aux-pool
INFO[2023-09-23T07:34:53Z] found state changes in 1 instances: i-08cfb027a7670922a (Starting) component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:35:04Z] found state changes in 0 instances: component=provisioner resource-pool=aux-pool
INFO[2023-09-23T07:35:04Z] decided to launch 1 instances (type t2.xlarge) component=provisioner resource-pool=aux-pool
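
The relaunch loop in this log can be spotted mechanically. The following is a minimal sketch, not part of Determined: it assumes the `launched N/M EC2 instances: <id> (Starting) aws-cluster=<pool>` line format seen above with one instance ID per line, and the function names and the default threshold of 1 (matching the `max_instances` value in the master configuration posted later in this thread) are choices made for illustration.

```python
import re
from collections import defaultdict

# Matches provisioner lines such as:
#   INFO[...] launched 1/1 EC2 instances: i-0f37157814e71507b (Starting) aws-cluster=aux-pool
# Assumption: one instance ID per "launched" line, as in the log above.
LAUNCH_RE = re.compile(
    r"INFO\[(?P<ts>[^\]]+)\]\s+launched \d+/\d+ EC2 instances: "
    r"(?P<instance>i-[0-9a-f]+) \(Starting\)\s+aws-cluster=(?P<pool>\S+)"
)


def launches_per_pool(log_text):
    """Return {pool: [(timestamp, instance_id), ...]} for every launch line."""
    launches = defaultdict(list)
    for match in LAUNCH_RE.finditer(log_text):
        launches[match["pool"]].append((match["ts"], match["instance"]))
    return dict(launches)


def churning_pools(log_text, max_instances=1):
    """Pools that launched more distinct instances than max_instances allows."""
    return sorted(
        pool
        for pool, events in launches_per_pool(log_text).items()
        if len({instance for _, instance in events}) > max_instances
    )


# Three launch lines copied from the log above.
SAMPLE = """\
INFO[2023-09-23T07:34:16Z] launched 1/1 EC2 instances: i-0f37157814e71507b (Starting) aws-cluster=aux-pool
INFO[2023-09-23T07:34:32Z] launched 1/1 EC2 instances: i-0ae6169fde8a210a7 (Starting) aws-cluster=aux-pool
INFO[2023-09-23T07:34:48Z] launched 1/1 EC2 instances: i-08cfb027a7670922a (Starting) aws-cluster=aux-pool
"""

if __name__ == "__main__":
    print(churning_pools(SAMPLE))
```

A pool showing up here only says that instances are being replaced faster than the pool should allow; the reason each instance died still has to be read from EC2 itself.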

Reproduction Steps

  1. deploy fashionmnist tutorial example from determined github
  2. check docker logs
  3. check ec2 board

Expected Behavior

agent instance should start and run the experiment.

Screenshot

not starting and stopping the instances.

Environment

Additional Context

No response

ioga commented 1 year ago

hello,

"the guide for aws ec2 based deployment for determined": can you please clarify which guide this is?

is this det deploy aws (if so, what are the parameters of the deployment), is this raw cloudformation template, or the very low level one?

Any guidance where the issue could be and how to debug it further?

humbleearth commented 1 year ago

det deploy aws does not work with govcloud. This is a raw cloudformation template taken from the govcloud.yaml sample templates. The instance shuts down 1s after starting, so there is no opportunity to ssh in and check logs. I will take a look at cloudwatch.

ioga commented 1 year ago

I'd suggest trying to launch an instance from the agent AMI by hand. once upon a time we had a govcloud user who reported similar symptoms. they've debugged it and the problem was:

it wasn't creating the EBS volume because the IAM Role was not in the KMS keys user list.

humbleearth commented 1 year ago

Another set of logs, from when I requested a JupyterLab: it kept starting instances one after another, each going to the shutdown state within moments.

INFO[2023-09-24T08:06:13Z] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-ai-stack-database-1.cluster-clmqsrouqcuk.us-gov-west-1.rds.amazonaws.com","port":"5432","name":"determined","ssl_mode":"verify-ca","ssl_root_cert":"/etc/determined/db_ssl_root_cert.pem"},"tensorboard_timeout":300,"notebook_timeout":null,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0,"RelatedUser":null},"tls":{"cert":"","key":""},"ssh":{"rsa_key_size":1024},"authz":{"type":"basic","fallback":"basic","rbac_ui_enabled":null,"_strict_ntsc_enabled":false,"workspace_creator_assign_role":{"enabled":true,"role_id":2},"strict_job_queue_control":false}},"checkpoint_storage":{"access_key":null,"bucket":"det-determined-ai-stack-us-gov-west-1-1","endpoint_url":null,"prefix":null,"save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"secret_key":null,"type":"s3"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null,"image":{"cpu":"determinedai/environments:py-3.8-pytorch-1.12-tf-2.8-cpu-9d07809","cuda":"determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.11-gpu-2b7e2a1"},"add_capabilities":null,"drop_capabilities":null,"devices":null,"bind_mounts":null,"work_dir":null,"slurm":{},"pbs":{}},"port":8080,"root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","otel_enabled":false,"otel_endpoint":"localhost:4317","segment_webui_key":"********"},"enable_cors":false,"launch_error":true,"cluster_name":"","logging":{"type":"default"},"observability":{"enable_prometheus":false},"cache":{"cache_dir":"/var/cache/determined"},"webhooks":{"base_url":"","signing_key":"9b4fb1578954"},"feature_switches":[],"resource_manager":{"client_ca":"","default_aux_resource_pool":"aux-pool","default_compute_resource_pool":"compute-pool","require_authentication":false,"scheduler":{"allow_heterogeneous_fits":false,"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"aux-pool","description":"","provider":{"agent_config_file_contents":null,"agent_docker_image":"determinedai/determined-agent:0.25.1","agent_docker_network":"default","agent_docker_runtime":"runc","agent_fluent_image":"","agent_reconnect_attempts":100,"agent_reconnect_backoff":5,"container_startup_script":"","cpu_slots_allowed":false,"custom_tags":null,"iam_instance_profile_arn":"arn:aws-us-gov:iam::1:instance-profile/service-determined-agent-instance-profile","image_id":"ami-0d26002529bfac823","instance_name":"determined-agent-determined-ai-stack","instance_type":"t2.xlarge","launch_error_retries":0,"launch_error_timeout":null,"log_group":"/determined/determined-ai-stack","log_stream":"determined-agent","master_cert_name":"","master_url":"http://local-ipv4:8080","max_agent_starting_period":"20m0s","max_idle_agent_period":"10m0s","max_instances":1,"min_instances":0,"network_interface":{"public_ip":false,"security_group_id":"sg-0ed46c91b7bdec058","subnet_id":"subnet-0d3750a916f4aac75"},"region":"","root_volume_size":200,"spot":false,"spot_max_price":"","ssh_key_name":"determined-key-pair","startup_script":"********","tag_key":"det-determined-ai-stack","tag_value":"det-agent-determined-ai-stack","type":"aws"},"max_aux_containers_per_agent":0,"task_container_defaults":null,"agent_reattach_enabled":false,"agent_reconnect_wait":"25s","kubernetes_namespace":""},{"pool_name":"compute-pool","description":"","provider":{"agent_config_file_contents":null,"agent_docker_image":"determinedai/determined-agent:0.25.1","agent_docker_network":"default","agent_docker_runtime":"runc","agent_fluent_image":"","agent_reconnect_attempts":100,"agent_reconnect_backoff":5,"container_startup_script":"","cpu_slots_allowed":true,"custom_tags":null,"iam_instance_profile_arn":"arn:aws-us-gov:iam::1:instance-profile/service-determined-agent-instance-profile","image_id":"ami-0d26002529bfac823","instance_name":"determined-agent-determined-ai-stack","instance_type":"m5.large","launch_error_retries":0,"launch_error_timeout":null,"log_group":"/determined/determined-ai-stack","log_stream":"determined-agent","master_cert_name":"","master_url":"http://local-ipv4:8080","max_agent_starting_period":"20m0s","max_idle_agent_period":"10m0s","max_instances":1,"min_instances":0,"network_interface":{"public_ip":false,"security_group_id":"sg-0ed46c91b7bdec058","subnet_id":"subnet-0d3750a916f4aac75"},"region":"","root_volume_size":200,"spot":false,"spot_max_price":"","ssh_key_name":"determined-key-pair","startup_script":"********","tag_key":"det-determined-ai-stack","tag_value":"det-agent-determined-ai-stack","type":"aws"},"max_aux_containers_per_agent":0,"task_container_defaults":null,"agent_reattach_enabled":false,"agent_reconnect_wait":"25s","kubernetes_namespace":""}],"__internal":{"audit_logging_enabled":false,"external_sessions":{"login_uri":"","logout_uri":"","jwt_key":""}}}
INFO[2023-09-24T08:06:13Z] Determined master 0.25.1 (built with go1.21.0) 
INFO[2023-09-24T08:06:13Z] connecting to database determined-ai-stack-database-hfotitw58u0y.cluster-clmqsrouqcuk.us-gov-west-1.rds.amazonaws.com:5432 
INFO[2023-09-24T08:06:14Z] running DB migrations from file:///usr/share/determined/master/static/migrations; this might take a while... 
INFO[2023-09-24T08:06:17Z] migrated from 0 to 20230817155036            
INFO[2023-09-24T08:06:17Z] DB migrations completed                      
INFO[2023-09-24T08:06:17Z] deleting all snapshots for terminal state experiments 
INFO[2023-09-24T08:06:17Z] Generating a new CA certificate and key      
INFO[2023-09-24T08:06:20Z] Saved certificate and key to DB              
INFO[2023-09-24T08:06:20Z] Generating a new certificate and key for master 
INFO[2023-09-24T08:06:24Z] Saved certificate and key to DB              
INFO[2023-09-24T08:06:24Z] creating resource pool: aux-pool              actor-local-addr=agentRM actor-system=master go-type=agentResourceManager
INFO[2023-09-24T08:06:24Z] pool aux-pool using global scheduling config  actor-local-addr=agentRM actor-system=master go-type=agentResourceManager
INFO[2023-09-24T08:06:24Z] creating resource pool: compute-pool          actor-local-addr=agentRM actor-system=master go-type=agentResourceManager
INFO[2023-09-24T08:06:24Z] pool compute-pool using global scheduling config  actor-local-addr=agentRM actor-system=master go-type=agentResourceManager
INFO[2023-09-24T08:06:24Z] found provisioner configuration               actor-local-addr=aux-pool actor-system=master go-type=resourcePool resource-pool=aux-pool
INFO[2023-09-24T08:06:24Z] connecting to AWS                             actor-local-addr=aux-pool actor-system=master go-type=resourcePool resource-pool=aux-pool
INFO[2023-09-24T08:06:24Z] found provisioner configuration               actor-local-addr=compute-pool actor-system=master go-type=resourcePool resource-pool=compute-pool
INFO[2023-09-24T08:06:24Z] connecting to AWS                             actor-local-addr=compute-pool actor-system=master go-type=resourcePool resource-pool=compute-pool
INFO[2023-09-24T08:06:24Z] scheduling next resource allocation aggregation in 15h54m36s at 2023-09-25 00:01:00 +0000 UTC  actor-local-addr=allocation-aggregator actor-system=master go-type=allocationAggregator
INFO[2023-09-24T08:06:24Z] telemetry reporting is enabled; run with --telemetry-enabled=false to disable  clusterID=67277a29-24cd-46f6-9723-d9ed3c05749e component=telemetry segmentKey=4ZQ38oSKl4tV5JSWkbjv6ziijHY1SrE7
INFO[2023-09-24T08:06:24Z] accepting incoming connections on port 8080  
INFO[2023-09-24T08:16:23Z] resources are requested by JupyterLab (immensely-epic-cow) (Allocation ID: f4bf7186-0b4c-4ba8-93f8-9a7111282e91.1)  actor-local-addr=compute-pool actor-system=master allocation-id=f4bf7186-0b4c-4ba8-93f8-9a7111282e91.1 go-type=resourcePool resource-pool=compute-pool restore=false restoring=false
INFO[2023-09-24T08:16:28Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:16:29Z] launched 1/1 EC2 instances: i-0319901a36a65c6ee (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:16:34Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:16:35Z] launched 1/1 EC2 instances: i-0ba85cc542f487b66 (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:16:40Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:16:41Z] launched 1/1 EC2 instances: i-0a586fb532ce423ea (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:16:46Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:16:47Z] launched 1/1 EC2 instances: i-06f5f3d662068295c (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:16:52Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:16:53Z] launched 1/1 EC2 instances: i-0b092f1cb259a6bdf (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:16:58Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:16:59Z] launched 1/1 EC2 instances: i-0c69b80a4ff5c7c18 (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:17:04Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:17:05Z] launched 1/1 EC2 instances: i-07a4e45af89859d2c (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:17:10Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:17:11Z] launched 1/1 EC2 instances: i-04b3ee4e9fb6a06c8 (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:17:16Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:17:17Z] launched 1/1 EC2 instances: i-06b45e6ac491a5d61 (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:17:22Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:17:23Z] launched 1/1 EC2 instances: i-038f1dffb664883cc (Starting)  aws-cluster=compute-pool
INFO[2023-09-24T08:17:28Z] decided to launch 1 instances (type m5.large)  component=provisioner resource-pool=compute-pool
INFO[2023-09-24T08:17:29Z] launched 1/1 EC2 instances: i-099eb4bf4df25c928 (Starting)  aws-cluster=compute-pool

humbleearth commented 1 year ago

Tried creating by hand and the ami starts. Also, I am not sure which ami image is for master and which for agent while searching for amis in community in us-gov-west-1 region. I am just using the amis listed here. The agent ami seems to be deprecated with "DeprecationTime": "2023-03-05T17:00:20.000Z". Do you have a list of amis for us-gov-west-1 region for both master and agent?

humbleearth commented 1 year ago

On checking the details of the agent instance that terminated, I found the error below:

State transition message
 Client.InternalError: Client error on launch

On further exploration, I found this link, which says the following:

Client.InternalError: Client error on launch - Ensure that you have the permissions required to access the AWS KMS keys used to decrypt and encrypt volumes. For more information, see [Using key policies in AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html) in the AWS Key Management Service Developer Guide.
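
The same state reason can be read without the console, since the EC2 DescribeInstances response carries a `StateReason` with `Code` and `Message` for each instance. Below is a rough sketch that pulls it out of a response-shaped dict. The sample response is hand-written for illustration (both instance IDs are placeholders, not instances from this deployment), and the helper names are hypothetical.

```python
def state_reasons(describe_instances_response):
    """Map instance ID -> (state name, reason code, reason message).

    Expects a dict shaped like the EC2 DescribeInstances response.
    Fields that are absent come back as None.
    """
    reasons = {}
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            reason = instance.get("StateReason", {})
            reasons[instance["InstanceId"]] = (
                instance.get("State", {}).get("Name"),
                reason.get("Code"),
                reason.get("Message"),
            )
    return reasons


def launch_failures(describe_instances_response, code_prefix="Client.InternalError"):
    """Instance IDs whose state reason code starts with code_prefix."""
    return sorted(
        instance_id
        for instance_id, (_, code, _) in state_reasons(describe_instances_response).items()
        if code is not None and code.startswith(code_prefix)
    )


# Hand-written sample, not captured output. Instance IDs are placeholders.
SAMPLE_RESPONSE = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0123456789abcdef0",
                    "State": {"Code": 48, "Name": "terminated"},
                    "StateReason": {
                        "Code": "Client.InternalError",
                        "Message": "Client.InternalError: Client error on launch",
                    },
                },
                {
                    "InstanceId": "i-0fedcba9876543210",
                    "State": {"Code": 16, "Name": "running"},
                },
            ]
        }
    ]
}

if __name__ == "__main__":
    for instance_id in launch_failures(SAMPLE_RESPONSE):
        print(instance_id, state_reasons(SAMPLE_RESPONSE)[instance_id])
```

With boto3 installed and credentials configured, the input would come from `boto3.client("ec2", region_name="us-gov-west-1").describe_instances(InstanceIds=[...])`. Terminated instances only stay visible to that call for a limited time, so it has to be run soon after the failure.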

What should I make of this given I am able to start an agent instance manually though and is this relevant?

I also see that starting determined cluster in personal aws environment, starts an agent instance without an encryption key. Is there a way to specify encryption and kms key in the configuration in cloudformation?

ioga commented 1 year ago

What should I make of this given I am able to start an agent instance manually though and is this relevant?

this seems to be in line with the KMS problem our user has encountered that I've mentioned earlier.

My guess would be that the IAM instance profile or role the cloudformation template created needs to be added to KMS or given access to it. I cannot provide specific steps for this as we don't have a ton of experience with KMS or govcloud here. Perhaps you could engage with AWS support to figure this out?

ioga commented 1 year ago

Tried creating by hand and the ami starts. Also, I am not sure which ami image is for master and which for agent while searching for amis in community in us-gov-west-1 region. I am just using the amis listed here. The agent ami seems to be deprecated with "DeprecationTime": "2023-03-05T17:00:20.000Z". Do you have a list of amis for us-gov-west-1 region for both master and agent?

There was a problem with the setup which is supposed to update govcloud agent amis. These should be the latest: https://github.com/determined-ai/determined/pull/7983/files#diff-9ba300cda7190935713b42ed001cc3f744eefc73f5c9eeba0177dfa88e054f72R8

I don't think this would fix anything though; I believe the KMS permissions are the root cause.

humbleearth commented 1 year ago

Thanks for the ami update. Related to KMS, the iam role has full access to KMS. Since the default setup of determined doesn't use kms for encryption, is there any setup parameter in the yaml that needs to be passed to enable kms encryption and specify the key to be used?

vaskokj commented 1 year ago

@ioga That was me :). This is the same exact problem.

It came full circle, here is what you need to add to get this working...

Add the following to your master policy in IAM. I am using the one I named service-determinedai-master.

        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "kms:ListKeys",
                "kms:GetPublicKey",
                "kms:DescribeKey"
            ],
            "Resource": "*"
        },
        {
            "Sid": "determinedaiPassRole",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "arn:aws-us-gov:iam::########:role/service-determinedai-agent-role"
        }

My full IAM policy for master...

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "determinedaiAgentPolicy",
            "Effect": "Allow",
            "Action": [
                "ec2:AttachVolume",
                "ec2:CancelSpotInstanceRequests",
                "ec2:DeleteVolume",
                "ec2:ModifyVolume",
                "ec2:DescribeInstances",
                "ec2:TerminateInstances",
                "ec2:RequestSpotInstances",
                "ec2:CreateTags",
                "ec2:RunInstances",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:DescribeVolumes",
                "ec2:DescribeVolumeStatus",
                "ec2:CreateVolume"
            ],
            "Resource": "*"
        },
        {
            "Sid": "determinedkms",
            "Effect": "Allow",
            "Action": [
                "kms:ListKeys",
                "kms:GetPublicKey",
                "kms:DescribeKey"
            ],
            "Resource": "*"
        },
        {
            "Sid": "determinedaiPassRole",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "arn:aws-us-gov:iam::########:role/service-determinedai-agent-role"
        }
    ]
}
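
A quick way to sanity-check a policy like the one above before attaching it is to parse it and confirm the actions of interest are covered. This is a simplified sketch rather than a reimplementation of IAM evaluation: it only looks at `Allow` statements and their `Action` patterns, and ignores `Deny`, `NotAction`, `Resource`, and `Condition`. The function names and the `REQUIRED` list are illustrative choices, and the embedded policy is a trimmed copy of the one in this comment.

```python
import json
from fnmatch import fnmatchcase


def allowed_action_patterns(policy):
    """Collect lowercased Action patterns from all Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        patterns.extend(action.lower() for action in actions)
    return patterns


def missing_actions(policy, required):
    """Return the required actions that no Allow pattern covers."""
    patterns = allowed_action_patterns(policy)
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    ]


# Trimmed copy of the master policy from this thread; account ID stays masked.
POLICY_JSON = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "determinedaiAgentPolicy",
            "Effect": "Allow",
            "Action": ["ec2:RunInstances", "ec2:CreateVolume", "ec2:AttachVolume"],
            "Resource": "*"
        },
        {
            "Sid": "determinedkms",
            "Effect": "Allow",
            "Action": ["kms:ListKeys", "kms:GetPublicKey", "kms:DescribeKey"],
            "Resource": "*"
        },
        {
            "Sid": "determinedaiPassRole",
            "Effect": "Allow",
            "Action": ["iam:PassRole"],
            "Resource": "arn:aws-us-gov:iam::########:role/service-determinedai-agent-role"
        }
    ]
}
"""

# Illustrative selection of actions to look for.
REQUIRED = ["ec2:RunInstances", "kms:DescribeKey", "iam:PassRole"]

if __name__ == "__main__":
    policy = json.loads(POLICY_JSON)  # raises ValueError on malformed JSON
    print(missing_actions(policy, REQUIRED))
```

Checking for `kms:CreateGrant` against this policy reports it as missing, which is expected here: in this setup that permission is granted through the KMS key policy rather than the IAM policy.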

In the KMS key policy, you need to add the following statements:

        {
            "Sid": "Allow use of the key",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws-us-gov:iam::#######:role/service-determinedai-master-role"
                ]
            },

and

        {
            "Sid": "Allow attachment of persistent resources",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws-us-gov:iam::#######:role/service-determinedai-master-role"
                ]
            },
            "Action": [
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:RevokeGrant"
            ],
            "Resource": "*",
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": "true"
                }
            }
        }

This fixes the problem.
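
Building the key policy statements as Python dicts and serializing them avoids hand-editing mistakes such as stray trailing commas. The sketch below is hedged in one important respect: the first statement is cut off in the comment above, so its `Action` list here is an assumption taken from the statement of the same name in AWS's default console-generated key policy, not from this thread. The account ID in the usage example is a placeholder and the function name is hypothetical.

```python
import json


def key_policy_statements(role_arn):
    """Build the two key policy statements for the given master role ARN."""
    use_of_key = {
        "Sid": "Allow use of the key",
        "Effect": "Allow",
        "Principal": {"AWS": [role_arn]},
        # Assumption: action list from AWS's default key policy, because
        # the statement in the thread is truncated before "Action".
        "Action": [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        "Resource": "*",
    }
    attach_persistent = {
        "Sid": "Allow attachment of persistent resources",
        "Effect": "Allow",
        "Principal": {"AWS": [role_arn]},
        "Action": ["kms:CreateGrant", "kms:ListGrants", "kms:RevokeGrant"],
        "Resource": "*",
        # Limits grants to those created on behalf of AWS services such as EBS.
        "Condition": {"Bool": {"kms:GrantIsForAWSResource": "true"}},
    }
    return [use_of_key, attach_persistent]


if __name__ == "__main__":
    # Placeholder account ID; substitute the real one.
    arn = "arn:aws-us-gov:iam::000000000000:role/service-determinedai-master-role"
    print(json.dumps(key_policy_statements(arn), indent=4))
```

The printed statements would be merged into the existing `Statement` array of the key policy, alongside whatever administrative statements it already contains.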

ioga commented 1 year ago

@vaskokj thank you very much!

humbleearth commented 1 year ago

@vaskokj @ioga Thanks for the help. kms was the issue it seems :)