OpenDroneMap / opendronemap-ecs

Serverless API to get opendronemap tasks running on AWS Elastic Container Service (ECS)
GNU General Public License v3.0

instances not registering to cluster #14

Closed danbjoseph closed 6 years ago

danbjoseph commented 6 years ago

if i ssh into the 1 running instance of my auto scaling group, the setting from user-data.yml shows

[ec2-user@ip-xxx-xx-x-xxx ~]$ cat /etc/ecs/ecs.config
ECS_CLUSTER=odm

but

➜  ~ aws ecs describe-clusters --clusters "odm"
{
    "clusters": [
        {
            "status": "ACTIVE", 
            "statistics": [], 
            "clusterName": "odm", 
            "registeredContainerInstancesCount": 0, 
            "pendingTasksCount": 0, 
            "runningTasksCount": 0, 
            "activeServicesCount": 0, 
            "clusterArn": "arn:aws:ecs:us-east-1:xxxxxxxxxxxx:cluster/odm"
        }
    ], 
    "failures": []
}

the odm cluster in the Amazon ECS dashboard also notes 0 container instances
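The same check can be scripted. This is a minimal sketch, assuming the JSON printed by `aws ecs describe-clusters` has been captured as a string; the function name `clusters_without_instances` is hypothetical, and the sample is trimmed from the output quoted above.

```python
import json


def clusters_without_instances(describe_clusters_json):
    """Return the names of clusters that report zero registered container instances."""
    data = json.loads(describe_clusters_json)
    return [
        cluster["clusterName"]
        for cluster in data.get("clusters", [])
        if cluster.get("registeredContainerInstancesCount", 0) == 0
    ]


# Sample trimmed from the output quoted in this thread (account ID masked as posted).
SAMPLE = """
{
    "clusters": [
        {
            "status": "ACTIVE",
            "clusterName": "odm",
            "registeredContainerInstancesCount": 0,
            "clusterArn": "arn:aws:ecs:us-east-1:xxxxxxxxxxxx:cluster/odm"
        }
    ],
    "failures": []
}
"""

if __name__ == "__main__":
    print(clusters_without_instances(SAMPLE))
```

An empty result would mean every listed cluster has at least one registered instance.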

matthewberryman commented 6 years ago

Did you launch the cluster after setting up the cluster with the CLI tool?

matthewberryman commented 6 years ago

Also, are you using the ECS AMI? Separately, I should note in the docs that the regions must all match, but I can see from your email that you're operating in the same region (us-east-1).

danbjoseph commented 6 years ago

screen shot 2018-01-02 at 3 30 31 pm

matthewberryman commented 6 years ago

Ok. Some other things to check:

matthewberryman commented 6 years ago

You may need to change your launch configuration (if points in previous comment don't hold) and relaunch. I'm not sure about making a note in the readme as it's becoming bloated and once I fix #4 this will solve a lot of the issues.

See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/logs.html for location of ECS agent logs (in the ec2 instance) that can help with further troubleshooting.

danbjoseph commented 6 years ago

so for 'Network' under '1. Configure Auto Scaling group details' if i put "Launch into EC2-Classic" then I get error messages about VPC security rules for a non-VPC instance when it tries to start the initial instance. if i select "vpc-xxxxxxx (xxx.xx.x.x/xx)" then i get this message:

screen shot 2018-01-02 at 3 56 33 pm

in my AWS VPC dashboard, it lists only the above one. and it is not default. if i select it, "create default VPC" is disabled in the action menu. if i create one should i be able to set it as default? what settings?

screen shot 2018-01-02 at 4 01 17 pm

are there implications for our wider AWS infrastructure?

matthewberryman commented 6 years ago

If you want to launch into EC2-classic then you could get rid of the conflicting VPC rules. Probably best to deselect the one you've created and then create a default one using the action menu, which will autoassign public IP addresses. Otherwise if you wanted to go with the separate one (to keep this separate from other services - VPC provides network isolation that can be useful security-wise) then the auto-assign feature is under VPC -> subnets.

screenshot 2018-01-03 08 07 22

danbjoseph commented 6 years ago

matthewberryman commented 6 years ago

Per https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/default-vpc.html seems you can't create a default VPC if your account is old enough that it supports ec2-classic - I guess you'd have to log a support ticket.

So, to launch into EC2-classic, you would need to modify your launch configuration (by copying the existing one, and then on the next step go back and modify), to change the security groups to EC2-Classic security groups.

Or per https://github.com/OpenDroneMap/opendronemap-ecs/issues/14#issuecomment-354875842 for each of the subnets associated with your vpc-85d2e3e2 go and change the autoassign setting.
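To find which subnets still need the auto-assign setting changed, one option is to filter the output of `aws ec2 describe-subnets` on its `MapPublicIpOnLaunch` field. The sketch below assumes that output has already been parsed into a dict; the function name and the subnet IDs in the sample are hypothetical.

```python
def subnets_without_public_ip(describe_subnets, vpc_id):
    """Return IDs of subnets in vpc_id that do not auto-assign public IPs."""
    return [
        subnet["SubnetId"]
        for subnet in describe_subnets.get("Subnets", [])
        if subnet.get("VpcId") == vpc_id
        and not subnet.get("MapPublicIpOnLaunch", False)
    ]


# Hypothetical sample in the shape returned by `aws ec2 describe-subnets`.
SAMPLE_SUBNETS = {
    "Subnets": [
        {"SubnetId": "subnet-aaaa1111", "VpcId": "vpc-85d2e3e2", "MapPublicIpOnLaunch": False},
        {"SubnetId": "subnet-bbbb2222", "VpcId": "vpc-85d2e3e2", "MapPublicIpOnLaunch": True},
        {"SubnetId": "subnet-cccc3333", "VpcId": "vpc-00000000", "MapPublicIpOnLaunch": False},
    ]
}

if __name__ == "__main__":
    print(subnets_without_public_ip(SAMPLE_SUBNETS, "vpc-85d2e3e2"))
```

Each subnet ID returned is one where instances would launch without a public IP unless the launch configuration overrides it.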

danbjoseph commented 6 years ago

If I change to an AWS region that shows having a default VPC for my account (e.g. us-east-2) for the cluster, will it matter that the S3 bucket with all the imagery has the region listed as us-east-1?

matthewberryman commented 6 years ago

No, provided you don't block outbound traffic from the ec2 instances, or specify a region in the s3 policy, but unless you've changed things neither would be the case.

danbjoseph commented 6 years ago

hmmm. repeated everything in us-east-2

danbjoseph commented 6 years ago

on the initial instance cat /var/log/ecs/ecs-init.log shows a repeating list of:

2018-01-02T22:56:07Z [INFO] Removing existing agent container ID: 513a789ee4aa4d60e6633cad788f3926e032a73857799d2b6ce7da0267024775
2018-01-02T22:56:07Z [INFO] Starting Amazon EC2 Container Service Agent
2018-01-02T22:56:09Z [INFO] Agent exited with code 1
2018-01-02T22:56:09Z [INFO] Container name: /ecs-agent
2018-01-02T22:56:09Z [INFO] Removing existing agent container ID: 4ad1e35a71665b45b3e8b3204510ec5a0f7e4977c1838d5e8d27fa2d99020d0c
2018-01-02T22:56:09Z [INFO] Starting Amazon EC2 Container Service Agent
2018-01-02T22:56:09Z [INFO] Agent exited with code 1
2018-01-02T22:56:09Z [INFO] Container name: /ecs-agent
2018-01-02T22:56:09Z [INFO] Removing existing agent container ID: b6efcda2313c6117465b5308e8dbee612d1741b4ed9f066d1aa6394a7e04043e
2018-01-02T22:56:10Z [INFO] Starting Amazon EC2 Container Service Agent
2018-01-02T22:56:11Z [INFO] Agent exited with code 1
2018-01-02T22:56:11Z [INFO] Container name: /ecs-agent
2018-01-02T22:56:11Z [INFO] Removing existing agent container ID: 736c52a591de3c780791f02cd8d39872375e0fbaee44eb4a5811c57d0d1c2a3e
2018-01-02T22:56:11Z [INFO] Starting Amazon EC2 Container Service Agent
2018-01-02T22:56:12Z [INFO] Agent exited with code 1
2018-01-02T22:56:12Z [INFO] Container name: /ecs-agent
2018-01-02T22:56:12Z [INFO] Removing existing agent container ID: bd2b9eadb65f306fa42498084fc90e0da9d05fcf97e55320892f3da904971b56
2018-01-02T22:56:12Z [INFO] Starting Amazon EC2 Container Service Agent

danbjoseph commented 6 years ago

and also:

[ec2-user@ip-172-31-9-201 ~]$ cat /var/log/ecs/ecs-agent.log.2018-01-02-22 
2018-01-02T22:18:43Z [INFO] Loading configuration
2018-01-02T22:18:43Z [INFO] Loading state! module="statemanager"
2018-01-02T22:18:43Z [INFO] Event stream ContainerChange start listening...
2018-01-02T22:18:43Z [INFO] Creating root ecs cgroup: /ecs
2018-01-02T22:18:43Z [INFO] Creating cgroup /ecs
2018-01-02T22:18:43Z [INFO] Registering Instance with ECS
2018-01-02T22:18:43Z [ERROR] Could not register: AccessDeniedException: User: arn:aws:sts::499923577862:assumed-role/odm-ecs/i-0fa57f09750803dd3 is not authorized to perform: ecs:RegisterContainerInstance on resource: arn:aws:ecs:us-east-2:499923577862:cluster/odm
    status code: 400, request id: e4d1bfc2-f00a-11e7-91cb-27f6e55749c8
2018-01-02T22:18:43Z [ERROR] Error registering: AccessDeniedException: User: arn:aws:sts::499923577862:assumed-role/odm-ecs/i-0fa57f09750803dd3 is not authorized to perform: ecs:RegisterContainerInstance on resource: arn:aws:ecs:us-east-2:499923577862:cluster/odm
    status code: 400, request id: e4d1bfc2-f00a-11e7-91cb-27f6e55749c8

matthewberryman commented 6 years ago

Yeah, it's a permissions issue (see point 3 of https://github.com/OpenDroneMap/opendronemap-ecs/issues/14#issuecomment-354870127). Back on the launch configuration screen, note the IAM role (named IAM Instance Profile):

screenshot 2018-01-03 10 04 13

Back on the IAM console, search for that role:

screenshot 2018-01-03 10 05 42

Then, under Permissions, you should have your S3 policy (odm-ecs) attached as well as the system policy AmazonEC2ContainerServiceforEC2Role. If either is missing, click Attach, search for those policies, and attach them.
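As a rough cross-check of the console view, the policy names attached to the role can be compared against the two named above. This sketch assumes the names were gathered separately, for example from the `PolicyName` fields of `aws iam list-attached-role-policies`; `missing_policies` is a hypothetical helper, and `odm-ecs` is simply the S3 policy name used in this thread.

```python
# The two policies this thread says the instance role needs.
REQUIRED_POLICIES = {"odm-ecs", "AmazonEC2ContainerServiceforEC2Role"}


def missing_policies(attached_policy_names):
    """Return the required policy names absent from the attached list, sorted."""
    return sorted(REQUIRED_POLICIES - set(attached_policy_names))


if __name__ == "__main__":
    # A role that only has the S3 policy attached.
    print(missing_policies(["odm-ecs"]))
```

The check only means something when run against the role shown as the IAM Instance Profile in the launch configuration; a similarly named role can pass it while the instances keep failing to register.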

matthewberryman commented 6 years ago

(just noting I updated previous comment, which you might not pick up on GH notification emails, unless you click on link and refresh, as last line should read "search for those policies" )

danbjoseph commented 6 years ago

despite now having those added IAM permissions, i'm still getting the same error logs

[ec2-user@ip-172-31-47-147 ~]$ cat /var/log/ecs/ecs-agent.log.2018-01-02-23 
2018-01-02T23:18:15Z [INFO] Loading configuration
2018-01-02T23:18:15Z [INFO] Loading state! module="statemanager"
2018-01-02T23:18:15Z [INFO] Event stream ContainerChange start listening...
2018-01-02T23:18:15Z [INFO] Creating root ecs cgroup: /ecs
2018-01-02T23:18:15Z [INFO] Creating cgroup /ecs
2018-01-02T23:18:15Z [INFO] Registering Instance with ECS
2018-01-02T23:18:15Z [ERROR] Could not register: AccessDeniedException: User: arn:aws:sts::499923577862:assumed-role/odm_ecsInstanceRole/i-0f1ff62b790711126 is not authorized to perform: ecs:RegisterContainerInstance on resource: arn:aws:ecs:us-east-2:499923577862:cluster/odm
    status code: 400, request id: 35d054ef-f013-11e7-a747-bbbdbef005a8
2018-01-02T23:18:15Z [ERROR] Error registering: AccessDeniedException: User: arn:aws:sts::499923577862:assumed-role/odm_ecsInstanceRole/i-0f1ff62b790711126 is not authorized to perform: ecs:RegisterContainerInstance on resource: arn:aws:ecs:us-east-2:499923577862:cluster/odm
    status code: 400, request id: 35d054ef-f013-11e7-a747-bbbdbef005a8

screen shot 2018-01-02 at 6 23 59 pm
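The AccessDeniedException lines in the agent logs above name the role the instance actually assumed, which is useful for confirming which role needs the policies. A minimal sketch of pulling that out with a regular expression follows; the pattern is written against the log lines quoted in this thread and is an assumption about their format, not an official log schema.

```python
import re

# Pattern based on the ERROR lines quoted in this thread.
ACCESS_DENIED = re.compile(
    r"AccessDeniedException: User: arn:aws:sts::(?P<account>\d+):assumed-role/"
    r"(?P<role>[^/]+)/(?P<session>\S+) is not authorized to perform: "
    r"(?P<action>\S+) on resource: (?P<resource>\S+)"
)


def parse_access_denied(line):
    """Return account, role, session, action and resource as a dict, or None."""
    match = ACCESS_DENIED.search(line)
    return match.groupdict() if match else None


# One of the ERROR lines from the second log excerpt above.
SAMPLE_LINE = (
    "2018-01-02T23:18:15Z [ERROR] Could not register: AccessDeniedException: "
    "User: arn:aws:sts::499923577862:assumed-role/odm_ecsInstanceRole/i-0f1ff62b790711126 "
    "is not authorized to perform: ecs:RegisterContainerInstance on resource: "
    "arn:aws:ecs:us-east-2:499923577862:cluster/odm"
)

if __name__ == "__main__":
    info = parse_access_denied(SAMPLE_LINE)
    print(info["role"], info["action"])
```

Run against both excerpts, this would show the role changing from `odm-ecs` to `odm_ecsInstanceRole` between launches while the denied action stays the same.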

matthewberryman commented 6 years ago

I think you need to relaunch the cluster to get it to pick up the changes

matthewberryman commented 6 years ago

From the logs it looks like it's creating a separate instance role off of the main role at launch.

danbjoseph commented 6 years ago

deleted the auto scaling group, deleted the launch configuration, ran aws ecs delete-cluster --cluster odm, went through setup again. no dice.

matthewberryman commented 6 years ago

Ok I think we need to tee up a time when we can use Google Chrome Screen Sharing or something so I can step through and take a look at things.

danbjoseph commented 6 years ago

i think i may have been looking at a stackoverflow describing a similar problem and referenced the IAM role mentioned there instead of the one you noted in the comment. thanks for your help in figuring out i had the wrong one added.

matthewberryman commented 6 years ago

The naming convention used doesn't help—often they're too close to make sense of and it's only by reading the JSON (urgh) that I can figure out the intent. Glad you're up and running now. Closing this and will review the pull request making this policy issue clearer, and merge shortly.

ShahNewazKhan commented 5 years ago

Ok. Some other things to check:

Thanks, my issue was that the subnets in the VPC did not have an internet gateway (IGW) route set up in their route tables.
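A quick way to spot this condition is to look for a `0.0.0.0/0` route whose target is an internet gateway. The sketch below assumes a single route table entry in the shape returned by `aws ec2 describe-route-tables`; the function name and the gateway ID in the sample are hypothetical.

```python
def has_internet_gateway_route(route_table):
    """Return True if the table has a default route (0.0.0.0/0) via an internet gateway."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
        for route in route_table.get("Routes", [])
    )


# Hypothetical samples: one table with only the local route, one with an IGW route.
LOCAL_ONLY = {
    "Routes": [{"DestinationCidrBlock": "172.31.0.0/16", "GatewayId": "local"}]
}
WITH_IGW = {
    "Routes": [
        {"DestinationCidrBlock": "172.31.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123abcd"},
    ]
}

if __name__ == "__main__":
    print(has_internet_gateway_route(LOCAL_ONLY), has_internet_gateway_route(WITH_IGW))
```

Without such a route (or some other outbound path), the ECS agent on the instance may be unable to reach the ECS endpoint, so the instance never registers.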