docker-archive / for-aws


Docker for AWS doesn't appear to benefit from ECR Policy during Stack / Service Creation #5

Open ambrons opened 7 years ago

ambrons commented 7 years ago

Expected behavior

When the Amazon policy "AmazonEC2ContainerRegistryReadOnly" is added to the role created by the Docker for AWS CloudFormation template, Docker should be able to find and pull images without having to use aws ecr get-login. Future updates of the service via the scheduler should also be able to access ECR without needing credentials cached via the --with-registry-auth command line argument.

This would give Docker Swarm parity with Amazon's ECS when using ECR as a private registry, and would avoid the confusion of Docker for AWS honoring some AWS policies while ignoring others.

Actual behavior

Using docker stack deploy -c <composefile.yaml> <stack name> with the Amazon policy "AmazonEC2ContainerRegistryReadOnly" added results in the image not being found:

$ docker-swarm service ps 3sgzt87ip19t
ID            NAME                   IMAGE                                                                            NODE                           DESIRED STATE  CURRENT STATE            ERROR                             PORTS
x6l8jtn9hnw0  esports_wallets.1      ###.dkr.ecr.us-east-1.amazonaws.com/path/to/image:tag  ip-172-31-20-65.ec2.internal   Ready          Rejected 2 seconds ago   "No such image: ###.d…"

Using aws ecr get-login works during creation with the command docker stack deploy -c <composefile.yaml> --with-registry-auth <stack name>; however, after the Amazon password expires (12 hours), any managers or nodes that were not primed will not be able to find the image, as shown above.

Information

Using Docker for AWS, just released with Docker version 1.13.0. I'm able to use docker stack deploy -c <composefile.yaml> --with-registry-auth <stack name> without issue, having run aws ecr get-login... beforehand.

However, because AWS ECR docker login passwords are only good for 12 hours, if the scheduler rebalances the deployed services after 12 hours and the node a task lands on doesn't have the existing authentication, it fails to authenticate and cannot find the image.
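For reference, the workflow that does work initially looks roughly like this (the compose file and stack names here are placeholders):

$(aws ecr get-login --region us-east-1)   # log the local Docker client in to ECR; the token is good for ~12 hours
docker stack deploy -c docker-compose.yml --with-registry-auth mystack   # pass the cached ECR credentials to the swarm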

I've read through the issues and found this issue talking about enhancements to how --with-registry-auth works to allow it to use refresh tokens or the like. https://github.com/docker/docker/issues/24940

I chose to create a new ticket because https://github.com/docker/docker/issues/24940 appears to be geared more toward solving Docker Hub than ECR. I also opened this at https://github.com/docker/docker/issues/30713, but it was requested that I reopen the issue here.

I've also looked at several posts on the Docker for AWS forums. The one that pinpoints the problem, with no real solution, is this:

https://forums.docker.com/t/possible-to-use-aws-ecr-for-image-registry/22295/3

Considering this is Docker for AWS, and you have developed Docker AWS containers to run on the EC2 instances specific to AWS, why can't we just attach the ECR ReadOnly policy to the ProxyRole created during CloudFormation, so that we have access to ECR without having to worry about authentication at all?
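For what it's worth, attaching the managed policy by hand looks roughly like this (the role name below is a placeholder for the generated ProxyRole name):

aws iam attach-role-policy \
  --role-name Docker-ProxyRole-EXAMPLE \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly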

I have attempted to add the policy (AmazonEC2ContainerRegistryReadOnly) to the created ProxyRole used by the InstanceProfile for all Docker nodes in the CloudFormation stack, however this had no effect. The result from a service ps is as follows:

$ docker-swarm service ps 3sgzt87ip19t
ID            NAME                   IMAGE                                                                            NODE                           DESIRED STATE  CURRENT STATE            ERROR                             PORTS
x6l8jtn9hnw0  esports_wallets.1      ###.dkr.ecr.us-east-1.amazonaws.com/path/to/image:tag  ip-172-31-20-65.ec2.internal   Ready          Rejected 2 seconds ago   "No such image: ###.d…"

However, if I remove the service and re-create it after running $(aws ecr get-login --region us-east-1), it works fine, finds the image, and runs. But then 12 hours later I'm likely back in the state referenced in the forum link above or in issue #24940.

We've been using ECS for over a year and the behavior of making sure the Instance profile role has the AmazonEC2ContainerRegistryReadOnly policy is all that's needed.

I dug a little into the containers deployed on an EC2 instance; cool stuff there, how the SSH access is actually a container, as is the metadata service. My assumption is that we're not trying to assume the instance policy role during a registry fetch. Or I could be talking out of my butt there.

I've even gone as far as modifying the CloudFormation template to include the ECR policies at creation time, but it made no difference.

Steps to reproduce the behavior

  1. Use the AWS Console to apply the AmazonEC2ContainerRegistryReadOnly policy to the CloudFormation-generated ProxyRole used by the InstanceProfile of Docker for AWS
  2. Create a service or stack with an image referenced in ECR
  3. service ps or container inspect to see that the image is not found, i.e. authentication wasn't present.

Output of docker version:

Docker version 1.13.0, build 49bf474

Output of docker info:

Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 1.13.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: awslogs
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: 6kup6o2m8rmgrydrj9k3p1089
 Is Manager: true
 ClusterID: sgq7930ifiv21xlc94jci3tsd
 Managers: 1
 Nodes: 4
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.31.40.242
 Manager Addresses:
  172.31.40.242:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.4-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785 GiB
Name: ip-172-31-40-242.ec2.internal
ID: REJ6:Q2GX:RYQD:KC3A:2ZK2:5JZJ:PT42:BT6B:64TN:XQED:WKBH:H7PC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 77
 Goroutines: 190
 System Time: 2017-02-03T14:48:08.410775982Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

AWS / Docker for AWS 1 manager and 3 workers

kencochrane commented 7 years ago

@ambrons so to sum up the issue it looks like there are 2 problems.

  1. Need the IAM permissions to get the ecr tokens (http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly)
  2. The ecr tokens expire every 12 hours, so we need a way to update these tokens before they expire.

Did I miss anything?

ambrons commented 7 years ago

@kencochrane

Sorry for my long-winded report; I can see how it could be a little confusing.

The primary concern is your number one. I'm not familiar with the internals of how you are handling EC2 instance profiles => AWS roles, so perhaps it is used to get the ECR tokens; however, in my AWS experience the EC2 instance's InstanceProfile is assigned a Role which has policies attached. The EC2 instance can then assume the role for the purpose of obtaining access to AWS resources.

So in summary, the ideal would be allowing the user of Docker for AWS to opt in to the ECR policy (or including it by default) in the CloudFormation template. This would give Docker managers and nodes access to ECR without the docker user having to run docker login at all. From the docker user's perspective it would behave as if it were a public repo. I know ECS does this seamlessly, but I'm not sure if there's some magic they do under the covers to assume the role on the EC2 instance during a docker pull.

In either case, if the above is done, token expiry becomes moot because all instances in the Docker for AWS cluster would have access to ECR all the time based on the above role/policies.

kencochrane commented 7 years ago

@ambrons thank you for the details. This is what I'm thinking, let me know if that works for you.

We add a new template question that asks if you want to enable ECR support (yes or no, default to no)

if they say 'no':

If they say 'yes':

  1. we add the read-only policy detailed here (http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly) to the IAM role we create. This will allow us to get the ECR tokens.

  2. we add a background process that will run the docker login command with the ECR tokens every X hours, to make sure the tokens never expire.

For 2, I'm not sure if we need to run that on every host, or if we just need to do it on one leader and swarm will automatically replicate it across; I'll need to do some research first.
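As a rough sketch of what that background process might look like (the region and interval here are assumptions, not a final design):

while true; do
  eval "$(aws ecr get-login --region us-east-1)"   # refresh the ECR docker login before the 12-hour token expires
  sleep 21600                                      # wait 6 hours
done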

Questions:

  1. Would this work for you?
  2. Would read only be fine? If someone needs more than read only, they will need to manually update the policy/role, which might not be ideal.
ambrons commented 7 years ago

@kencochrane

I agree with the CloudFormation question for ECR support as well as adding AmazonEC2ContainerRegistryReadOnly. Let that be a parameter of the template. I don't think they would need read/write access unless you had a container running on the swarm as, say, part of a CI tool with Docker-in-Docker that built a docker image and wanted to push it to ECR after it was created. I'm not sure how often that would be the case.

I know for our immediate needs ReadOnly would be fine.

As for the part about refreshing ECR tokens every X hours, I'm still not sure that's necessary. I've included the following documentation that shows the interaction between an EC2 instance, InstanceProfile, Role, Policy, and an application running on EC2.

http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html

I know that for the applications we've written to date that run in Docker on top of an EC2 instance, we do not have to handle tokens or the like as long as the EC2 instance's InstanceProfile + Role includes the proper permissions in a policy associated with the role. Which is why I was surprised that when I added the ECR policy manually to the role generated by the Docker for AWS CloudFormation template, it didn't just work.
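For context, this is roughly how an application on EC2 picks up the role credentials without any explicit login; the AWS CLI and SDKs query the instance metadata service behind the scenes:

ROLE=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/)
curl -s "http://169.254.169.254/latest/meta-data/iam/security-credentials/${ROLE}"
# returns temporary AccessKeyId, SecretAccessKey, Token, and an Expiration timestamp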

My assumption is that the internal Docker containers running for the AWS metadata and console access might be interfering with the natural behavior of the EC2 instance resource-access model.

All of that said, regardless of the underpinnings, I believe the net result would be the same: the docker swarm user would not have to take any additional steps to pull from ECR; it would just work, assuming the above AmazonEC2ContainerRegistryReadOnly policy was associated with the Docker for AWS role.

kencochrane commented 7 years ago

I agree with the CloudFormation question for ECR support as well as adding AmazonEC2ContainerRegistryReadOnly. Let that be a parameter of the template. I don't think they would need read/write access unless you had a container running on the swarm as, say, part of a CI tool with Docker-in-Docker that built a docker image and wanted to push it to ECR after it was created. I'm not sure how often that would be the case.

I know for our immediate needs ReadOnly would be fine.

OK, we can start with read only for now, and see where that gets us.

As for the part about refreshing ECR tokens every X hours, I'm still not sure that's necessary. I've included the following documentation that shows the interaction between an EC2 instance, InstanceProfile, Role, Policy, and an application running on EC2.

http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html

I know that for the applications we've written to date that run in Docker on top of an EC2 instance, we do not have to handle tokens or the like as long as the EC2 instance's InstanceProfile + Role includes the proper permissions in a policy associated with the role. Which is why I was surprised that when I added the ECR policy manually to the role generated by the Docker for AWS CloudFormation template, it didn't just work.

The IAM role will allow you to run the aws ecr commands to get the token, but you will still need to run aws ecr get-login every 12 hours, or else the token will expire. It doesn't automatically update your token for you. see http://docs.aws.amazon.com/cli/latest/reference/ecr/get-login.html for more details.

We could use something like https://github.com/awslabs/amazon-ecr-credential-helper, which is a docker credential store that automatically handles fetching the ECR tokens for you, without the need to run docker login, etc.
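A minimal sketch of wiring that helper up (the account ID and region are placeholders, the docker-credential-ecr-login binary is assumed to be on the PATH, and this overwrites any existing config.json):

cat > ~/.docker/config.json <<'EOF'
{
  "credHelpers": {
    "123456789012.dkr.ecr.us-east-1.amazonaws.com": "ecr-login"
  }
}
EOF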

ambrons commented 7 years ago

@kencochrane

The IAM role will allow you to run the aws ecr commands to get the token, but you will still need to run aws ecr get-login every 12 hours, or else the token will expire. It doesn't automatically update your token for you. see http://docs.aws.amazon.com/cli/latest/reference/ecr/get-login.html for more details. We could use something like https://github.com/awslabs/amazon-ecr-credential-helper, which is a docker credential store that automatically handles fetching the ECR tokens for you, without the need to run docker login, etc.

I guess amazon-ecr-credential-helper is similar to the agent that runs on each EC2 container instance for ECS: https://github.com/aws/amazon-ecs-agent

Okay, so now that my magic has been debunked: is the plan to have the CloudFormation template for swarm set all of this up for ECR users, such that if they choose to include the ECR role or Docker repository in their CloudFormation stack, the created EC2 instances will keep the tokens refreshed?

So to recap, now that I'm on the same page with you: I don't have an issue using aws ecr get-login on my local system over an SSH tunnel to the docker swarm manager to initially create deployments and stacks. That works as advertised. What isn't solved is that after 12 hours the token used during the install expires, such that the scheduler gets borked if the service is updated or rebalanced past that point.

So, ignoring the IAM role for a minute, I guess the real issue is keeping the authentication refreshed after the service or stack is created/deployed.

Is there a way to auto-configure/deploy https://github.com/awslabs/amazon-ecr-credential-helper as a global container alongside the other global docker containers on each CloudFormation instance, such that this all just works?

Having never used the docker credential helper before, it appears you might not need an AWS key/secret if you use AWS_SDK_LOAD_CONFIG=true to assume the role of the instance. Or perhaps that's in addition to the AWS key and secret.

kencochrane commented 7 years ago

Short answer: yes, the goal is to allow someone to pick "enable ECR support" and have it deal with the IAM profile and keep the token up to date.

DanielMaier-BSI commented 7 years ago

Hi,

First of all, thanks for this CloudFormation template. It is really awesome.

We are also very interested in this feature. Have you already experimented with this, or is there any schedule for when ECR integration will be implemented?

Many thanks, Daniel

jeffhuys commented 7 years ago

No ECR support is a really big problem for us. We want to move from ECS to Docker for AWS, but would like to have all our build scripts remain mostly the same...

kencochrane commented 7 years ago

@jeffhuys it is on the roadmap; the current estimate is for it to arrive with 17.06-ce.

kencochrane commented 7 years ago

It has been a while since I commented on this issue, so I think it is time for an update to let everyone know where we are.

Before I get into the current status, I'll give a little background.

ECR handles auth by using IAM roles for your servers, which allow you to get short-lived auth tokens that can be used to docker push/pull your images.

There are two common ways of doing this: you can run aws ecr get-login or use the amazon-ecr-credential-helper. Either will get your auth tokens and put them in the right location so that docker can use them, just like any other credential.

Docker for AWS uses SwarmKit for orchestration, and when you deploy your service you can use the --with-registry-auth flag, which takes the ECR credentials you have locally and passes them to the workers that are going to run your service containers, so they have access to those credentials.

This all works as it should, and if you never needed the ECR auth tokens again, there would be no problem. Unfortunately, you will most likely need the token in the future. Even if you are not going to be doing a redeploy, you would need it if a host goes down and your containers get rescheduled to a new node. If this new node doesn't have the required images already loaded in its image cache, it will need to pull them down from ECR. If this happens after the ECR auth token has expired, the image pulls from ECR will fail with invalid credentials.

Another important thing to consider is that there are two different parts to Docker, the server and the client. The server is the docker daemon that runs on the host and manages the containers, etc. The client is a CLI binary that interacts with a given docker daemon, which might be local or on a remote server. Docker auth credentials are handled by the docker client, and the docker server doesn't have access to them. Because of this, we can't use aws ecr get-login or amazon-ecr-credential-helper on swarm workers to get new auth tokens when the current ones expire.

One workaround would be to get a new auth token and then do a docker service update every X hours before the token expires, but this would cause a rolling update of the service to update the registry creds on all worker nodes, which isn't ideal. We have some other ideas as well, but they are also not very good. Whatever solution we finally choose, we want to make sure it is something that works well for everyone and won't make things worse in the long run.
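A minimal sketch of that workaround, assuming a single service in us-east-1 (the service name is a placeholder); note that each run triggers a rolling update:

eval "$(aws ecr get-login --region us-east-1)"         # fetch a fresh 12-hour ECR token
docker service update --with-registry-auth my-service  # push the new credentials out via a rolling update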

The only viable solution is to change the way docker swarm handles registry auth credentials. We are currently working with that team to come up with a solution that works not just for ECR but for all registries. If you want to track the progress, feel free to follow this issue: https://github.com/moby/moby/issues/24940

Until those changes are added to swarm, we are blocked. I'll be sure to update this issue with our progress to let everyone know where we stand.

blaketastic2 commented 7 years ago

As a workaround for this (I didn't want to install the AWS CLI on the manager node), I was able to use the AWS CLI in the "guide-aws" container that is started on my manager node. By running the following, I can capture the login command:

docker exec -it guide-aws sh -c 'aws ecr get-login --region us-east-1'

Then, I've set up a cron to run on the manager every 6 hours to re-login the manager.

Would love some thoughts on this...

kencochrane commented 7 years ago

@blaketastic2 did this approach work for you?

At first glance, I don't think it will work, because running aws ecr get-login will not update the password that is tied to the service. Swarm takes the password on service create, encrypts it, and stores it in the service config. If that password changes, the service will need to be updated in order for the new password to be set.

blaketastic2 commented 7 years ago

As far as I can tell, this cron works: $ docker exec -it guide-aws sh -c 'aws ecr get-login --region us-east-1' | tr -d '\r' > login && ./login

Followed by this service update: $ docker service update --with-registry-auth serviceA

kencochrane commented 7 years ago

@blaketastic2 yes, if you run the service update after the aws ecr get-login that should work. It isn't ideal since it will cause a rolling update of the service. You will also need to remember to update every service that needs the new credentials.

If it works for you, then 👍

friism commented 7 years ago

From talking to @diogomonica, this is how progress will be made on this problem:

  1. Docker will get pluggable external secret stores
  2. --with-registry-auth transitions to using docker secrets for storing registry creds (right now I think it's just embedded in the service def)
  3. We implement an EC2 instance metadata pseudo-secret store
  4. That gets wired up so that swarm dynamically reads ECR registry creds as a secret, with that secret provided by the pseudo-secret EC2 instance metadata secrets plugin.
RehanSaeed commented 7 years ago

Any guesstimates, rough or otherwise, of when we might see this option in the CloudFormation template?

ambrons commented 7 years ago

@friism @kencochrane this looks good to me, thanks for the update, much appreciated!

destebanm commented 7 years ago

Hi @friism, do you know when the solution you explained will be available?

@kencochrane, is there any way to update the service without doing a rolling update? I mean, only update the credentials every X hours without causing a rolling update of the service. This would be a good workaround while the final solution is developed.

Thanks!

kencochrane commented 7 years ago

As of right now, I do not know of a way to update the credentials without causing a rolling update of the service.

destebanm commented 6 years ago

OK, @kencochrane, thanks! So I suppose we have to wait until the solution explained by @friism is available in EC2 instances and docker swarm. Meanwhile we have to do a rolling update after logging in to ECR, because that is the only way to be sure that the credentials are spread and new EC2 instances run the service successfully.

joostaafjes commented 6 years ago

@blaketastic2 I also don't want to install aws cli, but when doing

docker exec -it guide-aws sh -c 'aws ecr get-login --region eu-west-1'

I get the following error:

An error occurred (AccessDeniedException) when calling the GetAuthorizationToken operation: User: arn:aws:sts::xxxxxxxxxxxxx:assumed-role/Docker-xxxxx-swarm-ProxyRole-xxxxxxx/i-xxxxxxxxxxxxxxxxxxx is not authorized to perform: ecr:GetAuthorizationToken on resource: *

Did you also do something else to get this working?

thx

joostaafjes commented 6 years ago

@blaketastic2 already figured it out; I also had to add the policy AmazonEC2ContainerRegistryReadOnly to the role.

Thanks for the elegant solution. I slightly modified the one-liner to avoid writing a file...

so my cron looks like:

0 */6 * * * eval "$(docker exec -it guide-aws sh -c 'aws ecr get-login --region eu-west-1 --no-include-email'| tr -d '\r')" && docker service update --with-registry-auth main_xxxx

tomalok commented 6 years ago

Would that crontab run on moby (as root or docker?), in the shell-aws container (presumably as docker), or elsewhere?

Also, are the credentials tied to the specific service, or would updating a single token-holding service that wouldn't be affected by restarts be sufficient to distribute the credentials?

very much looking forward to a more elegant and permanent solution to this problem...

FrenchBen commented 6 years ago

@tomalok the guide-aws container is already running - the cron job specified simply runs the command in the container. Since the container is available to your current user, you can set it up without being root.

tomalok commented 6 years ago

@FrenchBen yes, I understand that the aws ecr get-login command is running in guide-aws container. What I'm after is where crond is running? It's not set up by default on the manager node's shell-aws container. (FWIW, it appears that crond runs in the guide-aws container for doing cleanup, etc.)

If I patch the CloudFormation template (around where the manager and worker instances are having their /etc/docker/daemon.json set up and /home/docker permissions set) to add a crontab entry, presumably it would be the non-containerized crond on the moby instance itself that would be running the ECR auth cron job.

Probably wouldn't be a bad idea to have the CloudFormation template run aws ecr get-login (via guide-aws) to get the credentials in place from the get-go.

Another question is whether or not the credentials would need to be for the docker user, the root user, or both -- for the case where a new worker node comes up and needs to pull an ECR image for a global service.

FrenchBen commented 6 years ago

@tomalok if you're going to modify the template, why not add your own container that gets called in the cron job? You can bind mount the proper locations for the credential token to be updated. For example, something like: https://hub.docker.com/r/behance/ecr-login/

tomalok commented 6 years ago

At the moment, the only alteration I'm making to the CloudFormation template is adding one line to the end of each of the UserData scripts for the manager and worker nodes, which pretty much just pulls an install script down from my S3 bucket and executes it.

wget -qO- http://s3-us-west-2.amazonaws.com/my-bucket/install-refresh-ecr-auth.sh | sh -ex

The install script does the rest of the heavy lifting:

The refresh-ecr-auth.sh script merely

Because the moby instance's /home/docker is already bind mounted inside the shell-aws container, the appropriate credentials are already populated in ~/.docker/config.json when you SSH in to the node.

I'm still hand-adding the AmazonEC2ContainerRegistryReadOnly policy to the Worker and Proxy roles, but I'm assuming that should be relatively straightforward to add to the CloudFormation template, too.

FrenchBen commented 6 years ago

@tomalok Sounds like a good approach - any chance you'd consider sharing the (edited) script via a gist?

tomalok commented 6 years ago

Adding the AmazonEC2ContainerRegistryReadOnly policy to the Worker and Proxy roles was rather straightforward, and works nicely.

I'll probably do a little cleanup and make the template/patch, and install script available via S3 somewhere.

tomalok commented 6 years ago

The patch (vs. 17.09.0-ce-aws1) and the resultant patched CloudFormation template are available at http://tomalok-d4aws.s3-website-us-west-2.amazonaws.com/. Also included is a link to the install script source for pre-deployment review.

"Works for me."

mRoca commented 6 years ago

Hello, I've created an image to fix this problem, based on the same solution as https://github.com/docker/for-aws/issues/5#issuecomment-336250817: a service runs on a manager, doing a $(aws ecr get-login ...) and a docker service update --with-registry-auth ... for each service using ECR, every 4 hours.

See https://github.com/mRoca/docker-swarm-aws-ecr-auth; in theory you just have to create the service (see the README) on your AWS swarm stack.

tomalok commented 6 years ago

FWIW, I'm working on an updated/enhanced template patch vs. 17.12.0-ce-aws1

FrenchBen commented 6 years ago

@tomalok If you'd like, I can add the ECR policy into the default template - users will only need to install the ECR refresher manually: wget -qO- http://s3-us-west-2.amazonaws.com/tomalok-d4aws/install-refresh-ecr-auth.sh | sh -ex. This will save you from having to keep the template up to date and will only require the refresher installer to be added.

Additionally, we could add the installer script to this repo, allowing people to contribute, and wrap it in a container. WDYT?

tomalok commented 6 years ago

@FrenchBen - here's my "TODO" list for the next iteration of the CloudFormation patch...

I've moved stuff over to https://github.com/tomalok/docker4aws, master branch is currently the old patch... tomalok2 branch is where I've (slowly) been working on the TODO stuff. Of course, would be nice if some/all of these tweaks ended up going official, someday... ;)

FrenchBen commented 6 years ago

@tomalok Happy to include the policy as part of the deployment (doesn't seem to hurt anything)

For the rest, it seems easier to build an ECR container installer, so that you can install/enable/disable this feature without requiring a ton of work on the users' side. WDYT?

The ELB flag is an interesting one and more than likely applies to the deployments with existing VPC. The LB would need some tweaks, which can be made against its source: https://github.com/docker/infrakit/tree/master/pkg/provider/aws/plugin/loadbalancer

tomalok commented 6 years ago

@FrenchBen with regards to the ECR policy, if it's not being used (ECR auto-auth turned off), then it's probably best not to have the policy -- "least privilege" and all that. It would be even better if it were configurable to permit specific ECR resources, instead of just "*".

I'm not quite sure I understand what you mean by an "ECR container installer"? The way I've been proceeding, the option to disable/enable/configure this (and other) features is in the CloudFormation template, so the user would choose/configure when deploying or updating the stack... Installation/setup/updates are pretty much a one-stop-shop at that point, instead of having to do something more to enable/configure any additional feature(s).

FrenchBen commented 6 years ago

@tomalok wrt the installer/config, it's to tell the user to run one container to do the work that's done within the CloudFormation template. This allows a user to easily enable/disable the ECR script.

Having a dynamic policy seems like a bit more work. Since there is no auth, does it really become a security issue to allow the instance access?

Mobe91 commented 6 years ago

@tomalok First of all, thanks for providing this script.

But I have a weird issue with it. I set it up on all my manager and worker nodes, and apparently it works, since I am able to docker pull images from every ECR repository when SSHing into one of my instances without performing any manual docker login. The strange thing is, I just happened to update my cluster to a new worker instance size via a CloudFormation stack update. This naturally does a rolling upgrade of my worker nodes. Everything worked fine, but then none of my services that use an image from one of my ECR repositories could be restarted on the new nodes. When I run docker service ps on these services, it always says No such image, which usually means "I cannot access the ECR repository". When I do a manual docker pull on a node that is targeted by a service replica deployment, the deployment suddenly starts working.

Any ideas about this?

tomalok commented 6 years ago

Sorry for the delay... I've been sidetracked on a bunch of things...

@Mobe91 I haven't yet tested the case of updating the instance size... it should work, because the instance UserData setup should take care of the initial needfuls. I'll give it a test the next time I have a spare moment or three.

Also, @FrenchBen, etc. because of all the sidetracking, and the fact that there's been a new set of Docker for AWS templates released in the interim, I'm considering a new approach... Leaning towards reimplementing as a script that downloads a specified template (new VPC vs. existing VPC, edge vs. stable, etc.), loads the CloudFormation JSON into a native object for modification (adding/removing features), and then outputs the new resulting template.

tomalok commented 6 years ago

@Mobe91 -- found some time to test tonight, although my Docker for AWS swarm is running 17.12.0-ce-aws1 with a newer, more self-contained ECR auto-authentication patch applied (no more dependency on the S3 bucket).

Bumped my swarm up to t2.small; everything went okay, and the ~docker/.docker/config.json auths had been freshly updated on all the new manager nodes (I have no worker nodes at present)... But then I realized I didn't have a service running out of my ECR registry.

I docker service create ...'d a service from my ECR with replicas on all nodes, and then resized my swarm back down to t2.micro. The first node that came online did not immediately start the service (because there were already enough replicas in operation), but when autoscaling removed an "old" instance, a replica of the service started there just fine, automagically, without intervention.

Maybe the problems you experienced have something to do with the older version and/or a not-yet-ready dependency or a weird race condition... or, if this happened on a worker node, it's possible there's something different on those? That's another situation I haven't tested...

Mobe91 commented 6 years ago

@tomalok Thanks for taking the time to test this. I discovered that this is not an issue with the instance size change; it also happens when I simply increase the number of service replicas via docker service update for an existing service. The scaling fails with no such image on all nodes that do not yet have a running instance of the same service (i.e. no local image is present). This happens on both manager and worker nodes. I have attached the CloudFormation template that I used to create the swarm. The installation of the ECR refresher happens in lines 1588 and 2215. Can you spot any issue with that?

tomalok commented 6 years ago

Apart from pulling the script from a different S3 bucket (with an additional comment at the top BEFORE the shebang #! line -- probably not the best place for a comment, but since it's piped into sh -ex it shouldn't hurt anything...), I'm not seeing anything obvious, other than that it's patching 18.12.1-ce-aws1 instead of 18.12.0-ce-aws1.

Testing by adding a worker node to my swarm... Although I can't SSH into that node (at least from outside the security group), I can see from the EC2 console log that it appears to have done the aws ecr get-login okay. Bumping my service's replicas up a notch... and it's running fine...

~ $ docker service update --replicas 4 jn-svc
jn-svc
overall progress: 4 out of 4 tasks
1/4: running   [==================================================>]
2/4: running   [==================================================>]
3/4: running   [==================================================>]
4/4: running   [==================================================>]
verify: Service converged
~ $ docker service ps jn-svc
ID                  NAME              IMAGE                                                        NODE                                          DESIRED STATE       CURRENT STATE            ERROR               PORTS
pcru1f5seswg        jn-svc.1          000000000000.dkr.ecr.us-west-2.amazonaws.com/jn/svc:latest   ip-172-31-99-999.us-west-2.compute.internal   Running             Running 8 hours ago
isbv94cfkblg         \_ jn-svc.1      000000000000.dkr.ecr.us-west-2.amazonaws.com/jn/svc:latest   n6y13g7qfru5pbnb75hfuyed6                     Shutdown            Running 8 hours ago
wk87ry9qyscc        jn-svc.2          000000000000.dkr.ecr.us-west-2.amazonaws.com/jn/svc:latest   ip-172-31-99-99.us-west-2.compute.internal    Running             Running 8 hours ago
qk85rj74xzxn         \_ jn-svc.2      000000000000.dkr.ecr.us-west-2.amazonaws.com/jn/svc:latest   75ug0racy5kjfp545k24oou73                     Shutdown            Running 8 hours ago
5qc8wbexhvxu        jn-svc.3          000000000000.dkr.ecr.us-west-2.amazonaws.com/jn/svc:latest   ip-172-31-9-999.us-west-2.compute.internal    Running             Running 8 hours ago
tmf4xrz4rbgl         \_ jn-svc.3      000000000000.dkr.ecr.us-west-2.amazonaws.com/jn/svc:latest   rsttef2xj2ls6dbgvbzsxuzba                     Shutdown            Running 8 hours ago
p81q4844e5fu        jn-svc.4          000000000000.dkr.ecr.us-west-2.amazonaws.com/jn/svc:latest   ip-172-31-99-9.us-west-2.compute.internal     Running             Running 14 seconds ago
Mobe91 commented 6 years ago

@tomalok I guess you created the service using --with-registry-auth? Then I think this issue explains it all...

I just tried this again. When I create a service on a 1-node swarm using docker service create --with-registry-auth, everything works fine. Immediately after that, I spawn an additional swarm node and scale the service up to 2 replicas. Again, everything works fine - but that is just because the credentials stored with the service at the time of its initial creation are still valid. This all falls apart when the service credentials expire, and I guess this is the problem I am experiencing all the time, e.g. during a cluster update. Of course, your ECR credential refresher script cannot do anything about this, because swarm only considers the stored service credentials instead of the up-to-date client credentials. 😞

This also explains why docker pulling the images worked for me while, at the same time and on the very same node, swarm reported No such image.

I don't see how this is supposed to work at all...

tomalok commented 6 years ago

Testing again; it's been a couple of days since I downsized the service and nodes to 3. Instead of adding a node and another replica, I terminated one of the node instances entirely... so, no opportunity to re-specify --with-registry-auth...

Starting by terminating a node that's not the master node... Container 3 came up on one of the remaining two nodes... though, of course, the image was already on that node. Verified that the replacement node currently does not have the image, but it does have credentials in its .docker/config.json. There doesn't seem to be a way to docker service update --mode global, so I added another replica, and with that I believe I have finally reproduced your problem.

I have some ideas how I might be able to work around this...

[moments later...] It appears to have sorted itself out while I wasn't looking, after initially complaining that it couldn't find the image... But upon closer inspection, it just chose a different node to put the replica on.

@Mobe91 is your service started with --replicas _n_ or --mode global?

tomalok commented 6 years ago

I think I've solved it, at least for services currently running in the swarm -- I've updated the script at https://s3-us-west-2.amazonaws.com/tomalok-d4aws/install-refresh-ecr-auth.sh

The trick is that after the ECR credentials are refreshed, the refresh script immediately does docker service update --with-registry-auth for every service in the swarm. The only thing this update does (I think!) is provide updated per-service credentials to all swarm docker daemons -- it shouldn't trigger any image updates, etc.
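In other words, the refresh step now looks roughly like this (a sketch; the actual install-refresh-ecr-auth.sh may differ in details such as the region):

eval "$(docker exec guide-aws sh -c 'aws ecr get-login --region us-west-2 --no-include-email' | tr -d '\r')"
for svc in $(docker service ls -q); do
  docker service update --with-registry-auth "$svc"    # re-distribute the fresh credentials to every service
done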

@Mobe91 - give the updated install script a try, and let me know if it works for you (or if you notice any other side effects).

Mobe91 commented 6 years ago

@tomalok I had the same idea and ran docker service update --with-registry-auth on every service before I upgraded my swarm to 18.03.0-ce yesterday. 😀 Sadly, it did not have any effect; the repositories could still not be accessed by the new nodes... So it seems that the stored credentials are not updated by this command.

tomalok commented 6 years ago

@Mobe91 I think it needs to be done right after the new node joins the swarm, not before. The docker service update --with-registry-auth will only update the containers' credentials on the existing nodes, and new nodes do not automagically receive this information as they come up.

tomalok commented 6 years ago

@Mobe91 could you try updating to the following template?

https://s3-us-west-2.amazonaws.com/tomalok-d4aws/18.03.0-ce-aws1-tomalok1.tmpl

...while this does not incorporate most of the things on my TODO list from my February post, here's what it does contain...

Mobe91 commented 6 years ago

I think it needs to be done right after the new node joins the swarm, not before. The docker service update --with-registry-auth will only update the containers' credentials on the existing nodes, and new nodes do not automagically receive this information as they come up.

I thought the point of storing the credentials in the service definition was that nodes could retrieve this information from a central location (from state maintained by the manager nodes) whenever they require it. But clearly I don't know exactly, so I will give your updated script a try 😄