buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC
https://buildkite.com/docs/quickstart/elastic-ci-stack-aws
MIT License

agent failing on ecr get-login #178

Closed micheletest closed 7 years ago

micheletest commented 7 years ago

Especially in the morning, builds sometimes fail with: ERROR: denied: Your Authorization Token has expired. Please run 'aws ecr get-login' to fetch a new one. When this happens, the build ran on an agent that has been running for more than a day.

Here are some excerpts from a build where this happened. I edited the log and removed anything that identifies what the job does, so let me know if this is sufficient.

$ /etc/buildkite-agent/hooks/environment
Setting up the environment  1s
Sourcing CloudFormation environment...

Starting an SSH Agent...
Agent pid 3354

Running global pre-command hook 0s
$ /etc/buildkite-agent/hooks/pre-command
Running build script    1s
$ docker-compose -f docker-compose.yml down
docker-compose -f docker-compose.yml run our_command bash -c "command is here"
docker-compose -f docker-compose.yml down
Removing network daily_default
WARNING: Network daily_default not found.
Creating network "daily_default" with the default driver
Creating daily-server_1
Pulling container_name (container details)...
ERROR: denied: Your Authorization Token has expired. Please run 'aws ecr get-login' to fetch a new one.

lox commented 7 years ago

Hi @micheletest, it sounds like something is going wrong with ECR authentication. Are you running a recent stack version with the ECRAccessPolicy parameter?

lox commented 7 years ago

ECR auth tokens are valid for 12 hours, so it sounds like re-authentication somehow isn't happening.
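
To make the 12-hour expiry concrete, here is a minimal sketch of a hook helper that logs in again once the last login is older than the token lifetime. It is not the stack's actual hook: the function names, the stamp-file approach, and the `ECR_TOKEN_TTL` variable are all hypothetical, and it assumes an AWS CLI version that still provides `aws ecr get-login` (the command named in the error message).

```shell
#!/bin/bash
# Sketch only: re-run the ECR login once the previous token is too old.
# ECR_TOKEN_TTL, both function names and the stamp file are hypothetical.

ECR_TOKEN_TTL=43200   # 12 hours, in seconds

# ecr_token_expired ISSUED_AT NOW
# Succeeds (exit 0) when the token issued at ISSUED_AT is at least
# ECR_TOKEN_TTL seconds old at time NOW (both in epoch seconds).
ecr_token_expired() {
  _issued=${1:-0}
  _now=$2
  [ $(( _now - _issued )) -ge "$ECR_TOKEN_TTL" ]
}

# ecr_login_if_needed STAMP_FILE
# Reads the time of the last login from STAMP_FILE (0 if missing or empty)
# and logs in again only when that login has expired.
ecr_login_if_needed() {
  _stamp_file=$1
  _current=$(date +%s)
  _last=0
  if [ -s "$_stamp_file" ]; then
    _last=$(cat "$_stamp_file")
  fi
  if ecr_token_expired "$_last" "$_current"; then
    # 'aws ecr get-login' prints a 'docker login ...' command; eval runs it.
    eval "$(aws ecr get-login)" && echo "$_current" > "$_stamp_file"
  fi
}

# Example call from a pre-command hook (path is an assumption):
# ecr_login_if_needed /tmp/ecr-login.stamp
```

In practice you would probably refresh somewhat before the full 12 hours have elapsed, so that a long-running job does not start with a token that is about to expire.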

micheletest commented 7 years ago

Hi @lox, I don't think we are setting the ECRAccessPolicy, which would explain this. I am out of the office, but I can confirm early next week whether that fixes it. Thanks for looking at this.

lox commented 7 years ago

@micheletest alternatively, set AWS_ECR_LOGIN=true in your env vars.
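
As an illustration of that suggestion, the snippet below shows the variable being exported and a hypothetical check that a hook could use to decide whether to log in to ECR. Where the export line belongs depends on how your agents load their environment, and `should_ecr_login` is an invented helper, not a function from the stack.

```shell
#!/bin/bash
# Opt in to ECR login. Put this wherever your agents' env vars are defined
# (the exact file or location is an assumption and depends on your setup).
export AWS_ECR_LOGIN=true

# Hypothetical helper: succeeds only when AWS_ECR_LOGIN is exactly "true".
# An unset or empty variable is treated as "false".
should_ecr_login() {
  [ "${AWS_ECR_LOGIN:-false}" = "true" ]
}

if should_ecr_login; then
  echo "ECR login enabled"
fi
```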

micheletest commented 7 years ago

@lox Thanks - setting the AWS_ECR_LOGIN in your env vars worked.

lox commented 7 years ago

Great! Updating to the latest stack should fix it too.

On Wed., 30 Nov. 2016 at 8:28 am, Michele Martone notifications@github.com wrote:

@lox Thanks - setting the AWS_ECR_LOGIN in your env vars worked.

lox commented 7 years ago

Urgh, this is a regression. Will fix.

lox commented 7 years ago

Fixed in master.