aws / aws-codedeploy-agent

Host Agent for AWS CodeDeploy
https://aws.amazon.com/codedeploy
Apache License 2.0

CodeDeploy agent missing gem, all instances broken #122

Closed by Jinkxed 7 years ago

Jinkxed commented 7 years ago

So out of nowhere today we started getting failed deployments on the vast majority of our servers.

Looking at the instances and the codedeploy agent I see:

/etc/init.d/codedeploy-agent status
Bundler could not find compatible versions for gem "gli":
  In Gemfile:
    aws_codedeploy_agent was resolved to 0.1, which depends on
      gli (~> 2.5)

Could not find gem 'gli (~> 2.5)', which is required by gem 'aws_codedeploy_agent', in any of the sources.

The codedeploy agent is dead. Manually going into the directory, running bundle install, and rebooting the machine fixes it, but rebooting every instance in our AWS account is kind of a pain in the butt.

When I restart the agent after running bundle install, I see access denied messages until the box is rebooted. Any way to fix at least this part without rebooting?

bhare commented 7 years ago

Also having this issue

Jinkxed commented 7 years ago

The issue seems to be with an updated aws-sdk-core gem having conflicts with an older one.

Removing the old aws-sdk-core gem seems to resolve it.
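
For anyone hitting the same conflict, a rough sketch of that cleanup, assuming the stray copy lives in the system gem set (the 2.2.3 version below is only an example placeholder, so check gem list on your own host first):

  # Show every installed copy of aws-sdk-core so you can spot the conflicting one
  gem list aws-sdk-core --local

  # Remove the specific old version that conflicts (example version only)
  sudo gem uninstall aws-sdk-core --version 2.2.3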

asaf-erlich commented 7 years ago

That's a very odd error message and fix combination.

Can you please provide as much information as possible about your hosts (OS, kernel version, etc.)? It would very much help us in root-causing this as fast as possible.

Thanks, -Asaf

Jinkxed commented 7 years ago

Latest Amazon Linux AMI. Latest kernel versions.

Installed via the puppet module: https://github.com/walkamongus/puppet-codedeploy/

Had the aws-sdk-core gem locked to version 2.2.3, as any version later than that seemed to break codedeploy-agent. This was done a while back, and part of fixing the current issue was removing this lock.

Performing these steps resolved the issue:

Removed the puppet code locking the aws-sdk-core gem for the host OS to 2.2.3.

  cd /opt/codedeploy-agent
  bundle install
  /etc/init.d/codedeploy-agent restart

Prior puppet code that I was using to lock the host OS gem:

  # This is a pain-in-the-butt way to ensure aws-sdk-core doesn't have the newest 2.3.0
  # gem, which is incompatible with codedeploy-agent and straight up breaks it.
  exec { 'remove-aws-sdk-core-incompatible-versions':
    command => 'gem uninstall aws-sdk-core -aqIx',
    unless  => 'test `gem list --local | grep -q "aws-sdk-core (2.2.37)"; echo $?` -ne 1',
    path    => ['/usr/bin', '/bin'],
  }
  package { 'aws-sdk-core':
    ensure   => '2.2.37',
    provider => 'gem',
    require  => Exec['remove-aws-sdk-core-incompatible-versions'],
  }

asaf-erlich commented 7 years ago

Which version of ruby did you have installed? What version of the gem was incompatible? Is 2.2.37 the compatible one or the incompatible one?

asaf-erlich commented 7 years ago

Any chance you can help by providing steps to reproduce this error? I've been unable to reproduce it but I would really love to fix this permanently so it doesn't bother you or other customers again.

Jinkxed commented 7 years ago

I'd love to, but it will have to wait until Thursday. I'm heading out of town for the 4th.

I'll try and get everything together for you when I get back.

asaf-erlich commented 7 years ago

Thank you, have a good vacation and happy 4th

asaf-erlich commented 7 years ago

Also, either now or when you get back, could you try running the command ls /opt/codedeploy-agent/vendor/gems/ and tell me what you see as the output?

asaf-erlich commented 7 years ago

Also, I wonder if you knew what the output of gem list was when the error occurred. Since it's fixed now it isn't as relevant, but it might hold a clue as to why the errors are occurring. Maybe the way we're using bundler, which we added to resolve all these different dependency issues, is doing something weird when certain combinations of gems are installed.
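
For anyone who hits this again, a small diagnostic sketch along those lines could help before applying any workaround (the agent path assumes the default /opt/codedeploy-agent install):

  # Record which aws-sdk gems the host has installed
  gem list --local | grep -i aws-sdk > /tmp/aws-sdk-gems.txt

  # Check whether the agent's bundle still resolves against its Gemfile
  cd /opt/codedeploy-agent && bundle check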

Jinkxed commented 7 years ago

Thankfully it should still be like this on our previous image, so it should be easy to recreate. I'll check when I get back.

Jason733i commented 7 years ago

Any updates on the fix for this issue? Many customers have been affected over the holiday weekend.

ralfizzle commented 7 years ago

I also ran into this issue last week and reverted to the previous version of the code deploy agent for now.

tkling commented 7 years ago

We ran into this issue as well. It looks like the bad version of the agent was yanked? Newly spun up instances are on agent v1.0-1.1106 and can't find an update to install, whereas our broken instances were using agent v1.0-1.1225.
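
For anyone comparing hosts, one way to check which agent build an instance is running (the .version file and log path below are the defaults from the standard Amazon Linux install, so treat them as assumptions if your layout differs):

  # The installed build is recorded next to the agent itself
  cat /opt/codedeploy-agent/.version

  # It is also printed in the agent log on startup
  grep -i version /var/log/aws/codedeploy-agent/codedeploy-agent.log | tail -n 5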

bensie commented 7 years ago

AWS has acknowledged the issue on the status page


mmerkes commented 7 years ago

Starting at 9:30 AM PDT on June 30th, AWS CodeDeploy released an update to the AWS CodeDeploy agent that prevents agents on some Linux instances from polling for commands in the US-EAST-1 region. The AWS CodeDeploy team is actively working on a fix that will eliminate the need for manual actions on your behalf. In the meantime, if you are experiencing deployment timeout errors, you can fix the issue by uninstalling the CodeDeploy agent and then reinstalling it.
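
On Amazon Linux, that uninstall/reinstall roughly follows the standard install steps sketched below (the S3 bucket shown is the documented one for us-east-1, so substitute your own region):

  # Remove the currently installed agent package
  sudo yum erase codedeploy-agent -y

  # Download and run the regional install script to get the latest agent
  cd /tmp
  wget https://aws-codedeploy-us-east-1.s3.amazonaws.com/latest/install
  chmod +x ./install
  sudo ./install auto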

If you're running codedeploy-agent-1.0-1.1225, you can also fix the issue by running the following command:

sudo service codedeploy-agent restart

We will post an update as soon as the issue has been resolved.

cesc1989 commented 7 years ago

AWS guys sorted the issue. I just successfully deployed our API to production.

mmerkes commented 7 years ago

We released a fix on 7/5 that should have resolved this issue for all customers. You shouldn't need to take any action, as the auto-update feature in the agent should automatically pick up the latest version.