aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

Agent's core manager can't be nil #118

Open 75lb opened 6 years ago

75lb commented 6 years ago

Hi. I'm working through this tutorial, creating everything from scratch as described using the Amazon Linux 2 AMI, but it fails at step 3b because SSM is unable to find an instance with a correctly configured agent.

Taking a closer look at my instance, I'm seeing this in the logs. Is this a bug or am I missing something?

$ sudo cat /var/log/amazon/ssm/errors.log
2018-09-17 13:12:46 ERROR [Stop @ agent.go.99] Agent's core manager can't be nil
2018-09-17 17:09:50 ERROR [Stop @ agent.go.99] Agent's core manager can't be nil
2018-09-17 17:24:28 ERROR [ModuleRequestStop @ session.go.211] [MessageGatewayService] stopping controlchannel with error, %!s(<nil>)
dineshboddula commented 6 years ago

Hello, I had faced the same error logs.

Please try deleting the ssm directory, /var/lib/amazon/ssm, and then restart the SSM agent.

In my case, SSM had stored the fingerprints of a previous version in this directory. I had to delete it and restart the agent; then it registered the instance in managed instances and Session Manager was visible.
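A minimal sketch of that fix, with one swapped-in choice: it moves the directory aside rather than deleting it, so the old state can be restored if this does not help. `backup_state_dir` is a hypothetical helper, and the systemd unit name `amazon-ssm-agent` in the usage comment is an assumption that may not hold on every distribution or install method.

```shell
#!/usr/bin/env bash
# Sketch only: move the agent's local state directory aside instead of
# deleting it. backup_state_dir is a hypothetical helper, not part of the agent.

backup_state_dir() {
  local state_dir="${1:-/var/lib/amazon/ssm}"
  local backup
  backup="${state_dir}.bak.$(date +%s)"
  if [ ! -d "$state_dir" ]; then
    echo "no state directory at $state_dir" >&2
    return 1
  fi
  # Print the backup location on success so it can be restored later.
  mv "$state_dir" "$backup" && echo "$backup"
}

# Usage, as root (assumes a systemd unit named amazon-ssm-agent):
#   systemctl stop amazon-ssm-agent
#   backup_state_dir /var/lib/amazon/ssm
#   systemctl start amazon-ssm-agent
```

Stopping the agent before touching the directory avoids moving files while the agent may still be writing to them.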

nehalaws commented 5 years ago

Thank you for posting the issue. Is it possible to attach all agent logs? If the instance is not shown as a managed instance, that indicates an issue with the IAM role attached to it. The logs will help identify whether the agent was able to reach the SSM service.

75lb commented 5 years ago

I don't have the instance or logs anymore, I'm afraid. I worked through this tutorial creating everything (including IAM roles) from scratch.. does the tutorial work correctly for you?

nehalaws commented 5 years ago

Yes I just followed the tutorial and found the instance in the managedInstances list.

rk295 commented 5 years ago

Hi, I'm also seeing this error in the log. I am using the OpenVPN Appliance, which is Ubuntu 16.04 amd64 with OpenVPN installed.

I see the following in the log:

==> errors.log <==
2018-10-18 15:44:39 ERROR [Stop @ agent.go.100] Agent's core manager can't be nil
==> hibernate.log <==
2018-10-18 15:50:30 ERROR Health ping failed with error - UnrecognizedClientException: The security token included in the request is invalid.
    status code: 400, request id: 040bcb0e-4bda-4e79-9756-3e731d414840

The instance is booted in eu-west-1 with an instance profile which has the AWS managed policy AmazonEC2RoleforSSM attached. My CentOS instances have the same instance role attached and are working fine.

I've verified I can curl the meta-data url from within the instance and that is working. I'm out of ideas :(

I'm using version 2.3.169.0-1 of the .deb downloaded from the bucket linked from the tutorial.

Can this issue be reopened please? I notice it was closed with no resolution :(

nehalaws commented 5 years ago

2018-10-18 15:50:30 ERROR Health ping failed with error - UnrecognizedClientException: The security token included in the request is invalid. status code: 400, request id: 040bcb0e-4bda-4e79-9756-3e731d414840

This error indicates that the agent is not able to reach the SSM service and is in hibernation mode. Can you please verify that the instance can reach ssm.eu-west-1.amazonaws.com?

rk295 commented 5 years ago

Thanks for re-opening. If I curl that host, I see the following:

root@openvpnas2:~# curl -kv https://ssm.eu-west-1.amazonaws.com/
*   Trying 52.94.217.30...
* Connected to ssm.eu-west-1.amazonaws.com (52.94.217.30) port 443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_CBC_SHA1
*    server certificate verification SKIPPED
*    server certificate status verification SKIPPED
*    common name: ssm.eu-west-1.amazonaws.com (matched)
*    server certificate expiration date OK
*    server certificate activation date OK
*    certificate public key: RSA
*    certificate version: #3
*    subject: CN=ssm.eu-west-1.amazonaws.com
*    start date: Mon, 13 Aug 2018 00:00:00 GMT
*    expire date: Tue, 13 Aug 2019 12:00:00 GMT
*    issuer: C=US,O=Amazon,OU=Server CA 1B,CN=Amazon
*    compression: NULL
* ALPN, server did not agree to a protocol
> GET / HTTP/1.1
> Host: ssm.eu-west-1.amazonaws.com
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 404 Not Found
< x-amzn-RequestId: 44b9f250-070b-47d4-a677-32698e16273b
< Content-Length: 29
< Date: Wed, 24 Oct 2018 21:20:39 GMT
<
<UnknownOperationException/>
* Connection #0 to host ssm.eu-west-1.amazonaws.com left intact

Which all looks ok I guess? After bouncing the agent I get the following in amazon-ssm-agent.log:

2018-10-18 16:33:03 INFO Entering SSM Agent hibernate - UnrecognizedClientException: The security token included in the request is invalid.
    status code: 400, request id: 64616960-799b-49a6-a7fb-9b1000cde4f1
2018-10-24 21:20:47 INFO Got signal:terminated value:0xb91950
2018-10-24 21:20:47 INFO Stopping agent
2018-10-24 21:20:47 ERROR Agent's core manager can't be nil
2018-10-24 21:20:47 INFO Entering SSM Agent hibernate - NoCredentialProviders: no valid providers in chain. Deprecated.
    For verbose messaging see aws.Config.CredentialsChainVerboseErrors

Thanks for replying @nehalaws !
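The two hibernate messages in that log describe different credential states: `UnrecognizedClientException` means credentials were sent and rejected, while `NoCredentialProviders` means the SDK found no credentials at all. A rough sketch for telling them apart when scanning a log follows; the function name and the output labels are made up for illustration.

```shell
#!/usr/bin/env bash
# classify_ssm_errors: read agent log lines on stdin and print one label per
# line that matches a message seen in this thread. Labels are hypothetical.

classify_ssm_errors() {
  local line
  while IFS= read -r line; do
    case "$line" in
      *UnrecognizedClientException*) echo "invalid-token" ;;
      *NoCredentialProviders*)       echo "no-credentials" ;;
      *"core manager can't be nil"*) echo "nil-core-manager-on-stop" ;;
    esac
  done
}

# Usage (log path taken from the first post; adjust as needed):
#   sudo cat /var/log/amazon/ssm/amazon-ssm-agent.log | classify_ssm_errors | sort | uniq -c
```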

rk295 commented 5 years ago

Ok, so... I got to the bottom of this :)

There was a sequence of events, which all had to happen in the correct order for this to become an issue. It's solved now, though.

  1. Terraform created the IAM role and attached the Policy
  2. Terraform booted the instance (without SSM installed).
  3. A Terraform mistake caused the role to be deleted
  4. The above mistake was noticed and fixed.
  5. Everything "looked" ok from the AWS console.
  6. The SSM agent was installed, producing the error above!

However, it turned out that the instance was actually referencing the now-deleted role.

The fix was to remove the role from the instance and then attach the correct (newly created) one. After booting the instance, SSM started to work.
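For anyone doing the detach and re-attach from the AWS CLI rather than the console, a sketch under some assumptions: the function names are hypothetical, the IDs in the usage comment are placeholders, and the instance profile name may differ from the role name in your setup. The `AWS` variable exists only so the commands can be previewed with `AWS=echo`.

```shell
#!/usr/bin/env bash
# Sketch only. AWS defaults to the real CLI; set AWS=echo to print the
# commands instead of running them.
AWS="${AWS:-aws}"

# List the instance profile associations for one instance, to find the
# association ID of the stale profile.
list_profile_associations() {
  "$AWS" ec2 describe-iam-instance-profile-associations \
    --filters "Name=instance-id,Values=$1"
}

# swap_instance_profile ASSOCIATION_ID INSTANCE_ID PROFILE_NAME
# Detach the old association, then attach the named instance profile.
swap_instance_profile() {
  "$AWS" ec2 disassociate-iam-instance-profile --association-id "$1" &&
    "$AWS" ec2 associate-iam-instance-profile \
      --instance-id "$2" --iam-instance-profile "Name=$3"
}

# Usage (placeholder IDs):
#   list_profile_associations i-0123456789abcdef0
#   swap_instance_profile iip-assoc-0123456789abcdef0 i-0123456789abcdef0 my-profile
```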

I spotted this because curling the meta-data endpoint for the security-credentials URI failed with:

% curl http://169.254.169.254/latest/meta-data/iam/security-credentials/basic-ec2
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>

However, once the new role was attached and the server booted, that same URL responded with:

% curl http://169.254.169.254/latest/meta-data/iam/security-credentials/basic-ec2/
{
  "Code" : "Success",
  "LastUpdated" : "2018-10-25T14:23:25Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "REDACTED",
  "SecretAccessKey" : "REDACTED",
  "Token" : "REDACTED",
  "Expiration" : "2018-10-25T20:58:23Z"
}
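The same check can be scripted so the 404 page and the success document are told apart without reading the output by eye. This is a sketch: `creds_ok` and `check_role_creds` are hypothetical names, and it assumes plain unauthenticated metadata requests work as they do in the curl commands above (an instance that requires session tokens for metadata requests would need an extra step that is not shown).

```shell
#!/usr/bin/env bash
# creds_ok: read a metadata response body on stdin; succeed only if it
# contains "Code" : "Success".
creds_ok() {
  grep -Eq '"Code"[[:space:]]*:[[:space:]]*"Success"'
}

# check_role_creds ROLE: fetch the role's credentials document from the
# instance metadata endpoint used in this thread and test it.
check_role_creds() {
  local role="$1"
  curl -s --max-time 2 \
    "http://169.254.169.254/latest/meta-data/iam/security-credentials/${role}" \
    | creds_ok
}

# Usage (role name from this thread):
#   check_role_creds basic-ec2 && echo "credentials available" || echo "no credentials for role"
```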

I'm typing this all up in this issue in the hope it might help somebody else who encounters this bizarre sequence of events!

NoahSDS commented 5 years ago

I am getting the "Entering SSM Agent hibernate - UnrecognizedClientException: The security token included in the request is invalid." error in my amazon-ssm-agent.log file. I've tried the fixes listed above and have not been able to fix it.

I am able to get data from http://169.254.169.254/latest/meta-data/iam/security-credentials/web-role and I can connect to https://ssm.us-west-2.amazonaws.com/

This is a Windows instance. Another instance on the same subnet uses the same role and is working fine with SSM.

damonmaria commented 3 years ago

For others trying to solve this: for me it took a combination of the above and then having to wait quite some time (30 minutes to 1 hour). After all that, a final restart of the SSM service did it.