aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

Instance registers the docker0 ip address #553

Open nunofernandes opened 11 months ago

nunofernandes commented 11 months ago

Hello,

We have an on-prem server (Rocky Linux 8) running the SSM agent (amazon-ssm-agent-3.2.2016.0-1.x86_64).

In AWS Fleet Manager, that instance is registered with the IP address of docker0 (172.17.0.1):

image

It was working fine until we lost DHCP for a few hours. Now, even after restarting the SSM agent, the docker0 IP is always the one that gets registered.

If I run ifconfig docker0 down; systemctl restart amazon-ssm-agent.service; ifconfig docker0 up, the correct IP is registered, but after some time SSM goes back to showing the docker0 IP address.

I think it's the code at agent/platform/platform.go that is sorting the interfaces differently (guessing):

    if interfaces, err = net.Interfaces(); err == nil {
        interfaces = filterInterface(interfaces)
        sort.Sort(byIndex(interfaces))
        candidates := make([]net.IP, 0)

What would be the best option here (except rebooting the server)?

Aperocky commented 1 month ago

Thanks for reaching out regarding this. We recently restructured our interface IP reporting to avoid quadratic growth in computation: on Linux, the Go syscall layer dumps the entire route table on every interface.Addrs() call, which leads to escalating CPU usage on systems with many network interfaces.

That change does not address this problem, as the order of the interfaces returned is decided by the OS. However, we do want to verify whether the problem still exists in agent version 3.3.1142.0; if it does, we can further evaluate a fix similar to your PR.

nunofernandes commented 1 month ago

Hello. I tried with the latest one:

# yum update https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
....
Upgraded:
  amazon-ssm-agent-3.3.987.0-1.x86_64

# rpm -qi amazon-ssm-agent
Name        : amazon-ssm-agent
Version     : 3.3.987.0
Release     : 1
Architecture: x86_64
Install Date: 2024-10-23T17:24:47 CEST
Group       : Amazon/Tools
Size        : 127685837
License     : Apache License, Version 2.0
Signature   : RSA/SHA1, 2024-09-23T12:51:02 CEST, Key ID bc1f495c97dd04ed
Source RPM  : amazon-ssm-agent-3.3.987.0-1.src.rpm
Build Date  : 2024-09-23T11:56:15 CEST
Build Host  : build.amazon.com
Relocations : (not relocatable)
Packager    : Amazon.com, Inc. <http://aws.amazon.com>
Vendor      : Amazon.com
URL         : http://docs.aws.amazon.com/ssm/latest/APIReference/Welcome.html
Summary     : Manage EC2 Instances using SSM APIs
Description :
This package provides Amazon SSM Agent for managing EC2 Instances using SSM APIs

That is not the version you mentioned (3.3.1142.0); I'm waiting for that one to land at the RPM repo/URL. With the version currently available, it still happens:

image

Once that version lands in the repo, I can try it. Do you know when it will be available?

Aperocky commented 1 month ago

The version is deploying through regions now and will be globally available sometime next week. For testing purposes, you can get the latest version here:

$ sudo yum update https://s3.eu-north-1.amazonaws.com/amazon-ssm-eu-north-1/latest/linux_amd64/amazon-ssm-agent.rpm
Last metadata expiration check: 1 day, 19:49:40 ago on Mon Oct 21 19:52:51 2024.
amazon-ssm-agent.rpm                                                                              8.9 MB/s |  24 MB     00:02
Dependencies resolved.
==================================================================================================================================
 Package                            Architecture             Version                         Repository                      Size
==================================================================================================================================
Upgrading:
 amazon-ssm-agent                   x86_64                   3.3.1142.0-1                    @commandline                    24 M

Transaction Summary
==================================================================================================================================
Upgrade  1 Package

nunofernandes commented 1 month ago

Hello,

Just tested that new version and I still get the ip address from docker0:

image

So, the issue is still there :(

Aperocky commented 1 month ago

I see. It looks like we would need a dedicated way to filter this out if we decide to go there. When this feature was first designed, we did not define exactly which interface's address to return. We will evaluate potential changes and/or documentation to define this feature. One of the first things that comes to mind is to use the default network interface (NI), but the way to determine it differs across the platforms we support. Since the Go library does not have that capability out of the box, we would have to implement potentially unstable, OS-specific methods and maintain them as those OSes evolve. That needs further evaluation before we take it up.
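For context, one portable approximation sometimes used in Go programs is to open a UDP socket toward a remote address and read back the local address the kernel chose. This is only a sketch and not something the agent is stated to do. The outboundIP helper is hypothetical, and the target 192.0.2.1:9 is a documentation-range placeholder. The result reflects the route to that particular target rather than a true "default NI", and the call fails when no route exists.

```go
package main

import (
	"fmt"
	"net"
)

// outboundIP returns the local IP the kernel would use to reach target.
// Dialing UDP only selects a route and binds a local address; no packet
// is sent until something is written to the connection.
func outboundIP(target string) (net.IP, error) {
	conn, err := net.Dial("udp", target)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	addr, ok := conn.LocalAddr().(*net.UDPAddr)
	if !ok {
		return nil, fmt.Errorf("unexpected local address type %T", conn.LocalAddr())
	}
	return addr.IP, nil
}

func main() {
	// 192.0.2.1 is a placeholder from the documentation range (TEST-NET-1);
	// a real caller would pick an address it actually needs to reach.
	ip, err := outboundIP("192.0.2.1:9")
	if err != nil {
		fmt.Println("no route to target:", err)
		return
	}
	fmt.Println("source address for outbound traffic:", ip)
}
```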

nunofernandes commented 1 month ago

That is why I sent the patch https://github.com/aws/amazon-ssm-agent/pull/555 that would allow the user to exclude certain interfaces that they know aren't meant to be used. Let me know if that is the route forward and if so, I can rebase the patch with the current codebase.

Aperocky commented 1 month ago

Unfortunately, that route is blocked now, as we no longer filter by interface. The reason is the Go syscall behavior of dumping the entire route table when looking up the properties of a single interface. This means that for hosts with a large number of interfaces (e.g. a high number of containers), CPU consumption becomes quadratic if we loop over and filter interfaces, and it is very important to us that we keep our resource consumption low.

nunofernandes commented 1 month ago

Well, what about the following scheme? (I haven't looked at the current codebase, so I'm just in suggestion mode here.)

  1. Fetch the IP addresses of the EXCLUDED interfaces. There shouldn't be many interfaces to loop over, and getting the IP address of a known interface should be faster (maybe without looping through all interfaces and causing the CPU consumption).
  2. Use the current code to find the IP address that would be announced to SSM.
  3. If the IP found in step 2 matches one of those in the EXCLUDED list, continue to the next IP.

Would that work?