aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0
2.07k stars 606 forks source link

Initialize-ECSAgent Initialize-ECSHost needs attention #2272

Closed mfortin closed 4 years ago

mfortin commented 4 years ago

Summary

We use the Windows ECS Optimized AMI as a starting AMI, on which we run our automation to install different security scanning tools and other scripts. Once completed, we run sysprep and create a new AMI. That AMI is then used to start the EC2 hosts that are part of our ECS cluster.

The problem is that, when invoking Initialize-ECSAgent -EnableTaskIAMRole, it fails as there are multiple adapters found using the Get-NetAdapter

> Get-NetAdapter

Name                      InterfaceDescription                    ifIndex Status       MacAddress             LinkSpeed
----                      --------------------                    ------- ------       ----------             ---------
vEthernet (nat) 4         Hyper-V Virtual Ethernet Adapter #4          16 Up           00-15-5D-91-4B-C9        10 Gbps
vEthernet (nat)           Hyper-V Virtual Ethernet Adapter              9 Disconnected 00-15-5D-05-80-15        10 Gbps
vEthernet (nat) 3         Hyper-V Virtual Ethernet Adapter #3           7 Disconnected 00-15-5D-B8-F4-82        10 Gbps
Ethernet 2                Amazon Elastic Network Adapter                6 Up           02-38-0E-E0-7D-B8        25 Gbps
vEthernet (nat) 2         Hyper-V Virtual Ethernet Adapter #2           3 Disconnected 00-15-5D-B5-FC-07        10 Gbps

Description

Here are the logs from the user-data where Initialize-ECSAgent is executed:

2019-11-11T14:08:36Z - [INFO]:Runtime is already installed.
2019-11-11T14:08:37Z - [INFO]:Docker version 19.03.2, build c92ab06ed9
2019-11-11T14:08:37Z - [INFO]:Configuring ECS Host...
2019-11-11T14:08:37Z - [INFO]:Checking Hyper-V Network adapter
2019-11-11T14:08:52Z - [INFO]:Default vEthernet adapter found for nat. Using this adapter.
2019-11-11T14:08:52Z - [INFO]:VMNetwork adapter found with mac: 00-15-5D-05-80-15
2019-11-11T14:08:52Z - [INFO]:Checking for network adatper with mac: 00-15-5D-05-80-15
2019-11-11T14:08:57Z - [INFO]:Network adapter found.
2019-11-11T14:08:57Z - [INFO]:Network adapter found with mac 00-15-5D-05-80-15 on interface 9
2019-11-11T14:08:57Z - [INFO]:Getting subnet info from docker...
2019-11-11T14:09:00Z - [INFO]:Docker subnet: 172.24.176.0/20
2019-11-11T14:09:00Z - [INFO]:Docker gateway: 172.24.176.1
2019-11-11T14:09:03Z - [INFO]:Getting net ip address
2019-11-11T14:09:04Z - [INFO]:IP address not found.
Name                           Value
----                           -----
PrefixLength                   32
IPAddress                      169.254.170.2
InterfaceIndex                 9

2019-11-11T14:09:04Z - [INFO]:Creating new virtual network adapter ip...
2019-11-11T14:09:05Z - [INFO]:Virtual network adapter ip created: 169.254.170.2 169.254.170.2
2019-11-11T14:09:06Z - [INFO]:Waiting for it to become available on the device...
2019-11-11T14:09:08Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:08Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:10Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:10Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:12Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:12Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:14Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:14Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:16Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:16Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:29Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:29Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:30Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:30Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:32Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:32Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:34Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:36Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:41Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:41Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:42Z - [INFO]:IP address exists. Address state: Tentative
2019-11-11T14:09:42Z - [INFO]:Waiting for ip address "169.254.170.2" to become 'Preferred' on interface index 9... (sleeping 1)
2019-11-11T14:09:43Z - [ERROR]:There was a problem getting the net ip address for the virtual adapter

In C:\Program Files\WindowsPowerShell\Modules\ECSTools\ECSTools.psm1, the function Initialize-ECSHost has $( $vmNetAdapter = Get-NetAdapter 'vEthernet (nat)' -ErrorAction:Ignore ) 2>$null where it should be $( $vmNetAdapter = Get-NetAdapter 'vEthernet (nat)*' -ErrorAction:Ignore | Where-Object Status -eq 'Up') 2>$null

Expected Behavior

Expected behavior is that the adapter being available is used to setup the port proxy.

Observed Behavior

A disconnected adapter is used and port proxy failed to be configured.

Environment Details

Supporting Log Snippets

mfortin commented 4 years ago

My current workaround:

Get-NetAdapter -Name "vEthernet (nat)"  | Where-Object Status -eq Disconnected | Rename-NetAdapter -NewName "vEthernet (nat) - Disconnected"
Get-NetAdapter -Name "vEthernet (nat)*" | Where-Object Status -eq Up           | Rename-NetAdapter -NewName "vEthernet (nat)"
somujay commented 4 years ago

My current workaround:

Get-NetAdapter -Name "vEthernet (nat)"  | Where-Object Status -eq Disconnected | Rename-NetAdapter -NewName "vEthernet (nat) - Disconnected"
Get-NetAdapter -Name "vEthernet (nat)*" | Where-Object Status -eq Up           | Rename-NetAdapter -NewName "vEthernet (nat)"

Thanks for reporting the issue. We'll fix this.

mfortin commented 4 years ago

2019, sorry for not being specific.

EliseMistaken commented 4 years ago

A fix for this issue will be included in the January ECS-Optimized Windows AMI release.

fenxiong commented 4 years ago

New ECS-Optimized Windows AMI has been released and contains a fix for this. Please upgrade to the latest version, and let us know if you still have issue. Thanks!

mfortin commented 4 years ago

@fenxiong the version bundled on the newer Windows AMI is worst.

> Initialize-ECSHost
2020-01-27T17:05:53Z - [INFO]:Configuring ECS Host...
You cannot call a method on a null-valued expression.
At C:\Program Files\WindowsPowerShell\Modules\ECSTools\ECSTools.psm1:1247 char:13
+             $defaultNatAdapterList = $defaultNatAdapterList | Where-O ...
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvokeMethodOnNull
2020-01-27T17:05:53Z - [ERROR]:Multiple adapters found matching the default name 'vEthernet (nat)*', and more than one has a status of 'Up'. This is an unsupported configuration. Unable to select adapter.
Multiple adapters found matching the default name 'vEthernet (nat)*', and more than one has a status of 'Up'. This is an unsupported configuration. Unable to select adapter.
At C:\Program Files\WindowsPowerShell\Modules\ECSTools\ECSTools.psm1:68 char:5
+     Throw $Message
+     ~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (Multiple adapte...select adapter.:String) [], RuntimeException
    + FullyQualifiedErrorId : Multiple adapters found matching the default name 'vEthernet (nat)*', and more than one has a status of 'Up'. This is an unsupported configuration. Unable to select adapter.
> Get-NetAdapter
Name                      InterfaceDescription                    ifIndex Status       MacAddress             LinkSpeed
----                      --------------------                    ------- ------       ----------             ---------
vEthernet (nat)           Hyper-V Virtual Ethernet Adapter             15 Disconnected 00-15-5D-09-F3-1F        10 Gbps
vEthernet (nat) 5         Hyper-V Virtual Ethernet Adapter #5          14 Disconnected 00-15-5D-7D-78-A8        10 Gbps
vEthernet (nat) 7         Hyper-V Virtual Ethernet Adapter #7          20 Up           00-15-5D-0A-C4-41        10 Gbps
vEthernet (nat) 4         Hyper-V Virtual Ethernet Adapter #4          12 Disconnected 00-15-5D-BF-95-C8        10 Gbps
vEthernet (nat) 3         Hyper-V Virtual Ethernet Adapter #3          10 Disconnected 00-15-5D-01-EF-C6        10 Gbps
vEthernet (nat) 2         Hyper-V Virtual Ethernet Adapter #2           4 Disconnected 00-15-5D-1A-91-A2        10 Gbps
vEthernet (nat) 6         Hyper-V Virtual Ethernet Adapter #6           3 Disconnected 00-15-5D-D9-53-9B        10 Gbps

Here is a diff to make it work:

$ diff ECSTools-from-ami.psm1 ECSTools.psm1
1245c1245
<         if ($defaultNatAdapterList.Count -gt 1) {
---
>         if ( @($defaultNatAdapterList).Count -gt 1) {
1247,1248c1247,1248
<             $defaultNatAdapterList = $defaultNatAdapterList | Where-Object ($_.Status.ToLower() -eq 'up')
<             if ($defaultNatAdapterList.Count -eq 1) { # use that adapter
---
>             $defaultNatAdapterList = $defaultNatAdapterList | Where-Object Status -eq 'up'
>             if ( @($defaultNatAdapterList).Count -eq 1) { # use that adapter
1250c1250
<             } elseif ($defaultNatAdapterList.Count -eq 0) { # We can't know which adapter to wait for, fail fast since there should only be one
---
>             } elseif ( @($defaultNatAdapterList).Count -eq 0) { # We can't know which adapter to wait for, fail fast since there should only be one
fenxiong commented 4 years ago

@mfortin apologies for the issue. I will reopen this and we will look into it.

sandeepindraganti commented 4 years ago

@mfortin Sorry to hear that the new version introduced issues to you . We are working on fixing this issue.

kraigor commented 4 years ago

@sandeepindraganti any updates on this issue?

sandeepindraganti commented 4 years ago

Sorry for the inconvenience caused . We are planning to include the fix as part of our next ECS AMI release which is planned during March.

Mabiro commented 4 years ago

Hello @sandeepindraganti,

We are currently using the _Windows_Server-2019-English-Full-ECSOptimized-2020.03.18 AMI and are facing some sporadic issues with it. Is it the AMI release you mentionned in your comment?

I was redirected here when trying to find information about my issue, but I can't quite tell if I should open a new issue.

We are using this AMI along with the Initialize-ECSAgent user data script, but it often fails to register as an ECS-Agent. I am including the user data execution log I was able to get from one of my snapshots.

Let me know if this is unrelated, and I will open an new issue.

Thank you!

UserdataExecution.log

EliseMistaken commented 4 years ago

Hi Mabiro, the 03.18 AMIs do have the fix for this thread included. However, I haven't been able to reproduce your issue, and it seems to be different from the issue described in this thread. Some more information would be useful.

Is the EnableTaskIAMRoles flag enabled? How often are you seeing this issue? If you connect to a non-connected instance and run "Get-NetAdapter" does it include an adapter with the name "vEthernet (nat)"? Have you made any modifications to the AMI?

Thanks!

Mabiro commented 4 years ago

Hi Louie,

Thank you for the quick feedback.

Here's the user data script that we're using inside our launch configuration. We use an auto scaling group that we set to 1 or 0 depending on when we need to have an actual windows ecs-agent

<powershell>
      [Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")
      Initialize-ECSAgent -Cluster application-cluster -EnableTaskIAMRole -LoggingDrivers '[\"json-file\",\"awslogs\"]'
    </powershell>

We have multiple AWS accounts (one per client) and this issue happens sporadically on different accounts. Out of around 10 accounts, it happens at least twice a week. There doesn't seem to be an actual pattern. I tried reproducing it by terminating and restarting an instance multiple times, but to no avail. It just happens randomly on random accounts. So far, the fix has been to simply terminate the hanging machine and wait for the new one to boot and it usually connects right away, without a need to retry once more.

As for the AMI, we are using a terraform script which updates the launch configuration with the latest AMI (as found under /aws/service/ami-windows-latest/Windows_Server-2019-English-Full-ECS_Optimized in the parameter store). We aren't modifying it.

I currently do not have access to a machine that wasn't able to connect, but I will try to run Get-NetAdapter once this error happens again. I do have snapshots of volumes, maybe that is enough to do the test?

Thanks!

EDIT: For what it's worth, we had the issue before the 03.18 AMIS. So it doesn't seem like an issue that arose from this new version.

EliseMistaken commented 4 years ago

That user data script looks fine, and thank you for noting that it happens on other AMIs as well, that's really useful. That said this does seem unrelated, so please open up a separate issue to track this.

fierlion commented 4 years ago

The separate issue is being tracked here: https://github.com/aws/amazon-ecs-agent/issues/2416