aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

ECSTools - Error getting the VMNetwork adapter #2416

Closed Mabiro closed 4 years ago

Mabiro commented 4 years ago

Summary

There seems to be an issue with the Windows_Server-2019-English-Full-ECS_Optimized AMIs where the ECS-Agent sometimes fails to connect to the ECS cluster.

The issue appears random and occurs sporadically across different AWS accounts.

Description

We are using a Launch Configuration with the following user script data:

<powershell>
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")
Initialize-ECSAgent -Cluster application-cluster -EnableTaskIAMRole -LoggingDrivers '[\"json-file\",\"awslogs\"]'
</powershell>

The AMI used in this Launch Configuration is always the one found in the parameter store under /aws/service/ami-windows-latest/Windows_Server-2019-English-Full-ECS_Optimized. It is worth mentioning that although we are currently using 03.18, we started having this issue before its release.

It is concerning, as we will often wait for an ECS-Agent to register, but it might never happen due to some error when running ECSTools. I have attached the user data execution log.

Expected Behavior

The ECS-Agent reliably connects to the ECS cluster without errors.

Observed Behavior

The ECS-Agent will sometimes fail to register, and it then requires a complete stop and re-execution of the user data script before it registers correctly.

Environment Details

I have a snapshot of an EC2 volume where this issue happened, if it is of any use in debugging the issue.

Supporting Log Snippets

UserdataExecution.log

yhlee-aws commented 4 years ago

Thank you for reporting this issue. Looping in our Windows container team: @sandeepindraganti

sandeepindraganti commented 4 years ago

Thanks for reporting this issue. Our initial analysis indicates that this is an intermittent issue, and our team is working on it.
@Mabiro Thanks for offering your volume snapshot to help us. Were you able to reproduce this issue consistently with the volume snapshot you have?

Mabiro commented 4 years ago

@sandeepindraganti Sorry for the late response. We've been trying to restore the snapshot but we're having issues trying to reset the password. I've followed some documentation on how to reset EC2 passwords, but it doesn't seem to be working when creating a new instance from the snapshot.

Are you able to reproduce this issue at all? It happened twice again this week, on two different client accounts. If you can't reproduce it, I'll spend some more time trying to reproduce it consistently myself.

I've seen somewhere that some people do this in their user_data script:

Remove-Item -Recurse C:\ProgramData\Amazon\ECS\Cache
Import-Module ECSTools

Is it something we should consider doing? From what I am seeing, 2020.3.18 uses ECS tools 1.37 while 1.38 has been released. Would upgrading help?

sandeepindraganti commented 4 years ago

@Mabiro Our team has tried reproducing this issue and we didn't see the failure at our end. It would be great if you could share the snapshot and steps to reproduce. In addition, could you please help us with the following information:

  1. What is the instance type you are using at your end?
  2. Were you able to reproduce this issue if you run the Initialize-ECSAgent.ps1 script as a Windows scheduled task instead of having it as part of the instance userdata?

Thanks in advance for the help.

Mabiro commented 4 years ago

@sandeepindraganti

I have been looking at ECSTools.psm1 and comparing it with the user data execution log I shared in this issue, and something feels wrong regarding the 2016 vs 2019 implementations.

When looking at my logs, it looks like we're creating an APIPA vEthernetAdapter, which seems to be required only on a 2016 implementation, as shown by this code in ECSTools.psm1:

    $NatAdapterName2019 = 'vEthernet (nat)*'
    $NatAdapterName2016 = '*APIPA*'
    $( $defaultNatAdapterList = Get-NetAdapter -ErrorAction:Ignore | Where-Object {$_.Name -like $NatAdapterName2019} ) 2>$null
    if ($null -eq $defaultNatAdapterList) { # if no adapters exist, assume this is a Windows Server 2016 implementation
        # Windows Server 2016 
        $NatAdapterName = $NatAdapterName2016
        $( $defaultNatAdapterList = Get-NetAdapter -ErrorAction:Ignore | Where-Object {$_.Name -like $NatAdapterName2016} ) 2>$null
        if ($null -eq $defaultNatAdapterList) {
            # Create NAT adapter
            if (-not $(Create-VEthernetNatAdapter2016)) {
                return
            }
        }
    } else {
        $NatAdapterName = $NatAdapterName2019
    }

Is it possible that a timing issue means the adapter searched for under the 2019 adapter name is not ready yet? I see a few loops with delays in this script when it looks for particular adapters, but this lookup has none. It feels to me like the script wrongly assumes that our instance is running on 2016.
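
To illustrate the kind of retry loop described above, here is a minimal sketch of polling for the 2019 NAT adapter before falling back to the 2016 path. It is not the actual ECSTools code or the fix that was shipped. `Wait-NetAdapterByName`, its parameters, and the default attempt count and delay are hypothetical names and values chosen for illustration; only `Get-NetAdapter`, `Where-Object`, and `Start-Sleep` are real cmdlets.

```powershell
# Hypothetical helper (not part of ECSTools): polls for adapters whose name
# matches a wildcard pattern, retrying until one appears or attempts run out.
function Wait-NetAdapterByName {
    param(
        [Parameter(Mandatory)] [string] $NamePattern,
        [int] $MaxAttempts = 10,   # assumed value, tune for your environment
        [int] $DelaySeconds = 3,   # assumed value, tune for your environment
        # Injectable lookup so the loop can be exercised without real adapters.
        [scriptblock] $GetAdapters = { Get-NetAdapter -ErrorAction Ignore }
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $found = @(& $GetAdapters | Where-Object { $_.Name -like $NamePattern })
        if ($found.Count -gt 0) {
            return $found
        }
        # Only sleep if another attempt will follow.
        if ($attempt -lt $MaxAttempts) {
            Start-Sleep -Seconds $DelaySeconds
        }
    }
    return $null
}
```

A caller could then treat a `$null` result as "no 2019 adapter after waiting" instead of deciding on the first lookup:

```powershell
$natAdapters = Wait-NetAdapterByName -NamePattern 'vEthernet (nat)*'
if ($null -eq $natAdapters) {
    # Only now fall back to the Windows Server 2016 adapter name.
}
```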

If my findings are not relevant, here are the answers to your questions:

I'll keep trying to find a reliable way of reproducing this with one of our snapshots and let you know.

lalwanin2020 commented 4 years ago

@Mabiro We think we have identified the cause and are working on the fix.

Mabiro commented 4 years ago

@lalwanin2020 Thank you for the feedback.

Is there any release schedule regarding fixes made to the Windows ECS AMI, just so we can have a rough idea of when that issue could be resolved?

lalwanin2020 commented 4 years ago

@Mabiro We are working on the fix and will release it as early as possible.

lalwanin2020 commented 4 years ago

@Mabiro We have made the fix and it is available in this month's AMI. Please try it out.

yhlee-aws commented 4 years ago

We are closing this issue as a fix is available in the latest ECS Optimized Windows AMI. Please feel free to re-open it if the problem persists.