aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Update Windows Nodes ContainerD version #2163

Open tip-dteller opened 1 year ago

tip-dteller commented 1 year ago

Tell us about your request

Update EKS managed Windows nodes to use a more recent version of containerd.

Which service(s) is this request for?

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Background: We have a Windows solution that we migrated from EC2 machines into containers. The solution requires a graceful shutdown process, as is common for any application. Windows handles termination signals differently from Linux: the Windows equivalent of SIGTERM is CTRL_SHUTDOWN_EVENT. For more details:

Our application is rather old and is built on .NET Framework 4.7.2, so our container image is based on Windows Server Core 2019.

The issue: We have an application that needs to finish its processing and then shut down safely. As per the Kubernetes documentation, you can use the field 'terminationGracePeriodSeconds', which should wait N seconds before forcibly killing the container/pod, unless the process inside exits before that time. This field is honored by the kubelet and Kubernetes but has no effect on the internals of the container.

Hence we dug further and found that you have to set a registry key to make Windows wait longer before shutting down the process. This key defaults to 5 seconds, meaning that, whatever you run on Windows, unless the key is changed before container startup, Windows will shut down after 5 seconds. You can read about it in this comment: https://github.com/moby/moby/issues/25982#issuecomment-426441183

So we set that key to a different value, say 30000 ms, which is 30 seconds. However, we discovered that it's not honored by EKS, on any version.

Here's why: Following this issue - https://github.com/microsoft/Windows-Containers/issues/164

It is noted there that hcsshim, the library used by containerd, did not respect that shutdown call; support was implemented in hcsshim 0.9.7 and baked into containerd 1.6 and 1.7. You can read about it here

Digging deeper, it was added in containerd 1.6.19 at the earliest.

EKS Windows Core 2019 and 2022 support only containerd 1.6.6, as specified here

In essence, this means that EKS doesn't natively support graceful shutdown for Windows containers.
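To make the version gap concrete: containerd versions have to be compared numerically, component by component, because a plain string comparison would rank 1.6.6 above 1.6.19. Below is a minimal Python sketch of such a check. The function names are hypothetical, and the 1.6.19 threshold is simply the version quoted above, not something verified against the hcsshim release notes.

```python
def parse_version(text):
    """Turn '1.6.19' or 'v1.6.19' into a tuple of ints, e.g. (1, 6, 19)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Assumption: earliest containerd release carrying the shutdown fix,
# as stated in this issue.
MIN_FIXED = parse_version("1.6.19")


def has_graceful_shutdown_fix(containerd_version):
    """Return True if the given version is at or above the assumed threshold."""
    return parse_version(containerd_version) >= MIN_FIXED


if __name__ == "__main__":
    for version in ("1.6.6", "1.6.18", "1.6.19", "1.6.24"):
        print(version, has_graceful_shutdown_fix(version))
```

Under these assumptions, both 1.6.6 and 1.6.18 fall below the threshold, while 1.6.19 and later pass. Pre-release suffixes such as "-rc.1" are not handled by this sketch.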

Are you currently working around this issue?

KlwntSingh commented 1 year ago

Hello @tip-dteller, support for containerd 1.6.19 in EKS Windows AMIs is a work in progress. An EKS Windows AMI with containerd 1.6.18 will be released in either October or November. Once that release is complete, we will prioritize releasing containerd 1.6.19.

tip-dteller commented 1 year ago

So as it stands right now, the 2 viable options are the ones I have already listed?

kylzhng1 commented 1 year ago

Hi @tip-dteller, can you clarify what you mean by the second option?

We recommend using the EKS optimized AMI as the base AMI and using a PowerShell script to install a newer containerd. Please note that additional changes to the AMI might be required to make the new containerd compatible.

tip-dteller commented 1 year ago

Hi @kylzhng1, Option 1 would be exactly what you proposed with the EKS optimized AMI. I don't see any other option here, as it's already tailored for EKS, and of course it means playing around with the containerd version to get it right until AWS officially supports it.

Option 2:

Following the logic we need, we ran some tests. We plan to implement a Kubernetes lifecycle hook, the PreStop hook. Why? We ran a test on the current EKS 1.24 with containerd 1.6.6 on a Karpenter-managed node, and we saw the following happen to the pod:

We can summarize it as follows: Pod runs -> pod is killed -> hook prevents termination from happening -> pod continues to run but in Terminating state -> hook completes -> pod dies.

So we used this in our favor: we set high values for sleeping, and we have the app publish when it is done with Main() and exits. The lifecycle hook watches that Main() process, and if it finishes before the count runs out, the hook forcibly shuts down.
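The hook logic described above amounts to a bounded wait: poll for "the app is done" and return as soon as that is true, or when the time budget runs out. The sketch below shows only that control flow in Python; on a Windows container the real hook would more likely be a PowerShell script. How the application publishes completion is not stated in this thread, so the marker file and its path are purely hypothetical, as are the function names.

```python
import os
import time


def wait_for_exit(is_done, timeout_s, poll_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Block until is_done() returns True or timeout_s seconds elapse.

    Returns True if the app finished in time, False on timeout.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_done():
            return True
        sleep(poll_s)
    # One last check exactly at the deadline.
    return is_done()


def marker_exists(path=r"C:\app\done.marker"):
    """Hypothetical completion signal: the app writes this file when Main() returns."""
    return os.path.exists(path)


# Example use inside a hook (the timeout should stay below
# terminationGracePeriodSeconds so the kubelet does not kill the pod first):
#     finished = wait_for_exit(marker_exists, timeout_s=110)
```

Keeping the hook's timeout below the pod's terminationGracePeriodSeconds matters because the grace period also covers the time spent in the PreStop hook.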

ChrisMcKee commented 1 year ago

$ProgressPreference = "SilentlyContinue"
# Ensure the required module is installed
if (-not (Get-Module -ListAvailable -Name Microsoft.PowerShell.Archive)) {
    Install-Module Microsoft.PowerShell.Archive -Scope CurrentUser -Force
}

# Ensure the required utility is available
if (-not (Get-Command -Name 'tar' -ErrorAction SilentlyContinue)) {
    throw "The 'tar' command is not available. Please ensure it's installed."
}

# Create the tempcontainerd folder if it doesn't exist
$destFolder = "c:\tempcontainerd"
if (-not (Test-Path $destFolder)) {
    New-Item -Path $destFolder -ItemType Directory
}

# Download the .tar.gz file
$url = "https://github.com/containerd/containerd/releases/download/v1.6.24/cri-containerd-cni-1.6.24-windows-amd64.tar.gz"
$downloadPath = Join-Path $destFolder "cri-containerd-cni-1.6.24-windows-amd64.tar.gz"
Invoke-WebRequest -Uri $url -OutFile $downloadPath

# Un-tar-gz the file
tar -zxvf $downloadPath -C $destFolder

# # Take ownership
# takeown /F "C:\Program Files\containerd\containerd-shim-runhcs-v1.exe"

# # Grant permissions to the current user
# icacls "C:\Program Files\containerd\containerd-shim-runhcs-v1.exe"

# Stop any running shim processes so their binaries can be overwritten
# (no error is raised if none are running)
Get-Process -Name "containerd-shim-runhcs-v1" -ErrorAction SilentlyContinue | Stop-Process -Force

Stop-Service WinHttpAutoProxySvc
Stop-Service containerd

# Copy the extracted contents to c:\Program Files\containerd
# (skip the downloaded archive itself)
Copy-Item -Path "$destFolder\*" -Destination "c:\Program Files\containerd" -Exclude "*.tar.gz" -Recurse -Force

Start-Service WinHttpAutoProxySvc
Start-Service containerd

Stopping the service seemed to be the main issue with doing this inline; it would probably work better during custom image creation, so the machine has had a chance to reboot.

Not sure if this actually fixes your issue. I was trying to solve a different one 😂

tzifudzi commented 11 months ago

The script offered by @ChrisMcKee could help as a workaround for upgrading containerd 'on the fly', that is, by stopping the service and manually upgrading with the latest binaries. I haven't tested it to confirm.

In the interim, the EKS Windows team have upgraded containerd to 1.6.18. See https://docs.aws.amazon.com/eks/latest/userguide/eks-ami-versions-windows.html.

tip-dteller commented 11 months ago

@tzifudzi I see the change, thanks! I also wonder: why does it make sense for the kubelet service on Windows to have the startup type Manual? We had a node stop working for a reason that is still unknown, and kubelet never came back up.

PS C:\Windows\system32> (get-service kubelet).startType
Manual

There's also this at startup, and it doesn't appear every time a Windows node spawns.

Warning  ContainerGCFailed        58m                kubelet          rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing open //./pipe/containerd-containerd: The system cannot find the file specified."

tip-dteller commented 11 months ago

@tzifudzi - We have observed over time that a Windows application entering a "CrashLoopBackOff" cycle somehow leads the node to crash.

----     ------                 ----                    ----       -------
  Warning  ImageGCFailed          3m38s (x53 over 4h23m)  kubelet    rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing open //./pipe/containerd-containerd: The system cannot find the file specified."

We didn't observe this behavior with the earlier containerd version, but it may have been there.


Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 08 Nov 2023 11:57:29 +0200   Wed, 08 Nov 2023 00:06:43 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 08 Nov 2023 11:57:29 +0200   Wed, 08 Nov 2023 00:06:43 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 08 Nov 2023 11:57:29 +0200   Wed, 08 Nov 2023 00:06:43 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Wed, 08 Nov 2023 11:57:29 +0200   Wed, 08 Nov 2023 07:26:32 +0200   KubeletNotReady              [container runtime is down, PLEG is not healthy: pleg was last seen active 4h31m28.5347131s ago; threshold is 3m0s]
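The Ready condition above reports how long the PLEG has been inactive as a Go-style duration next to its threshold. As a rough diagnostic aid, the two values can be pulled out of the message and compared in seconds. This is a minimal Python sketch: the function names are hypothetical, and the regular expression assumes the message wording shown in the output above.

```python
import re

# Seconds per unit for the Go duration units handled by this sketch.
_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def go_duration_seconds(text):
    """Convert a Go-style duration such as '4h31m28.5s' to seconds."""
    total = 0.0
    # 'ms' is listed before 'm' so milliseconds are not read as minutes.
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", text):
        total += float(value) * _UNITS[unit]
    return total


def pleg_status(message):
    """Return (seconds_since_last_active, threshold_seconds), or None if absent."""
    match = re.search(
        r"last seen active (\S+) ago; threshold is ([^\]\s]+)", message)
    if match is None:
        return None
    return go_duration_seconds(match.group(1)), go_duration_seconds(match.group(2))


if __name__ == "__main__":
    msg = ("[container runtime is down, PLEG is not healthy: pleg was last seen "
           "active 4h31m28.5347131s ago; threshold is 3m0s]")
    print(pleg_status(msg))
```

For the message above, the inactive time is roughly four and a half hours against a threshold of three minutes, which is in line with the ImageGCFailed events repeating over 4h23m.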

tzifudzi commented 10 months ago

Hi @tip-dteller, apologies, I'm only seeing your messages now.

...why does it make sense to have the kubelet on windows startup type - Manual?

We have a background process in place that watches the status of the kubelet service and automatically attempts to restart it, to ensure kubelet is always running. If the startup type is causing kubelet to fail to start, that is something we would look at to make sure it doesn't happen.

..it somehow leads the node to crash.

Are you still facing this issue? Can you reproduce it? If the answers are yes, this is a valid issue to open for us to look into. CrashLoopBackOff should only affect the pod in question and should not cause the node to crash. A repro is not required, but it would help us investigate more quickly.

tip-dteller commented 10 months ago

@tzifudzi No worries, thanks for the reply :)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: iis
  labels:
    app: iis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iis
  template:
    metadata:
      labels:
        app: iis
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      terminationGracePeriodSeconds: 120
      volumes:
      - name: script
        configMap:
          name: script
      containers:
      - name: iis-server
        resources:
          requests:
            cpu: "300m"
            memory: "1Gi"
          limits:
            memory: "2Gi"
            cpu: 1
        image: mcr.microsoft.com/windows/servercore:10.0.17763.4131
        ports:
        - containerPort: 80
        volumeMounts:
        - name: script
          mountPath: "C:\\pskiller.ps1"
          subPath: pskiller.ps1
        command:
        - powershell.exe
        - "C:\\pskiller.ps1"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: script
data:
  pskiller.ps1: |
    # Define a list of non-essential services to exclude from restart
    $excludedServices = @(
        "wuauserv",  # Windows Update
        "mpssvc"    # Windows Firewall
        # Add more service names here
    )

    # Get a list of all services
    $services = Get-Service

    # Iterate through the services and restart those that are not in the excluded list
    while ($true){
      foreach ($service in $services) {
          if ($excludedServices -notcontains $service.Name -and $service.Status -ne "Running") {
              try {
                  Restart-Service -Name $service.Name
                  Write-Host "Restarted service $($service.DisplayName) (Name: $($service.Name))"
              } catch {
                  Write-Host "Failed to restart service $($service.DisplayName) (Name: $($service.Name)): $_"
              }
          }
      }
    }

tzifudzi commented 10 months ago

@tip-dteller Will you please open a support case with AWS so this can be tracked? In the meantime I will aim to take a look, but I can't promise I can get to it as early as you might want. Opening a support case helps track it and lets the EKS team prioritize looking into it.