tip-dteller opened this issue 1 year ago
Hello @tip-dteller,
Support for containerd 1.6.19 in EKS Windows AMIs is a work in progress. An EKS Windows AMI with containerd 1.6.18 will be released in either October or November. After that release is complete, we will prioritize releasing containerd 1.6.19.
So, as it stands right now, the two viable options are the ones I have already listed?
Hi @tip-dteller, can you clarify what you mean by the second option?
We recommend using the EKS optimized AMI as the base AMI and a PowerShell script to install the new containerd. Please note that additional changes might be required in the AMI to make the new containerd compatible.
Hi @kylzhng1, Option 1 would be exactly as you proposed: start from the EKS Optimized AMI (I don't see any other option here, as it's already tailored for EKS) and adjust the containerd version until AWS officially supports it.
Option 2:
Following that logic, we ran some tests. We plan to implement a Kubernetes lifecycle hook, specifically a preStop hook. Why? We ran a test on a current 1.24 EKS cluster with containerd 1.6.6 (a Karpenter-managed node) and saw the following happen to the pod:
We can summarize it as: pod runs -> pod is killed -> hook prevents termination from happening -> pod continues to run, but in Terminating state -> hook completes -> pod dies.
So we used this in our favor: we set high sleep values and have the app signal when it is done with Main() and exits. The lifecycle hook watches that Main() process, and if it finishes before the timeout runs out, the hook shuts the pod down early.
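A minimal sketch of the preStop pattern described above (the container name, image, poll interval, and the "done" marker file the app would write on exit are all hypothetical, not taken from the author's setup):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-app   # hypothetical name
spec:
  terminationGracePeriodSeconds: 300
  containers:
  - name: app
    image: myregistry/windows-app:latest   # hypothetical image
    lifecycle:
      preStop:
        exec:
          command:
          - powershell.exe
          - -Command
          # Hold termination open until the app drops its "done" marker
          # (or we approach the grace period), then let the pod die.
          - >-
            for ($i = 0; $i -lt 280; $i++) {
              if (Test-Path C:\app\done.marker) { exit 0 }
              Start-Sleep -Seconds 1
            }
```

The key point is that the kubelet waits for the preStop hook to finish (up to terminationGracePeriodSeconds) before sending the kill, which is what keeps the pod in Terminating state while the app drains.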
$ProgressPreference = "SilentlyContinue"

# Ensure the required module is installed
if (-not (Get-Module -ListAvailable -Name Microsoft.PowerShell.Archive)) {
    Install-Module Microsoft.PowerShell.Archive -Scope CurrentUser -Force
}

# Ensure the required utility is available
if (-not (Get-Command -Name 'tar' -ErrorAction SilentlyContinue)) {
    throw "The 'tar' command is not available. Please ensure it's installed."
}

# Create the tempcontainerd folder if it doesn't exist
$destFolder = "c:\tempcontainerd"
if (-not (Test-Path $destFolder)) {
    New-Item -Path $destFolder -ItemType Directory
}

# Download the .tar.gz file
$url = "https://github.com/containerd/containerd/releases/download/v1.6.24/cri-containerd-cni-1.6.24-windows-amd64.tar.gz"
$downloadPath = Join-Path $destFolder "cri-containerd-cni-1.6.24-windows-amd64.tar.gz"
Invoke-WebRequest -Uri $url -OutFile $downloadPath

# Extract the .tar.gz file
tar -zxvf $downloadPath -C $destFolder

# # Take ownership
# takeown /F "C:\Program Files\containerd\containerd-shim-runhcs-v1.exe"
# # Grant permissions to the current user
# icacls "C:\Program Files\containerd\containerd-shim-runhcs-v1.exe"

# Stop the running shim and services before replacing the binaries
Get-Process -Name "containerd-shim-runhcs-v1"
Stop-Process -Name containerd-shim-runhcs-v1 -Force
Stop-Service WinHttpAutoProxySvc
Stop-Service containerd

# Copy the extracted contents to C:\Program Files\containerd and restart
Copy-Item -Path "$destFolder\*" -Destination "c:\Program Files\containerd" -Recurse -Force
Start-Service WinHttpAutoProxySvc
Start-Service containerd
Stopping the service seemed to be the main issue with doing this inline; it would probably work better as part of custom image creation, so the machine has had a chance to reboot.
Not sure if this actually fixes your issue. I was trying to solve a different one 😂
The script offered by @ChrisMcKee could help as a workaround for upgrading containerd 'on the fly', that is, by stopping the service and manually upgrading with freshly fetched binaries. I haven't tested it to confirm.
In the interim, the EKS Windows team has upgraded containerd to 1.6.18. See https://docs.aws.amazon.com/eks/latest/userguide/eks-ami-versions-windows.html.
@tzifudzi I see the change, thanks!! And I wonder: why does it make sense for the kubelet Windows service to have startup type Manual? We had a node stop working for an unknown reason, and kubelet never came back up.
PS C:\Windows\system32> (get-service kubelet).startType
Manual
There's also this at startup, and it doesn't appear every time a Windows node spawns.
Warning ContainerGCFailed 58m kubelet rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing open //./pipe/containerd-containerd: The system cannot find the file specified."
@tzifudzi - We have observed over time that a Windows application entering a "CrashLoopBackOff" cycle somehow leads the node to crash.
Type     Reason         Age                      From     Message
----     ------         ----                     ----     -------
Warning  ImageGCFailed  3m38s (x53 over 4h23m)   kubelet  rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing open //./pipe/containerd-containerd: The system cannot find the file specified."
We didn't observe this behavior with the earlier containerd issue, but it may have been there.
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 08 Nov 2023 11:57:29 +0200 Wed, 08 Nov 2023 00:06:43 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 08 Nov 2023 11:57:29 +0200 Wed, 08 Nov 2023 00:06:43 +0200 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 08 Nov 2023 11:57:29 +0200 Wed, 08 Nov 2023 00:06:43 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Wed, 08 Nov 2023 11:57:29 +0200 Wed, 08 Nov 2023 07:26:32 +0200 KubeletNotReady [container runtime is down, PLEG is not healthy: pleg was last seen active 4h31m28.5347131s ago; threshold is 3m0s]
Hi @tip-dteller, apologies only seeing your messages now.
...why does it make sense to have the kubelet on windows startup type - Manual?
We have a background process in place that watches the status of the kubelet service and automatically attempts to restart it, to ensure kubelet is always running. If the startup type is leading to failures in kubelet starting up, this is something we would look at to ensure it doesn't happen.
...it somehow leads the node to crash.
Are you still facing this issue? Can you reproduce it? If the answers are yes, this is a valid issue to open for us to look into. CrashLoopBackOff should only affect the pod in question and not cause the node to crash. A repro is not required, but it helps us investigate more quickly.
@tzifudzi No worries, thanks for the reply :)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iis
  labels:
    app: iis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iis
  template:
    metadata:
      labels:
        app: iis
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      terminationGracePeriodSeconds: 120
      volumes:
        - name: script
          configMap:
            name: script
      containers:
        - name: iis-server
          resources:
            requests:
              cpu: "300m"
              memory: "1Gi"
            limits:
              memory: "2Gi"
              cpu: 1
          image: mcr.microsoft.com/windows/servercore:10.0.17763.4131
          ports:
            - containerPort: 80
          volumeMounts:
            - name: script
              mountPath: "C:\\pskiller.ps1"
              subPath: pskiller.ps1
          command:
            - powershell.exe
            - "C:\\pskiller.ps1"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: script
data:
  pskiller.ps1: |
    # Define a list of non-essential services to exclude from restart
    $excludedServices = @(
      "wuauserv", # Windows Update
      "mpssvc"    # Windows Firewall
      # Add more service names here
    )
    # Loop forever, restarting any non-excluded service that is not running.
    # Get-Service is called inside the loop so each pass sees fresh statuses
    # (ServiceController objects cache Status until refreshed).
    while ($true) {
      $services = Get-Service
      foreach ($service in $services) {
        if ($excludedServices -notcontains $service.Name -and $service.Status -ne "Running") {
          try {
            Restart-Service -Name $service.Name
            Write-Host "Restarted service $($service.DisplayName) (Name: $($service.Name))"
          } catch {
            Write-Host "Failed to restart service $($service.DisplayName) (Name: $($service.Name)): $_"
          }
        }
      }
    }
@tip-dteller Will you please open a support case with AWS so this can be tracked? In the meantime I will aim to take a look, but I can't promise I can look at it as early as you might want. Opening a support case helps track it and have the EKS team prioritize looking into it.
Tell us about your request: Update EKS Managed Windows nodes to use a more up-to-date version of containerd.
Which service(s) is this request for? EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Background: We have a Windows solution that we migrated from EC2 machines into a container format. The solution requires a graceful shutdown process, as is common in any application scenario. Windows handles signal termination differently from Linux: the equivalent of SIGTERM on Windows is CTRL_SHUTDOWN_EVENT. For more detail: our application is rather old and is built on .NET Framework 4.7.2, so our container is based on Windows Server 2019 Core.
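For contrast, here is the Linux-side pattern the paragraph above alludes to: a POSIX process can trap SIGTERM and finish its work cleanly, which has no direct Windows-container equivalent short of handling CTRL_SHUTDOWN_EVENT. This is an illustrative Python sketch, not the author's application:

```python
import os
import signal

shutting_down = False

def on_sigterm(signum, frame):
    # Kubernetes sends SIGTERM when a pod is terminated; set a flag so
    # the main loop can finish in-flight work and exit cleanly.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate the kubelet delivering SIGTERM to this process (POSIX only).
os.kill(os.getpid(), signal.SIGTERM)

print("graceful shutdown initiated:", shutting_down)
```

On Windows containers there is no SIGTERM to trap; the runtime delivers CTRL_SHUTDOWN_EVENT instead, and how long the process is given to react is governed by the registry timeout discussed below.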
The issue: We have an application that needs to run to completion and then shut down safely. Per the Kubernetes documentation, you may use the field 'terminationGracePeriodSeconds', which should wait N seconds before forcibly killing the container/pod, unless the process inside exits before the given time. This field is honored by the kubelet and Kubernetes, but has no effect on the internals of the container.
Hence we ventured out and found that you have to set a registry key to make Windows wait longer before shutting down the process. This key defaults to 5 seconds, meaning that no matter what you configure, unless it is changed before container startup, Windows will shut the process down after 5 seconds. You can read about it in this comment: https://github.com/moby/moby/issues/25982#issuecomment-426441183
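A sketch of baking such a key into the image at build time. This assumes, per the linked moby comment, that the key is WaitToKillServiceTimeout under HKLM\SYSTEM\CurrentControlSet\Control; verify the exact key name against that comment before relying on it:

```dockerfile
# escape=`
FROM mcr.microsoft.com/windows/servercore:ltsc2019

# Raise the shutdown timeout from the 5000 ms default to 30 s so the
# process gets time to handle CTRL_SHUTDOWN_EVENT before being killed.
# Key name taken from the moby issue comment linked above (assumption).
RUN reg add "HKLM\SYSTEM\CurrentControlSet\Control" `
    /v WaitToKillServiceTimeout /t REG_SZ /d 30000 /f
```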
So we set that key to a different value, say 30000 ms, which is 30 seconds. However, we discovered that it's not honored by EKS on any version.
Here's why. Following this issue - https://github.com/microsoft/Windows-Containers/issues/164 - it is noted that hcsshim, the library used by containerd, did not respect that shutdown setting; support was implemented in hcsshim 0.9.7 and baked into containerd 1.6 and 1.7. You can read about it here.
Digging deeper, it was specifically added at earliest in containerd version 1.6.19.
EKS Windows Core 2019 and 2022 support only version 1.6.6, as specified here.
In essence, this means that EKS doesn't natively support graceful shutdown for Windows containers.
Are you currently working around this issue?