Azure / AKS

Azure Kubernetes Service

[BUG] `microsoft-defender-publisher-ds` crash looping - AKS version 1.29.2 #4240

Open ghantasunil opened 3 weeks ago

ghantasunil commented 3 weeks ago

Describe the bug: The microsoft-defender-publisher-ds pod is crash looping after upgrading the AKS cluster to 1.29.2.

Expected behavior: The pod should start normally and not crash loop.


Additional context

Logs from the container

Fluent Bit v2.1.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

time="2024-04-25T15:39:21Z" level=info msg="Initializing OMS Client for WorkspaceID: ea060b82-aadb-4a5c-bdc3-f78b13e462e7"
time="2024-04-25T15:39:21Z" level=info msg="Registering a new Certificate"
time="2024-04-25T15:39:23Z" level=info msg="Generating serial number"
time="2024-04-25T15:39:23Z" level=info msg="Creating a new certificate"
time="2024-04-25T15:39:23Z" level=info msg="Registering a new certificate"
time="2024-04-25T15:39:24Z" level=error msg="error encountered during client initializationPost \"https://ea060b82-aadb-4a5c-bdc3-f78b13e462e7.oms.opinsights.azure.com/AgentService.svc/LinuxAgentTopologyRequest\": read tcp 10.244.4.4:59576->10.100.20.35:443: read: connection reset by peer"
panic: Error encountered during client initialization Post "https://ea060b82-aadb-4a5c-bdc3-f78b13e462e7.oms.opinsights.azure.com/AgentService.svc/LinuxAgentTopologyRequest": read tcp 10.244.4.4:59576->10.100.20.35:443: read: connection reset by peer

goroutine 17 [running, locked to thread]:
main.FLBPluginInit(0xc000002601?)
    /code/src/Rome-Detection-Tivan-Publisher/src/plugin/plugin_connector.go:101 +0x65b

JoeyC-Dev commented 3 weeks ago

I ran an AKS cluster on 1.29.2 overnight and did not see any issue (screenshots attached).

So I don't think this issue is reproducible, or at least it is not easy to reproduce.
You could try disabling and re-enabling Defender on AKS to see if that resolves the issue, but I believe you should open a ticket so that SE can help you.
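For anyone who wants to try the disable/enable suggestion, here is a minimal sketch. The function name toggle_defender and the optional config-file argument are made up for illustration; the --disable-defender, --enable-defender and --defender-config flags are the ones az aks update provides (the last one also appears in the repro script later in this thread). If the cluster uses a custom Log Analytics workspace, re-enabling without the config file may point Defender at a default workspace, so treat that as something to verify.

```shell
# Hypothetical helper: turn the Defender profile off and on again.
# Arguments: <cluster-name> <resource-group> [defender-config.json]
toggle_defender() {
  # Disable first; stop if that fails.
  az aks update -n "$1" -g "$2" --disable-defender -o none || return 1
  if [ -n "${3:-}" ]; then
    # Re-enable with an explicit workspace config file.
    az aks update -n "$1" -g "$2" --enable-defender --defender-config "$3" -o none
  else
    # Re-enable with whatever defaults the service applies.
    az aks update -n "$1" -g "$2" --enable-defender -o none
  fi
}
```

Usage would look like: toggle_defender my-aks my-rg defender.json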

adjurdjevic commented 3 weeks ago

I have the same issue on one AKS cluster, but with AKS version 1.27.7, so it isn't related to an AKS upgrade. The issue started last night after kured restarted the nodes in the cluster.

adjurdjevic commented 3 weeks ago

It seems that this issue only hits environments where an Azure Monitor Private Link Scope (AMPLS) is used for the connection to Azure Monitor; environments that don't use AMPLS run fine.

ghantasunil commented 3 weeks ago

Thanks for looking into it, @JoeyC-Dev. I understand that it is not related to an AKS upgrade, but I am able to replicate this issue on multiple clusters. I tried restarting the microsoft-defender-publisher-ds DaemonSet, but it did not help.
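For reference, restarting the DaemonSet and pulling the logs of the crashed container can be sketched as follows. The function names are hypothetical, and the kube-system namespace is an assumption taken from the kubectl command in the repro script later in this thread.

```shell
# Hypothetical helper: restart the publisher DaemonSet and wait for the rollout.
restart_publisher() {
  kubectl rollout restart daemonset/microsoft-defender-publisher-ds -n kube-system &&
  kubectl rollout status daemonset/microsoft-defender-publisher-ds -n kube-system --timeout=5m
}

# Hypothetical helper: show the last lines logged by the previous (crashed)
# container instance of a given pod. Argument: <pod-name>
previous_publisher_logs() {
  kubectl logs "$1" -n kube-system --previous --tail=50
}
```

If the rollout never completes and the previous-container logs still end in the same panic, the restart alone has not changed anything.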

All our clusters use the Azure Monitor Private Link Scope to connect to Azure Monitor. @adjurdjevic, could you please point me to the issue or article where you read about this? Also, is there any possible remedy?

adjurdjevic commented 3 weeks ago

Well, I noticed that the IP in my log points to the private endpoint of Azure Monitor, which leads to the conclusion that the problem is in the Azure Monitor Private Link Scope service. I also have clusters that don't use a PE for the connection, and they don't have this issue. The issue is still there, and I think someone from Microsoft needs to look at it. @ghantasunil, will you open a ticket?
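A rough way to repeat this check is to resolve the workspace's OMS hostname from inside the VNet and see whether it lands on a private address, as 10.100.20.35 does in the log above. This is only a sketch: both function names are made up, and it assumes getent (glibc) is available on the machine where it runs.

```shell
# Classify an IPv4 address as "private" (RFC 1918 ranges) or "public".
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) echo private ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) echo private ;;
    *) echo public ;;
  esac
}

# Resolve <workspace-guid>.oms.opinsights.azure.com and report where it points.
# Argument: <workspace-guid>
check_oms_endpoint() {
  host="$1.oms.opinsights.azure.com"
  # Take the first IPv4 address returned by the resolver.
  ip=$(getent ahostsv4 "${host}" | awk 'NR==1 {print $1}')
  if [ -z "${ip}" ]; then
    echo "${host} -> unresolved"
    return 1
  fi
  echo "${host} -> ${ip} ($(is_private_ipv4 "${ip}"))"
}
```

A "private" result means traffic goes through the private endpoint, which matches the clusters that show the crash loop in this thread.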

ghantasunil commented 3 weeks ago

Thanks @adjurdjevic. I have opened the ticket with Microsoft.

Also, @JoeyC-Dev, could you please try replicating the issue by creating an Azure Monitor Private Link Scope to connect to Azure Monitor?

JoeyC-Dev commented 3 weeks ago

Correct me if I am wrong. First, I don't think the Azure Monitor workspace plays a role here; the Log Analytics workspace (LAW) does, as Defender for Containers does not use the Azure Monitor workspace at all. So my target is the LAW resource. Normally, two LAWs are connected to AKS: one for Container Insights and another for Defender. You can check this via the az aks show command, which shows two LAW resource URIs if you enabled both Container Insights and Defender for Containers.
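As a sketch of that check, the query below pulls both workspace IDs out of az aks show in one call. The function name is hypothetical, and the two property paths are my assumption about where the current az output keeps these values, so adjust them if your output differs.

```shell
# Hypothetical helper: print the Log Analytics workspace IDs attached to a cluster.
# Arguments: <cluster-name> <resource-group>
show_law_ids() {
  # Property paths are assumptions; compare with the full `az aks show` output.
  az aks show -n "$1" -g "$2" -o json \
    --query '{containerInsights: addonProfiles.omsagent.config.logAnalyticsWorkspaceResourceID, defender: securityProfile.defender.logAnalyticsWorkspaceResourceId}'
}
```

If both values are identical, Container Insights and Defender share one workspace, which is what the repro script below sets up.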

I tested for a full day with the Container Insights workspace and did not see any issue. So I also modified the Defender workspace, and still nothing happened.

Result:

NAME                                                 READY   STATUS    RESTARTS   AGE
microsoft-defender-collector-ds-8vqsd                2/2     Running   0          9h
microsoft-defender-collector-ds-rflkw                2/2     Running   0          9h
microsoft-defender-collector-misc-649f579c5b-mpm96   1/1     Running   0          9h
microsoft-defender-publisher-ds-4k9gh                1/1     Running   0          9h
microsoft-defender-publisher-ds-rrntd                1/1     Running   0          9h

Not sure what I missed here; it could be the network plugin, the outbound type, a different way of setting up AMPLS, etc. I will share the script that I used to try to reproduce the issue:

# Install aks-preview
az extension add -n aks-preview

# Initial setup
location=eastus2
rG=my-issue-4240-29411
aks=my-aks-4240-29411
aci=my-aci-4240-29411
vnet=my-vnet-4240-29411
logAnalyticsWorkspace=my-law-4240-29411
ampls=my-ampls-4240-29411

az group create -n ${rG} -l ${location} -o none

az network vnet create -n ${vnet} -g ${rG} --address-prefix 10.0.0.0/16 --subnet-name aks --subnet-prefixes 10.0.0.0/23 -o none
az network vnet subnet create --vnet-name ${vnet} -g ${rG} -n aci --address-prefixes 10.0.2.0/23 -o none

vnet_id=$(az resource show -n ${vnet} -g ${rG} --resource-type Microsoft.Network/virtualNetworks --query id -o tsv)

# Create a Log Analytics workspace for collecting container logs
az monitor log-analytics workspace create -n ${logAnalyticsWorkspace} -g ${rG} --ingestion-access Disabled -o none
logAnalyticsWorkspace_resId=$(az resource show -n ${logAnalyticsWorkspace} -g ${rG} --namespace Microsoft.OperationalInsights --resource-type workspaces --query id -o tsv)
logAnalyticsWorkspace_guid=$(az resource show -n ${logAnalyticsWorkspace} -g ${rG} --namespace Microsoft.OperationalInsights --resource-type workspaces  --query properties.customerId -o tsv)

# Create AMPLS
az resource create -n ${ampls} -g ${rG} -l global --api-version "2021-07-01-preview" --resource-type Microsoft.Insights/privateLinkScopes --properties "{\"accessModeSettings\":{\"queryAccessMode\":\"Open\", \"ingestionAccessMode\":\"Open\"}}"
ampls_resId=$(az resource show -n ${ampls} -g ${rG} --resource-type microsoft.insights/privatelinkscopes --query id -o tsv)
az monitor private-link-scope scoped-resource create -n ${ampls} -g ${rG} --scope-name ${ampls} --linked-resource ${logAnalyticsWorkspace_resId} -o none

echo "Tutorial: https://learn.microsoft.com/en-us/azure/azure-monitor/logs/private-link-configure"
read -p "Add the private endpoint from the Azure portal (it is painful to implement with az-cli, so it is not scripted here). Make sure the subnet is set to 'aks'. Press Enter to continue..."

Manual configuration step (three screenshots). Note: proceed with the rest of the script only after this deployment has completed.

az resource patch -n ${ampls} -g ${rG} --api-version "2021-07-01-preview" --resource-type Microsoft.Insights/privateLinkScopes --properties "{\"accessModeSettings\":{\"queryAccessMode\":\"Open\", \"ingestionAccessMode\":\"PrivateOnly\"}}"

# Create an ACI for connecting to AKS later
az container create -n ${aci} -g ${rG} --image mcr.microsoft.com/azure-cli:latest --cpu 1 --memory 1 \
--subnet "${vnet_id}/subnets/aci" --command-line "/bin/sh -c 'while true; do sleep 30; done;'" --no-wait

# The creation of AKS
az aks create -n ${aks} -g ${rG} --node-vm-size Standard_A4_v2 --node-count 2 --enable-private-cluster \
--tier standard --vnet-subnet-id "${vnet_id}/subnets/aks" --enable-defender --no-ssh-key \
--network-plugin kubenet --service-cidr 192.168.2.0/23 --dns-service-ip 192.168.2.2 \
--nrg-lockdown-restriction-level ReadOnly -o none

# aks_ID=$(az resource show -n ${aks} -g ${rG} --resource-type Microsoft.ContainerService/managedClusters --query id -o tsv)

# Enable Container Insights with the private Log Analytics workspace
# Note: The steps are split here because they are split in the original tutorial.
# Fix "null" collection frequency issue
cat > dataCollectionSettings.json << EOF
{
  "interval": "1m",
  "namespaceFilteringMode": "Off",
  "enableContainerLogV2": true, 
  "streams": ["Microsoft-Perf", "Microsoft-ContainerLogV2", "Microsoft-ContainerInventory", "Microsoft-ContainerNodeInventory", "Microsoft-InsightsMetrics", "Microsoft-KubeEvents", "Microsoft-KubeMonAgentEvents", "Microsoft-KubeNodeInventory", "Microsoft-KubePodInventory", "Microsoft-KubePVInventory", "Microsoft-KubeServices"]
}
EOF
az aks enable-addons -a monitoring -n ${aks} -g ${rG} --workspace-resource-id ${logAnalyticsWorkspace_resId} --data-collection-settings dataCollectionSettings.json -o none

# Change log analytics workspace URI for Defender for Containers
cat > defender.json << EOF
{
    "logAnalyticsWorkspaceResourceId": "${logAnalyticsWorkspace_resId}"
}
EOF
az aks update -n ${aks} -g ${rG} --enable-defender --defender-config defender.json

# Enter ACI
az container exec -g ${rG} -n ${aci} --exec-command "/bin/bash"
########################################
rG=my-issue-4240-29411
aks=my-aks-4240-29411

az login
apk add kubectl
az aks get-credentials -n ${aks} -g ${rG}

# When using `-w` in ACI, it will automatically exit if `kubectl` trying to print new result. So avoid `-w`
while true; do kubectl get po -n kube-system -l app=defender; sleep 60; done;

AndrewSmithRS commented 3 weeks ago

We're experiencing the same issue across 5 separate clusters. We're also using AMPLS for the configured Log Analytics workspace. The issue started after restarting the nodes due to an image upgrade, which moved us onto version 1.0.78 of the Defender publisher. The AKS cluster version is 1.27.9.

adjurdjevic commented 3 weeks ago

If you want a "dirty" fix for the issue, just edit the private DNS entry for your xxx-xxx-xxx.privatelink.oms.opinsights.azure.com record and add the public IP address of your endpoint. @ghantasunil, do you have any info from the ticket you opened?
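Before editing anything, it helps to list what the zone currently holds so the original private IP can be restored later. A minimal sketch, assuming the private DNS zone is named privatelink.oms.opinsights.azure.com as in the record above; the function name is made up, and the resource group is whichever one holds your zone.

```shell
# Hypothetical helper: list the A records in the OMS private DNS zone.
# Argument: <resource-group-of-the-private-dns-zone>
list_oms_private_records() {
  az network private-dns record-set a list -g "$1" \
    -z privatelink.oms.opinsights.azure.com \
    --query "[].{name: name, ips: aRecords[].ipv4Address}" -o json
}
```

Saving this output to a file first makes the workaround easy to revert once a permanent fix is available.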

ghantasunil commented 2 weeks ago

@adjurdjevic here is the response from Microsoft.

The agent is crashing due to missing workspace. This issue has been detected as a bug by the team, and they are working on it to get a permanent fix. However, they have provided mitigation options to fix the issue for now.

The issue was that a POST request related to setting up the agent failed. From what I shared, it's related to this DNS .oms.opinsights.azure.com.

To fix this issue temporarily.

The approach behind this is to temporarily disable Private Link and give the agent some time to complete the setup (and the step that was failing) and then switch back to using Private Link.
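To judge when the agent has completed its setup, one option is to watch for publisher pods that are not in the Running state. The filter below is only a sketch with a made-up function name; it reads plain kubectl get po output on stdin and relies on the pod name prefix shown earlier in this thread.

```shell
# Hypothetical helper: read `kubectl get po` output on stdin and print the
# names of publisher pods whose STATUS column is anything other than Running.
unhealthy_publisher_pods() {
  awk '$1 ~ /^microsoft-defender-publisher-ds-/ && $3 != "Running" {print $1}'
}
```

Usage would look like: kubectl get po -n kube-system -l app=defender | unhealthy_publisher_pods. Empty output over several minutes suggests the publisher has stopped crash looping.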