Open ghantasunil opened 3 weeks ago
I put an AKS with 1.29.2
for a whole night and did not see any issue:
So I don't think this issue is reproduceable, or not easy to reproduce.
Maybe you can try disable and enable defender on AKS to see if the issue resolved, but I believe you should open a ticket so SE will help you.
I have same issue on one AKS cluster but with AKS version 1.27.7 so it isn't related to an AKS upgrade. Issue started last night after kured restart of nodes in cluster.
It seems that this issue only hitting environment where Azure Monitor Private Link Scope is in use for connection to Azure Monitor, environment which doesn't use AMPLS run fine.
Thanks for looking into it, @JoeyC-Dev
I understand that it is not related to an AKS upgrade, but I am able to replicate this issue on multiple clusters.
I tried restarting the microsoft-defender-publisher-ds
Daemonset, but it was of no use.
All our clusters use the Azure Monitor Private Link Scope
to connect to Azure Monitor.
@adjurdjevic, could you please point me to the issue or article where you read about this? Also, is there any possible remedy?
Well i notice that IP from my log point to Private endpoint of Azure Monitor so that lead to conclusion that problem is on Azure Monitor Private Link Scope service, and i have clusters which don't use PE for connection and they don't have this issue. Issue is still there and I think that someone from Microsoft need to look at it. @ghantasunil will you open a ticket?
Thanks @adjurdjevic. I have opened the ticket with Microsoft.
Also, @JoeyC-Dev could you please try replicating the issue by creating Azure Monitor Private Link Scope
to connect to Azure Monitor.
Correct me if I am wrong.
First, I don't think Azure Monitor workspace
plays something here but the Log Analystics workspace
, as Defender for Containers does not use that resource at all.
So my target is that LAW resource. Normally, there will be two LAW connected to AKS: one is for Container Insight, another is for Defender. You can check is via az aks show
command and you will see two LAW resource URI if you enabled both Container Insight and Defender for Containers.
I have tested for a full day on the Container Insight one and does not see any issue. So I also tried to modify the Defender one and still not seeing anything happening.
Result:
NAME READY STATUS RESTARTS AGE
microsoft-defender-collector-ds-8vqsd 2/2 Running 0 9h
microsoft-defender-collector-ds-rflkw 2/2 Running 0 9h
microsoft-defender-collector-misc-649f579c5b-mpm96 1/1 Running 0 9h
microsoft-defender-publisher-ds-4k9gh 1/1 Running 0 9h
microsoft-defender-publisher-ds-rrntd 1/1 Running 0 9h
Not sure what I missed here. Like: network plugin, outbound type, different way to set up AMPLS, etc. I will share the script that I used for attempting to reproduce the issue:
# Install aks-preview
az extension add -n aks-preview
# Initial setup
location=eastus2
rG=my-issue-4240-29411
aks=my-aks-4240-29411
aci=my-aci-4240-29411
vnet=my-vnet-4240-29411
logAnalyticsWorkspace=my-law-4240-29411
ampls=my-ampls-4240-29411
az group create -n ${rG} -l ${location} -o none
az network vnet create -n ${vnet} -g ${rG} --address-prefix 10.0.0.0/16 --subnet-name aks --subnet-prefixes 10.0.0.0/23 -o none
az network vnet subnet create --vnet-name ${vnet} -g ${rG} -n aci --address-prefixes 10.0.2.0/23 -o none
vnet_id=$(az resource show -n ${vnet} -g ${rG} --resource-type Microsoft.Network/virtualNetworks --query id -o tsv)
# Creaet a log analytics workspace for collecting container logs
az monitor log-analytics workspace create -n ${logAnalyticsWorkspace} -g ${rG} --ingestion-access Disabled -o none
logAnalyticsWorkspace_resId=$(az resource show -n ${logAnalyticsWorkspace} -g ${rG} --namespace Microsoft.OperationalInsights --resource-type workspaces --query id -o tsv)
logAnalyticsWorkspace_guid=$(az resource show -n ${logAnalyticsWorkspace} -g ${rG} --namespace Microsoft.OperationalInsights --resource-type workspaces --query properties.customerId -o tsv)
# Create AMPLS
az resource create -n ${ampls} -g ${rG} -l global --api-version "2021-07-01-preview" --resource-type Microsoft.Insights/privateLinkScopes --properties "{\"accessModeSettings\":{\"queryAccessMode\":\"Open\", \"ingestionAccessMode\":\"Open\"}}"
ampls_resId=$(az resource show -n ${ampls} -g ${rG} --resource-type microsoft.insights/privatelinkscopes --query id -o tsv)
az monitor private-link-scope scoped-resource create -n ${ampls} -g ${rG} --scope-name ${ampls} --linked-resource ${logAnalyticsWorkspace_resId} -o none
echo "Tutorial: https://learn.microsoft.com/en-us/azure/azure-monitor/logs/private-link-configure"
read -p "Add private endpoint from Azure portal. It is painful to implement them with az-cli so I don't write it. Make sure subnet is being set to 'aks'. Enter anything to continue..."
Manually setting part: (Note: already make sure only proceed to the rest part of script after completing this deployment.)
az resource patch -n ${ampls} -g ${rG} --api-version "2021-07-01-preview" --resource-type Microsoft.Insights/privateLinkScopes --properties "{\"accessModeSettings\":{\"queryAccessMode\":\"Open\", \"ingestionAccessMode\":\"PrivateOnly\"}}"
# Create an ACI for connecting to AKS later
az container create -n ${aci} -g ${rG} --image mcr.microsoft.com/azure-cli:latest --cpu 1 --memory 1 \
--subnet "${vnet_id}/subnets/aci" --command-line "/bin/sh -c 'while true; do sleep 30; done;'" --no-wait
# The creation of AKS
az aks create -n ${aks} -g ${rG} --node-vm-size Standard_A4_v2 --node-count 2 --enable-private-cluster \
--tier standard --vnet-subnet-id "${vnet_id}/subnets/aks" --enable-defender --no-ssh-key \
--network-plugin kubenet --service-cidr 192.168.2.0/23 --dns-service-ip 192.168.2.2 \
--nrg-lockdown-restriction-level ReadOnly -o none
# aks_ID=$(az resource show -n ${aks} -g ${rG} --resource-type Microsoft.ContainerService/managedClusters --query id -o tsv)
# Enable Container Insight with private log analytics workspace
# Note: The reason why I split steps are is because they are split in original tutorial.
# Fix "null" collection frequency issue
cat > dataCollectionSettings.json << EOF
{
"interval": "1m",
"namespaceFilteringMode": "Off",
"enableContainerLogV2": true,
"streams": ["Microsoft-Perf", "Microsoft-ContainerLogV2", "Microsoft-ContainerInventory", "Microsoft-ContainerNodeInventory", "Microsoft-InsightsMetrics", "Microsoft-KubeEvents", "Microsoft-KubeMonAgentEvents", "Microsoft-KubeNodeInventory", "Microsoft-KubePodInventory", "Microsoft-KubePVInventory", "Microsoft-KubeServices"]
}
EOF
az aks enable-addons -a monitoring -n ${aks} -g ${rG} --workspace-resource-id ${logAnalyticsWorkspace_resId} --data-collection-settings dataCollectionSettings.json -o none
# Change log analytics workspace URI for Defender for Containers
cat > defender.json << EOF
{
"logAnalyticsWorkspaceResourceId": "${logAnalyticsWorkspace_resId}"
}
EOF
az aks update -n ${aks} -g ${rG} --enable-defender --defender-config defender.json
# Enter ACI
az container exec -g ${rG} -n ${aci} --exec-command "/bin/bash"
########################################
rG=my-issue-4240-29411
aks=my-aks-4240-29411
az login
apk add kubectl
az aks get-credentials -n ${aks} -g ${rG}
# When using `-w` in ACI, it will automatically exit if `kubectl` trying to print new result. So avoid `-w`
while true; do kubectl get po -n kube-system -l app=defender; sleep 60; done;
We're experiencing the same issue across 5 separate clusters. We're also using AMPLS for the configured Log Analytics workspace. The issue started after restarting the nodes due to an image upgrade which moved us onto version 1.0.78 of the defender publisher. AKS cluster version is 1.27.9
If you want "dirty" fix for issue, just edit private dns entry for your xxx-xxx-xxx.privatelink.oms.opinsights.azure.com record and add public IP address of your endpoint. @ghantasunil do you have any info from ticket you open?
@adjurdjevic here is the response from Microsoft.
The agent is crashing due to missing workspace. This issue has been detected as a bug by the team, and they are working on it to get a permanent fix. However, they have provided mitigation options to fix the issue for now.
The issue was that a POST request related to setting up the agent failed. From what I shared, it's related to this DNS .oms.opinsights.azure.com.
To fix this issue temporarily.
The approach behind this is to temporarily disable Private Link and give the agent some time to complete the setup (and the step that was failing) and then switch back to using Private Link.
Describe the bug
microsoft-defender-publisher-ds
pod is crash looping after upgrading the AKS cluster to 1.29.2Expected behavior do not crash loop and restart normally.
Screenshots
Environment (please complete the following information):
2.59.0
1.29.2
Additional context
Logs from the container