Azure / azure-container-networking


[NPM] [Linux] race condition when editing/deleting NetPol #2963

Closed: huntergregory closed this issue 1 month ago

huntergregory commented 1 month ago

NOTE: starting in v1.5.36, NPM will auto-restart to mitigate this issue and other non-retriable errors when detected
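
To check whether a cluster already has that fix, one option (assuming the default azure-npm DaemonSet in kube-system on AKS; adjust names for other installs) is to read the image tag:

    # print the NPM image; a tag of v1.5.36 or above includes the auto-restart mitigation
    kubectl get daemonset azure-npm -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'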

Similar issue: #2977

EDIT (9/5/24): updated based on https://github.com/Azure/azure-container-networking/issues/2963#issuecomment-2332545759

Summary

Under this race condition, NPM is incapacitated (unable to process this or future NetworkPolicies) until it is restarted.

This race can occur when editing or deleting a NetworkPolicy with "enough" rules if the kernel is slow to process nftables/ipset changes. It seems like the issue only happens right after NPM's garbage collector marks "enough" IPSets for deletion.

Symptoms

Unexpected connectivity after editing/deleting a NetPol.

NPM logs continually show this failure:

full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 3: The set with the given name does not exist
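
To scan for this failure across all NPM Pods, a sketch assuming the default k8s-app=azure-npm Pod label in kube-system on AKS:

    # flag any NPM Pod whose logs contain the ipset restore failure
    for pod in $(kubectl get pods -n kube-system -l k8s-app=azure-npm -o name); do
      if kubectl logs -n kube-system "$pod" | grep -q 'failed to run command \[ipset restore\]'; then
        echo "$pod shows the failure"
      fi
    done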

Mitigation

Restart all NPM Pods (NPM will automatically restart in v1.5.36 and above).
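
For example, assuming the default azure-npm DaemonSet in kube-system on AKS, a rollout restart recreates every NPM Pod:

    # restart all NPM Pods and wait for them to come back up
    kubectl rollout restart daemonset azure-npm -n kube-system
    kubectl rollout status daemonset azure-npm -n kube-system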

To Avoid the Issue

Upgrade NPM to v1.5.36 or above, which automatically restarts to mitigate this issue when it is detected.

Cause

NPM translates Pod labels and namespace labels into kernel IPSets, which are referenced by kernel firewall rules. A background thread runs every 5 minutes and marks unreferenced IPSets for deletion, while NPM also listens for changes to Pods, namespaces, and NetworkPolicies. Right after "enough" IPSets are marked for deletion, if the next change is an edit or deletion of a NetworkPolicy with "enough" rules, NPM can hit the race:

1. NPM deletes the policy's firewall rules, then tries to delete both the IPSets marked for deletion and some IPSets from the policy.
2. Sometimes the kernel still thinks that firewall rules reference the policy's IPSets (hence the log line Set cannot be destroyed: it is in use by a kernel component), so after 5 attempts NPM fails to delete all the IPSets. However, it may have already deleted some of the IPSets marked for deletion.
3. The "slow kernel" condition is temporary, but when NPM retries after some time, its restore file still contains flush/destroy lines for the IPSets deleted in step 2, each of which now fails (The set with the given name does not exist). If 5+ IPSets were already deleted, NPM exhausts its 5 tries again, and it remains incapacitated (unable to process this or future NetworkPolicies) until it is restarted.

Theoretically, the garbage collection precondition is not required, although in practice, I cannot reproduce without satisfying this precondition (even with a NetworkPolicy with 240 CIDR rules).
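
The two kernel-level failure modes in the logs below can be demonstrated directly, independent of NPM. A minimal sketch, runnable as root on a scratch Linux box with ipset and iptables installed (the set name and chain here are made up for illustration):

    # 1) "in use by a kernel component": a set cannot be destroyed while a firewall rule references it
    ipset create demo-set hash:net
    iptables -A FORWARD -m set --match-set demo-set dst -j DROP
    ipset destroy demo-set   # fails: Set cannot be destroyed: it is in use by a kernel component
    iptables -D FORWARD -m set --match-set demo-set dst -j DROP
    ipset destroy demo-set   # succeeds once no rule references the set

    # 2) "The set with the given name does not exist": flushing or destroying an already-deleted
    #    set fails, which is what burns NPM's retries once 5+ sets were deleted on an earlier attempt
    ipset flush demo-set     # fails: The set with the given name does not exist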

Reproducing

Steps

  1. Add many labels to a Pod, then remove those labels.
    # build args that add 50 labels (k1=v1 ... k50=v50) and then remove them (k1- ... k50-)
    unlbls=""; lbls=""; for i in $(seq 1 50); do lbls="$lbls k$i=v$i" ; unlbls="$unlbls k${i}-" ; done
    pod=dep-a-65bcb864cd-ghfzd
    kubectl label po $pod $lbls
    sleep 3s
    kubectl label po $pod $unlbls
  2. Edit a NetworkPolicy with enough rules.
  3. Wait until NPM garbage collects the IPSets (the garbage collector runs every 5 minutes; a log check for this is shown after these steps).
  4. Edit the above NetworkPolicy again.
  5. Make sure that the "slow kernel" precondition occurs, i.e. that the following line appears:
    full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 15: Set cannot be destroyed: it is in use by a kernel component
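
To watch for steps 3 and 5, one option is to follow the NPM logs for the garbage-collection line and the destroy failure (again assuming the k8s-app=azure-npm label on AKS):

    # surfaces both the GC line ("removed N empty/unreferenced ipsets") and the destroy failure
    kubectl logs -n kube-system -l k8s-app=azure-npm -f --max-log-requests 10 \
      | grep -E 'empty/unreferenced ipsets|Set cannot be destroyed'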

Logs

I0827 18:08:49.608464       1 podController.go:237] Successfully synced 'default/dep-a-65bcb864cd-ghfzd'
I0827 18:09:06.137332       1 ipsetmanager.go:91] [IPSetManager] removed 100 empty/unreferenced ipsets, updating toDeleteCache to: map[podlabel-k1:0xc0007424a0 podlabel-k10:0xc000742250 podlabel-k10:v10:0xc0007423b0 podlabel-k11:0xc0007423d0 podlabel-k11:v11:0xc000742560 podlabel-k12:0xc0007421b0 podlabel-k12:v12:0xc000742580 podlabel-k13:0xc0007421e0 podlabel-k13:v13:0xc000742180 podlabel-k14:0xc000742590 podlabel-k14:v14:0xc000742240 podlabel-k15:0xc000742610 podlabel-k15:v15:0xc000742480 podlabel-k16:0xc000742430 podlabel-k16:v16:0xc000742370 podlabel-k17:0xc000742490 podlabel-k17:v17:0xc000742550 podlabel-k18:0xc0007426f0 podlabel-k18:v18:0xc000742530 podlabel-k19:0xc0007423a0 podlabel-k19:v19:0xc000742740 podlabel-k1:v1:0xc0007427a0 podlabel-k2:0xc000742210 podlabel-k20:0xc000742270 podlabel-k20:v20:0xc0007427c0 podlabel-k21:0xc000742620 podlabel-k21:v21:0xc000742170 podlabel-k22:0xc000742300 podlabel-k22:v22:0xc000742200 podlabel-k23:0xc0007421a0 podlabel-k23:v23:0xc0007425a0 podlabel-k24:0xc000742360 podlabel-k24:v24:0xc0007423c0 podlabel-k25:0xc000742420 podlabel-k25:v25:0xc000742600 podlabel-k26:0xc000742390 podlabel-k26:v26:0xc0007421d0 podlabel-k27:0xc000742500 podlabel-k27:v27:0xc0007426a0 podlabel-k28:0xc000742680 podlabel-k28:v28:0xc000742730 podlabel-k29:0xc000742780 podlabel-k29:v29:0xc000742400 podlabel-k2:v2:0xc000742470 podlabel-k3:0xc0007426d0 podlabel-k30:0xc000742660 podlabel-k30:v30:0xc0007425b0 podlabel-k31:0xc0007427e0 podlabel-k31:v31:0xc000742350 podlabel-k32:0xc000742520 podlabel-k32:v32:0xc0007424b0 podlabel-k33:0xc000742640 podlabel-k33:v33:0xc0007427d0 podlabel-k34:0xc0007422e0 podlabel-k34:v34:0xc0007422c0 podlabel-k35:0xc000742260 podlabel-k35:v35:0xc000742510 podlabel-k36:0xc000742450 podlabel-k36:v36:0xc0007422d0 podlabel-k37:0xc0007421c0 podlabel-k37:v37:0xc0007427b0 podlabel-k38:0xc0007425c0 podlabel-k38:v38:0xc000742690 podlabel-k39:0xc000742410 podlabel-k39:v39:0xc000742380 podlabel-k3:v3:0xc000742310 podlabel-k4:0xc0007425d0 podlabel-k40:0xc000742290 podlabel-k40:v40:0xc0007424f0 podlabel-k41:0xc000742220 podlabel-k41:v41:0xc000742230 podlabel-k42:0xc000742460 podlabel-k42:v42:0xc000742440 podlabel-k43:0xc0007422b0 podlabel-k43:v43:0xc000742540 podlabel-k44:0xc0007424e0 podlabel-k44:v44:0xc0007422f0 podlabel-k45:0xc000742320 podlabel-k45:v45:0xc000742570 podlabel-k46:0xc000742760 podlabel-k46:v46:0xc0007422a0 podlabel-k47:0xc0007423e0 podlabel-k47:v47:0xc000742340 podlabel-k48:0xc0007426e0 podlabel-k48:v48:0xc000742750 podlabel-k49:0xc0007426b0 podlabel-k49:v49:0xc0007425f0 podlabel-k4:v4:0xc0007425e0 podlabel-k5:0xc000742650 podlabel-k50:0xc000742190 podlabel-k50:v50:0xc000742330 podlabel-k5:v5:0xc0007426c0 podlabel-k6:0xc000742280 podlabel-k6:v6:0xc0007427f0 podlabel-k7:0xc000742670 podlabel-k7:v7:0xc000742770 podlabel-k8:0xc0007423f0 podlabel-k8:v8:0xc000742800 podlabel-k9:0xc0007421f0 podlabel-k9:v9:0xc000742630]
I0827 18:09:26.465244       1 dataplane.go:612] [DataPlane] Update Policy called for pps/ph-pps-sorting-server-policy
I0827 18:09:26.465268       1 dataplane.go:545] [DataPlane] Remove Policy called for pps/ph-pps-sorting-server-policy
I0827 18:09:26.465334       1 chain-management_linux.go:366] Executing iptables command with args [-w 60 -D AZURE-NPM-INGRESS -j AZURE-NPM-INGRESS-2256742151 -m set --match-set azure-npm-4131938257 dst -m set --match-set azure-npm-3027006654 dst -m comment --comment INGRESS-POLICY-pps/ph-pps-sorting-server-policy-TO-podlabel-app:ph-pps-sorting-server-AND-ns-pps-IN-ns-pps]
I0827 18:09:26.468837       1 chain-management_linux.go:366] Executing iptables command with args [-w 60 -D AZURE-NPM-EGRESS -j AZURE-NPM-EGRESS-2256742151 -m set --match-set azure-npm-4131938257 src -m set --match-set azure-npm-3027006654 src -m comment --comment EGRESS-POLICY-pps/ph-pps-sorting-server-policy-FROM-podlabel-app:ph-pps-sorting-server-AND-ns-pps-IN-ns-pps]
I0827 18:09:26.499592       1 restore.go:188] running this restore command: [iptables-nft-restore -w 60 -T filter --noflush]
I0827 18:09:26.501821       1 dataplane.go:330] [DataPlane] [ApplyDataPlane] [APPLY-DP] starting to apply ipsets
I0827 18:09:26.501937       1 ipsetmanager.go:455] [IPSetManager] dirty caches. toAddUpdateCache: to create: [], to update: [], toDeleteCache: map[cidr-ph-pps-sorting-server-policy-in-ns-pps-0-10OUT:0xc0006f2460 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-11OUT:0xc0006f24b0 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-12OUT:0xc0006f2500 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-13OUT:0xc0006f2550 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-2OUT:0xc0006f21e0 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-3OUT:0xc0006f2230 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-4IN:0xc0006f2080 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-4OUT:0xc0006f2280 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-5IN:0xc0006f20d0 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-5OUT:0xc0006f22d0 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-6IN:0xc0006f2120 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-6OUT:0xc0006f2320 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-7IN:0xc0006f2170 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-7OUT:0xc0006f2370 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-8OUT:0xc0006f23c0 cidr-ph-pps-sorting-server-policy-in-ns-pps-0-9OUT:0xc0006f2410 podlabel-app:ph-pps-sorting-client:0xc0006f2030 podlabel-app:ph-pps-sorting-server:0xc0006f25a0 podlabel-k1:0xc0007424a0 podlabel-k10:0xc000742250 podlabel-k10:v10:0xc0007423b0 podlabel-k11:0xc0007423d0 podlabel-k11:v11:0xc000742560 podlabel-k12:0xc0007421b0 podlabel-k12:v12:0xc000742580 podlabel-k13:0xc0007421e0 podlabel-k13:v13:0xc000742180 podlabel-k14:0xc000742590 podlabel-k14:v14:0xc000742240 podlabel-k15:0xc000742610 podlabel-k15:v15:0xc000742480 podlabel-k16:0xc000742430 podlabel-k16:v16:0xc000742370 podlabel-k17:0xc000742490 podlabel-k17:v17:0xc000742550 podlabel-k18:0xc0007426f0 podlabel-k18:v18:0xc000742530 podlabel-k19:0xc0007423a0 podlabel-k19:v19:0xc000742740 podlabel-k1:v1:0xc0007427a0 podlabel-k2:0xc000742210 podlabel-k20:0xc000742270 podlabel-k20:v20:0xc0007427c0 podlabel-k21:0xc000742620 podlabel-k21:v21:0xc000742170 podlabel-k22:0xc000742300 podlabel-k22:v22:0xc000742200 podlabel-k23:0xc0007421a0 podlabel-k23:v23:0xc0007425a0 podlabel-k24:0xc000742360 podlabel-k24:v24:0xc0007423c0 podlabel-k25:0xc000742420 podlabel-k25:v25:0xc000742600 podlabel-k26:0xc000742390 podlabel-k26:v26:0xc0007421d0 podlabel-k27:0xc000742500 podlabel-k27:v27:0xc0007426a0 podlabel-k28:0xc000742680 podlabel-k28:v28:0xc000742730 podlabel-k29:0xc000742780 podlabel-k29:v29:0xc000742400 podlabel-k2:v2:0xc000742470 podlabel-k3:0xc0007426d0 podlabel-k30:0xc000742660 podlabel-k30:v30:0xc0007425b0 podlabel-k31:0xc0007427e0 podlabel-k31:v31:0xc000742350 podlabel-k32:0xc000742520 podlabel-k32:v32:0xc0007424b0 podlabel-k33:0xc000742640 podlabel-k33:v33:0xc0007427d0 podlabel-k34:0xc0007422e0 podlabel-k34:v34:0xc0007422c0 podlabel-k35:0xc000742260 podlabel-k35:v35:0xc000742510 podlabel-k36:0xc000742450 podlabel-k36:v36:0xc0007422d0 podlabel-k37:0xc0007421c0 podlabel-k37:v37:0xc0007427b0 podlabel-k38:0xc0007425c0 podlabel-k38:v38:0xc000742690 podlabel-k39:0xc000742410 podlabel-k39:v39:0xc000742380 podlabel-k3:v3:0xc000742310 podlabel-k4:0xc0007425d0 podlabel-k40:0xc000742290 podlabel-k40:v40:0xc0007424f0 podlabel-k41:0xc000742220 podlabel-k41:v41:0xc000742230 podlabel-k42:0xc000742460 podlabel-k42:v42:0xc000742440 podlabel-k43:0xc0007422b0 podlabel-k43:v43:0xc000742540 podlabel-k44:0xc0007424e0 podlabel-k44:v44:0xc0007422f0 podlabel-k45:0xc000742320 podlabel-k45:v45:0xc000742570 podlabel-k46:0xc000742760 podlabel-k46:v46:0xc0007422a0 podlabel-k47:0xc0007423e0 
podlabel-k47:v47:0xc000742340 podlabel-k48:0xc0007426e0 podlabel-k48:v48:0xc000742750 podlabel-k49:0xc0007426b0 podlabel-k49:v49:0xc0007425f0 podlabel-k4:v4:0xc0007425e0 podlabel-k5:0xc000742650 podlabel-k50:0xc000742190 podlabel-k50:v50:0xc000742330 podlabel-k5:v5:0xc0007426c0 podlabel-k6:0xc000742280 podlabel-k6:v6:0xc0007427f0 podlabel-k7:0xc000742670 podlabel-k7:v7:0xc000742770 podlabel-k8:0xc0007423f0 podlabel-k8:v8:0xc000742800 podlabel-k9:0xc0007421f0 podlabel-k9:v9:0xc000742630]
I0827 18:09:26.502341       1 restore.go:188] running this restore command: [ipset restore]
I0827 18:09:26.506396       1 restore.go:299] continuing after line 121 for command [ipset restore]
2024/08/27 18:09:26 [1] skipping destroy line for set cidr-ph-pps-sorting-server-policy-in-ns-pps-0-13OUT since the set is in use by a kernel component
2024/08/27 18:09:26 [1] error: on try number 1, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-X azure-npm-106073494]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 121: Set cannot be destroyed: it is in use by a kernel component
]]
I0827 18:09:26.506743       1 restore.go:188] running this restore command: [ipset restore]
2024/08/27 18:09:26 [1] skipping destroy line for set cidr-ph-pps-sorting-server-policy-in-ns-pps-0-7IN since the set is in use by a kernel component
I0827 18:09:26.508560       1 restore.go:299] continuing after line 4 for command [ipset restore]
2024/08/27 18:09:26 [1] error: on try number 2, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-X azure-npm-3952978400]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 4: Set cannot be destroyed: it is in use by a kernel component
]]
I0827 18:09:26.508753       1 restore.go:188] running this restore command: [ipset restore]
2024/08/27 18:09:26 [1] skipping destroy line for set cidr-ph-pps-sorting-server-policy-in-ns-pps-0-10OUT since the set is in use by a kernel component
2024/08/27 18:09:26 [1] error: on try number 3, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-X azure-npm-62326817]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 4: Set cannot be destroyed: it is in use by a kernel component
]]
I0827 18:09:26.510477       1 restore.go:299] continuing after line 4 for command [ipset restore]
I0827 18:09:26.510654       1 restore.go:188] running this restore command: [ipset restore]
I0827 18:09:26.512173       1 restore.go:299] continuing after line 11 for command [ipset restore]
I0827 18:09:26.512275       1 restore.go:188] running this restore command: [ipset restore]
E0827 18:09:26.513844       1 networkPolicyController.go:195] error syncing 'pps/ph-pps-sorting-server-policy': [syncNetPol] error due to  [syncAddAndUpdateNetPol] Error: failed to update translated NPMNetworkPolicy into Dataplane due to [DataPlane] error while updating policy: [DataPlane] [APPLY-DP] error while applying IPSets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 15: Set cannot be destroyed: it is in use by a kernel component
], requeuing
2024/08/27 18:09:26 [1] skipping destroy line for set cidr-ph-pps-sorting-server-policy-in-ns-pps-0-4IN since the set is in use by a kernel component
2024/08/27 18:09:26 [1] error: on try number 4, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-X azure-npm-2055821385]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 11: Set cannot be destroyed: it is in use by a kernel component
]]
2024/08/27 18:09:26 [1] error: failed to apply ipsets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 15: Set cannot be destroyed: it is in use by a kernel component
]
2024/08/27 18:09:26 [1] syncNetPol error due to error syncing 'pps/ph-pps-sorting-server-policy': [syncNetPol] error due to  [syncAddAndUpdateNetPol] Error: failed to update translated NPMNetworkPolicy into Dataplane due to [DataPlane] error while updating policy: [DataPlane] [APPLY-DP] error while applying IPSets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 15: Set cannot be destroyed: it is in use by a kernel component
], requeuing
I0827 18:09:26.519299       1 dataplane.go:612] [DataPlane] Update Policy called for pps/ph-pps-sorting-server-policy
I0827 18:09:26.519318       1 dataplane.go:615] [DataPlane] Policy pps/ph-pps-sorting-server-policy is not found.
I0827 18:09:26.519324       1 dataplane.go:394] [DataPlane] Add Policy called for pps/ph-pps-sorting-server-policy
I0827 18:09:26.519329       1 types.go:214] [DataPlane] enqueuing policy pps/ph-pps-sorting-server-policy in netPolQueue
I0827 18:09:26.519334       1 dataplane.go:408] [DataPlane] [ADD-NETPOL] new pending netpol count: 1
I0827 18:09:26.519357       1 networkPolicyController.go:191] Successfully synced 'pps/ph-pps-sorting-server-policy'
I0827 18:09:26.636427       1 dataplane.go:421] [DataPlane] adding policies [0xc0001ad970]
I0827 18:09:26.636743       1 dataplane.go:330] [DataPlane] [ApplyDataPlane] [ADD-NETPOL] starting to apply ipsets
I0827 18:09:26.636946       1 ipsetmanager.go:455] [IPSetManager] dirty caches. toAddUpdateCache: to create: [cidr-ph-pps-sorting-server-policy-in-ns-pps-0-14OUT: &{membersToAdd:map[10.103.2.0/23:{}] membersToDelete:map[]}], to update: [cidr-ph-pps-sorting-server-policy-in-ns-pps-0-5OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-7OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-11OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-2OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-4IN: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-6OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-9OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-12OUT: &{membersToAdd:map[] membersToDelete:map[]},podlabel-app:ph-pps-sorting-server: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-5IN: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-7IN: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-4OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-8OUT: &{membersToAdd:map[] membersToDelete:map[]},podlabel-app:ph-pps-sorting-client: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-3OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-10OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-13OUT: &{membersToAdd:map[] membersToDelete:map[]},cidr-ph-pps-sorting-server-policy-in-ns-pps-0-6IN: &{membersToAdd:map[] membersToDelete:map[]}], toDeleteCache: map[podlabel-k1:0xc0007424a0 podlabel-k10:0xc000742250 podlabel-k10:v10:0xc0007423b0 podlabel-k11:0xc0007423d0 podlabel-k11:v11:0xc000742560 podlabel-k12:0xc0007421b0 podlabel-k12:v12:0xc000742580 podlabel-k13:0xc0007421e0 podlabel-k13:v13:0xc000742180 podlabel-k14:0xc000742590 podlabel-k14:v14:0xc000742240 podlabel-k15:0xc000742610 podlabel-k15:v15:0xc000742480 podlabel-k16:0xc000742430 podlabel-k16:v16:0xc000742370 podlabel-k17:0xc000742490 podlabel-k17:v17:0xc000742550 podlabel-k18:0xc0007426f0 podlabel-k18:v18:0xc000742530 podlabel-k19:0xc0007423a0 podlabel-k19:v19:0xc000742740 podlabel-k1:v1:0xc0007427a0 podlabel-k2:0xc000742210 podlabel-k20:0xc000742270 podlabel-k20:v20:0xc0007427c0 podlabel-k21:0xc000742620 podlabel-k21:v21:0xc000742170 podlabel-k22:0xc000742300 podlabel-k22:v22:0xc000742200 podlabel-k23:0xc0007421a0 podlabel-k23:v23:0xc0007425a0 podlabel-k24:0xc000742360 podlabel-k24:v24:0xc0007423c0 podlabel-k25:0xc000742420 podlabel-k25:v25:0xc000742600 podlabel-k26:0xc000742390 podlabel-k26:v26:0xc0007421d0 podlabel-k27:0xc000742500 podlabel-k27:v27:0xc0007426a0 podlabel-k28:0xc000742680 podlabel-k28:v28:0xc000742730 podlabel-k29:0xc000742780 podlabel-k29:v29:0xc000742400 podlabel-k2:v2:0xc000742470 podlabel-k3:0xc0007426d0 podlabel-k30:0xc000742660 podlabel-k30:v30:0xc0007425b0 podlabel-k31:0xc0007427e0 podlabel-k31:v31:0xc000742350 podlabel-k32:0xc000742520 podlabel-k32:v32:0xc0007424b0 podlabel-k33:0xc000742640 podlabel-k33:v33:0xc0007427d0 podlabel-k34:0xc0007422e0 podlabel-k34:v34:0xc0007422c0 podlabel-k35:0xc000742260 podlabel-k35:v35:0xc000742510 
podlabel-k36:0xc000742450 podlabel-k36:v36:0xc0007422d0 podlabel-k37:0xc0007421c0 podlabel-k37:v37:0xc0007427b0 podlabel-k38:0xc0007425c0 podlabel-k38:v38:0xc000742690 podlabel-k39:0xc000742410 podlabel-k39:v39:0xc000742380 podlabel-k3:v3:0xc000742310 podlabel-k4:0xc0007425d0 podlabel-k40:0xc000742290 podlabel-k40:v40:0xc0007424f0 podlabel-k41:0xc000742220 podlabel-k41:v41:0xc000742230 podlabel-k42:0xc000742460 podlabel-k42:v42:0xc000742440 podlabel-k43:0xc0007422b0 podlabel-k43:v43:0xc000742540 podlabel-k44:0xc0007424e0 podlabel-k44:v44:0xc0007422f0 podlabel-k45:0xc000742320 podlabel-k45:v45:0xc000742570 podlabel-k46:0xc000742760 podlabel-k46:v46:0xc0007422a0 podlabel-k47:0xc0007423e0 podlabel-k47:v47:0xc000742340 podlabel-k48:0xc0007426e0 podlabel-k48:v48:0xc000742750 podlabel-k49:0xc0007426b0 podlabel-k49:v49:0xc0007425f0 podlabel-k4:v4:0xc0007425e0 podlabel-k5:0xc000742650 podlabel-k50:0xc000742190 podlabel-k50:v50:0xc000742330 podlabel-k5:v5:0xc0007426c0 podlabel-k6:0xc000742280 podlabel-k6:v6:0xc0007427f0 podlabel-k7:0xc000742670 podlabel-k7:v7:0xc000742770 podlabel-k8:0xc0007423f0 podlabel-k8:v8:0xc000742800 podlabel-k9:0xc0007421f0 podlabel-k9:v9:0xc000742630]
I0827 18:09:26.637506       1 restore.go:188] running this restore command: [ipset restore]
I0827 18:09:26.641225       1 restore.go:304] continuing after line 24 and aborting section [delete-podlabel-k36] for command [ipset restore]
I0827 18:09:26.641840       1 ipsetmanager_linux.go:690] skipping flush and upcoming destroy for set podlabel-k36 since the set doesn't exist
2024/08/27 18:09:26 [1] error: on try number 1, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-F azure-npm-4013374983]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 24: The set with the given name does not exist
]]
I0827 18:09:26.642392       1 restore.go:188] running this restore command: [ipset restore]
2024/08/27 18:09:26 [1] error: on try number 2, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-F azure-npm-1748249323]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 1: The set with the given name does not exist
]]
I0827 18:09:26.644670       1 restore.go:304] continuing after line 1 and aborting section [delete-podlabel-k21] for command [ipset restore]
I0827 18:09:26.644690       1 ipsetmanager_linux.go:690] skipping flush and upcoming destroy for set podlabel-k21 since the set doesn't exist
I0827 18:09:26.644784       1 restore.go:188] running this restore command: [ipset restore]
I0827 18:09:26.646687       1 restore.go:304] continuing after line 6 and aborting section [delete-podlabel-k17] for command [ipset restore]
I0827 18:09:26.646712       1 ipsetmanager_linux.go:690] skipping flush and upcoming destroy for set podlabel-k17 since the set doesn't exist
I0827 18:09:26.646788       1 restore.go:188] running this restore command: [ipset restore]
2024/08/27 18:09:26 [1] error: on try number 3, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-F azure-npm-3728449650]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 6: The set with the given name does not exist
]]
I0827 18:09:26.651890       1 restore.go:304] continuing after line 4 and aborting section [delete-podlabel-k24:v24] for command [ipset restore]
I0827 18:09:26.652173       1 ipsetmanager_linux.go:690] skipping flush and upcoming destroy for set podlabel-k24:v24 since the set doesn't exist
2024/08/27 18:09:26 [1] error: on try number 4, failed to run command [ipset restore]. Rerunning with updated file. err: [line-number error for line [-F azure-npm-3513573922]: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 4: The set with the given name does not exist
]]
I0827 18:09:26.653474       1 restore.go:188] running this restore command: [ipset restore]
2024/08/27 18:09:26 [1] error: failed to apply ipsets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 3: The set with the given name does not exist
]
E0827 18:09:26.656098       1 dataplane.go:431] [DataPlane] [BACKGROUND] failed to add policies. will retry one policy at a time. err: [DataPlane] [ADD-NETPOL] error while applying IPSets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 3: The set with the given name does not exist
]
2024/08/27 18:09:26 [1] [DataPlane] [BACKGROUND] failed to add policies. err: [DataPlane] [ADD-NETPOL] error while applying IPSets: ipset restore failed when applying ipsets: Operation [RunCommandWithFile] failed with error code [999], full cmd [], full error after 5 tries, failed to run command [ipset restore] with error: error running command [ipset restore] with err [exit status 1] and stdErr [ipset v7.5: Error in line 3: The set with the given name does not exist
]
...
huntergregory commented 1 month ago

This issue can actually occur for a NetPol without CIDR rules. It is most likely to occur for a NetPol with CIDR rules or matchExpressions, yet it could happen for other IPSets if they are not associated with any Pod IP.
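
For illustration, a hypothetical CIDR-heavy policy shaped like the one in the logs above (the namespace and pod label mirror the logs; the policy name and CIDRs are made up), which NPM translates into one IPSet per ipBlock:

    cat <<'EOF' | kubectl apply -f -
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: cidr-heavy-policy
      namespace: pps
    spec:
      podSelector:
        matchLabels:
          app: ph-pps-sorting-server
      policyTypes:
      - Egress
      egress:
      - to:
        - ipBlock:
            cidr: 10.103.2.0/23
        - ipBlock:
            cidr: 10.103.4.0/23
        - ipBlock:
            cidr: 10.103.6.0/23
    EOF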