akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

Tainted nodes are bidding #253

Open 88plug opened 4 days ago

88plug commented 4 days ago

Describe the bug: Some providers have tainted nodes in their clusters, and the Akash provider still counts those nodes when bidding, even when the cluster's schedulable nodes are at capacity.

[Warning] [FailedScheduling] [Pod] 0/6 nodes are available: 1 Insufficient memory, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 4 Insufficient cpu. preemption: 0/6 nodes are available: 1 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod..
[FailedScheduling] [Pod] 0/4 nodes are available: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 Insufficient cpu. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.

I have found at least two kinds of taint that are not being respected:

  1. {node-role.kubernetes.io/control-plane: }
  2. {CriticalAddonsOnly: true}
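To make the expected check concrete, here is a minimal sketch of the Kubernetes taint/toleration matching rules, applied to the two taints above. The `Taint` and `Toleration` classes are simplified, hypothetical stand-ins for the Kubernetes API types, and the `NoSchedule` effect on both taints is an assumption, since the scheduler messages above print only the key and value.

```python
from dataclasses import dataclass
from typing import List


# Simplified, hypothetical stand-ins for the Kubernetes Taint/Toleration types.
@dataclass(frozen=True)
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"  # assumed; not shown in the scheduler messages


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes treats an unset operator as Equal
    value: str = ""
    effect: str = ""  # an empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Mirror the Kubernetes toleration-matching rules."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False  # an empty key (with Exists) matches every key
    if tol.operator == "Exists":
        return True
    if tol.operator in ("", "Equal"):
        return tol.value == taint.value
    return False


def blocks_scheduling(taint: Taint, tolerations: List[Toleration]) -> bool:
    """True if the taint keeps a pod with these tolerations off the node."""
    if taint.effect not in ("NoSchedule", "NoExecute"):
        return False  # PreferNoSchedule is only a soft preference
    return not any(tolerates(t, taint) for t in tolerations)


control_plane = Taint(key="node-role.kubernetes.io/control-plane")
critical_addons = Taint(key="CriticalAddonsOnly", value="true")

tenant_tolerations: List[Toleration] = []  # no tolerations for either taint
print(blocks_scheduling(control_plane, tenant_tolerations))    # True
print(blocks_scheduling(critical_addons, tenant_tolerations))  # True

addon_tolerations = [Toleration(key="CriticalAddonsOnly", operator="Exists")]
print(blocks_scheduling(critical_addons, addon_tolerations))   # False
```

A tenant pod without matching tolerations can never land on either node, so neither node's free capacity should back a bid for it.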

Because of this bug, the deployment never starts, and within 5 minutes the provider closes it automatically.

To Reproduce: Deploy workloads until every node on a provider is full, and you will still get a bid backed by the tainted node's capacity.

Observed in the wild on validatornode.com and various other providers.

Expected behavior: Providers should not bid on workloads based on capacity from nodes that are tainted to prevent deployment/scheduling (for example, with a NoSchedule effect).
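One way to get this behavior is to drop tainted nodes before checking whether any node can fit the workload. The sketch below is hypothetical: `Node`, `can_bid`, and the capacity numbers are invented for illustration and are not the Akash provider's actual inventory code. It assumes a single-pod workload and that tenant pods tolerate none of the node's taints.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# Hypothetical, simplified node inventory; not the Akash provider's real types.
@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_memory_bytes: int
    taints: List[Tuple[str, str, str]] = field(default_factory=list)  # (key, value, effect)


BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


def schedulable(node: Node) -> bool:
    """Assumes tenant pods carry no tolerations for the node's taints."""
    return not any(effect in BLOCKING_EFFECTS for _, _, effect in node.taints)


def can_bid(nodes: List[Node], cpu_millis: int, memory_bytes: int) -> bool:
    """Bid only if a single untainted node can hold the whole pod."""
    return any(
        schedulable(n)
        and n.free_cpu_millis >= cpu_millis
        and n.free_memory_bytes >= memory_bytes
        for n in nodes
    )


GiB = 1024 ** 3
cluster = [
    Node("worker-1", 100, 1 * GiB),
    Node("worker-2", 200, 2 * GiB),
    # Lots of free capacity, but only because nothing may schedule here.
    Node("control-plane-1", 8000, 16 * GiB,
         taints=[("node-role.kubernetes.io/control-plane", "", "NoSchedule")]),
]

print(can_bid(cluster, cpu_millis=1000, memory_bytes=2 * GiB))  # False
print(can_bid(cluster, cpu_millis=100, memory_bytes=1 * GiB))   # True
```

A version that ignores `taints` would return True for the first call because `control-plane-1` alone has room. That matches the pattern in the FailedScheduling events: every untainted node reports insufficient resources, and the only node with room is tainted.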

Additional context: Discussed in more detail on the Sep 18th support call.

88plug commented 1 day ago


Users are reporting this in Discord as well.