Open arno01 opened 2 years ago
cc @boz @dmikey
I am that user -
Filled up a provider's machines with some xmrig deployments to over a 90% fill rate - had 2 of them crash, went to re-deploy a new image using the update button in Akashlytics and was unable to:
web: undefined [Warning] [FailedScheduling] [Pod] 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient cpu.
web: undefined [Normal] [SuccessfulCreate] [ReplicaSet] Created pod: web-6db9665ccb-92p4v
web: undefined [Normal] [ScalingReplicaSet] [Deployment] Scaled up replica set web-6db9665ccb to 1
As a provider, I have the same situation: the same user has 2 deployments and is unable to update them. A new ReplicaSet is created, but because of the lack of resources the new ReplicaSet never comes online. Note that there is no disruption of service, as the old ReplicaSet stays up.
Thanks all for the report. Interesting case. A few thoughts:
--force
with a clear message around it is a good suggestion. All of these options have drawbacks, of course. In the meantime, for mining and other stateless workloads, I suggest closing the deployment and creating a new one if you hit the described scenario.
I think having a default option which uses --force is a good solution. I think the majority of deployments can be forced. For the rare cases where high availability is needed, a special option could be used, but with the understanding that updates may be more difficult. Currently, as a provider, when I see a deployment stuck because of this (generally a miner), I just force-delete the old ReplicaSet and make sure the deployment works again.
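The manual remediation described above can be sketched as follows. The ReplicaSet names below are made up for illustration; in a real cluster the sorted list would come from kubectl:

```shell
# Sketch only: in practice the list would come from something like
#   kubectl -n <tenant-namespace> get replicaset \
#     --sort-by='{.metadata.creationTimestamp}' -o name
# Simulated output, oldest first (hypothetical names):
rs_sorted="web-6db9665ccb
web-7f4c881aa0"

# Everything except the last (newest) ReplicaSet is a deletion candidate,
# so the pending pod of the new ReplicaSet can finally schedule:
old_rs="$(printf '%s\n' "$rs_sorted" | head -n -1)"
printf '%s\n' "$old_rs"

# The actual force-delete would then be:
# for rs in $old_rs; do kubectl -n <tenant-namespace> delete replicaset "$rs"; done
```

This is the same "keep only the newest ReplicaSet" logic that the cron workaround below automates.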
I've hit this today, bug still in place.
This is still happening with provider-services 0.1.0, akash 0.20.0, especially when the provider is packed.
[Warning] [FailedScheduling] [Pod] 0/5 nodes are available: 5 Insufficient cpu. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.
Create the file /usr/local/bin/akash-force-new-replicasets.sh
cat > /usr/local/bin/akash-force-new-replicasets.sh <<'EOF'
#!/bin/bash
#
# Version: 0.2 - 25 March 2023
# Files:
# - /usr/local/bin/akash-force-new-replicasets.sh
# - /etc/cron.d/akash-force-new-replicasets
#
# Description:
# This workaround iterates over the newest deployments/replicasets whose pods can't get deployed due to "insufficient resources" errors, and removes the older replicasets, leaving only the newest (latest) one.
# This is only a workaround until a better solution to https://github.com/akash-network/support/issues/82 is found.
#
kubectl get deployment -l akash.network/manifest-service -A -o=jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{"\n"}{end}' |
while read -r ns app; do
  kubectl -n "$ns" rollout status --timeout=10s "deployment/${app}" >/dev/null 2>&1
  rc=$?
  if [[ $rc -ne 0 ]]; then
    if kubectl -n "$ns" describe pods | grep -q "Insufficient"; then
      OLD="$(kubectl -n "$ns" get replicaset -o json -l akash.network/manifest-service --sort-by='{.metadata.creationTimestamp}' | jq -r '(.items | reverse)[1:][] | .metadata.name')"
      for i in $OLD; do kubectl -n "$ns" delete replicaset "$i"; done
    fi
  fi
done
EOF
Mark it as executable:
chmod +x /usr/local/bin/akash-force-new-replicasets.sh
Create the crontab job /etc/cron.d/akash-force-new-replicasets to run the workaround every 5 minutes:
cat > /etc/cron.d/akash-force-new-replicasets << 'EOF'
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
SHELL=/bin/bash
*/5 * * * * root /usr/local/bin/akash-force-new-replicasets.sh
EOF
Workaround has been added to the docs https://akash.network/docs/providers/provider-faq-and-guide/#force-new-replicaset-workaround by @chainzero
Let the user define what kind of SLA he wants on his deployments. If he is OK with a short downtime during redeployment, then just set the strategy to Recreate instead of Rolling. That way the old pod is taken down first before spinning up a new pod. If he wants a high-availability SLA, then set the strategy to Rolling (and replicas to 2+).
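A minimal sketch of the Recreate variant, using standard Kubernetes Deployment fields (the workload name and image here are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # hypothetical workload name
spec:
  replicas: 1
  strategy:
    type: Recreate     # old pod is deleted before the new one is created,
                       # so the update cannot deadlock on a full node
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/image:latest   # hypothetical image
```

The trade-off is a short downtime during every update, which is acceptable for stateless workloads like miners.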
The other thing is eviction policy: you can set pods with a lower priorityClass to be evicted whenever a higher-priority pod gets scheduled or whenever the node has eviction pressure (is full). After pods are evicted (force-deleted) and the node is freed up, new pod schedules can take place. This way the disruption hits the lower-priority pods, not the highest-priority ones. And as soon as the temporary extra pod from the redeployment is gone, the evicted low-priority pod would fit back onto the node too.
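Standard Kubernetes scheduler preemption already behaves roughly this way once priorities are assigned; a sketch with a PriorityClass (the name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority        # illustrative name
value: 1000
globalDefault: false
description: "Workloads that may be preempted when the cluster is full."
---
# Pods would then reference it in their spec:
#   spec:
#     priorityClassName: low-priority
```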
And the last thing: if the user buys a certain amount of resources, those resources should get reserved using a ResourceQuota. This way we make sure there is always room left.
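A sketch of such a quota on a tenant namespace (the name, namespace, and quantities are illustrative). Note that a ResourceQuota caps what a namespace may request rather than reserving node capacity, so it addresses overcommit but not placement:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota          # illustrative name
  namespace: tenant-namespace # illustrative namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "4"
    limits.memory: 8Gi
```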
I am mentioning all this because a shell script as a cronjob doesn't really feel like the right "cloud-native" way of working. Kubernetes has so many powerful features, especially for these types of scheduling use-cases, that it is a waste not to do it the correct way.
The other thing is: a professional provider would not let its cluster fill up completely and would always keep a margin. I would suggest making some margin mandatory, and making capacity monitoring and management mandatory (guidelines?).
This issue should be an edge case, as node fill-ups should not happen. The focus of this issue should be prevention, not remediation in hindsight.
Thank you @sterburg for your observations, they all make total sense to me. I've turned this into the discussion post here
When a user updates their deployment, they may get the following message, which can be confusing:
This happens because K8s won't destroy an old pod instance until it ensures the new one has been created. Since there is no node available for deploying the new pod, it gets stuck in the "Running" & "Pending" state. Things will move on as soon as one of the nodes gets enough CPU, RAM & disk as requested by the deployment. This is how K8s works in order to prevent a service outage; however, the user might want to get a better message. Alternatively, a user could be granted an option such as
--force
which would destroy the previously running pod, i.e. that would probably be similar to the destroy & recreate method.