Open arno01 opened 2 years ago
cc @boz @dmikey
I am that user -
Filled up a provider's machines with some xmrig deployments to over a 90% fill rate - had 2 of them crash, went to re-deploy a new image using the update button in Akashlytics and was unable to:
web: undefined [Warning] [FailedScheduling] [Pod] 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient cpu.
web: undefined [Normal] [SuccessfulCreate] [ReplicaSet] Created pod: web-6db9665ccb-92p4v
web: undefined [Normal] [ScalingReplicaSet] [Deployment] Scaled up replica set web-6db9665ccb to 1
As a provider, I have the same situation: the same user has 2 deployments and is unable to update them. A new ReplicaSet is created, but because of the lack of resources the new ReplicaSet never comes online. Note that there is no disruption of service, as the old ReplicaSet stays up.
Thanks all for the report. Interesting case. A few thoughts:
--force
with a clear message around it is a good suggestion. All of these options have drawbacks, of course. In the meantime, for mining and other stateless workloads, I suggest closing the deployment and creating a new one if you hit the described scenario.
I think having a default option which uses --force is a good solution. I think the majority of deployments can be forced. For the rare cases where high availability is needed, a special option could be used, but with the understanding that updates may be more difficult. Currently, as a provider, when I see a deployment stuck because of this (generally a miner), I just force-delete the old ReplicaSet and make sure the deployment works again.
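The manual remediation described above can be sketched as follows. The ReplicaSet names below are made up for illustration; in a real cluster the sorted list would come from kubectl:

```shell
# Sketch only: in practice the list would come from something like
#   kubectl -n <tenant-namespace> get replicaset \
#     --sort-by='{.metadata.creationTimestamp}' -o name
# Simulated output, oldest first (hypothetical names):
rs_sorted="web-6db9665ccb
web-7f4c881aa0"

# Everything except the last (newest) ReplicaSet is a deletion candidate,
# so the pending pod of the new ReplicaSet can finally schedule:
old_rs="$(printf '%s\n' "$rs_sorted" | head -n -1)"
printf '%s\n' "$old_rs"

# The actual force-delete would then be:
# for rs in $old_rs; do kubectl -n <tenant-namespace> delete replicaset "$rs"; done
```

This is the same "keep only the newest ReplicaSet" logic that the cron workaround below automates.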
I've hit this today, bug still in place.
This is still happening with provider-services 0.1.0, akash 0.20.0, especially when the provider is packed.
[Warning] [FailedScheduling] [Pod] 0/5 nodes are available: 5 Insufficient cpu. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.
Create the file /usr/local/bin/akash-force-new-replicasets.sh
cat > /usr/local/bin/akash-force-new-replicasets.sh <<'EOF'
#!/bin/bash
#
# Version: 0.2 - 25 March 2023
# Files:
# - /usr/local/bin/akash-force-new-replicasets.sh
# - /etc/cron.d/akash-force-new-replicasets
#
# Description:
# This workaround iterates over the newest deployments/replicasets whose pods can't get deployed due to "insufficient resources" errors, and removes the older replicasets, leaving only the newest (latest) one.
# This is only a workaround until a better solution to https://github.com/akash-network/support/issues/82 is found.
#
kubectl get deployment -l akash.network/manifest-service -A -o=jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{"\n"}{end}' |
while read -r ns app; do
  kubectl -n "$ns" rollout status --timeout=10s "deployment/${app}" >/dev/null 2>&1
  rc=$?
  if [[ $rc -ne 0 ]]; then
    if kubectl -n "$ns" describe pods | grep -q "Insufficient"; then
      OLD="$(kubectl -n "$ns" get replicaset -o json -l akash.network/manifest-service --sort-by='{.metadata.creationTimestamp}' | jq -r '(.items | reverse)[1:][] | .metadata.name')"
      for i in $OLD; do kubectl -n "$ns" delete replicaset "$i"; done
    fi
  fi
done
EOF
Mark it as executable:
chmod +x /usr/local/bin/akash-force-new-replicasets.sh
Create the crontab job /etc/cron.d/akash-force-new-replicasets to run the workaround every 5 minutes:
cat > /etc/cron.d/akash-force-new-replicasets << 'EOF'
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
SHELL=/bin/bash
*/5 * * * * root /usr/local/bin/akash-force-new-replicasets.sh
EOF
Workaround has been added to the docs https://akash.network/docs/providers/provider-faq-and-guide/#force-new-replicaset-workaround by @chainzero
Let the user define what kind of SLA he wants on his deployments. If he is OK with a short downtime during redeployment, then just set the strategy to Recreate instead of Rolling. That way the old pod is taken down first before spinning up a new pod. If he wants a high-availability SLA, then set the strategy to Rolling (and replicas to 2+).
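A minimal sketch of the Recreate variant, using standard Kubernetes Deployment fields (the workload name and image here are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # hypothetical workload name
spec:
  replicas: 1
  strategy:
    type: Recreate     # old pod is deleted before the new one is created,
                       # so the update cannot deadlock on a full node
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example/image:latest   # hypothetical image
```

The trade-off is a short downtime during every update, which is acceptable for stateless workloads like miners.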
The other thing is eviction policy: you can set pods with a lower priorityClass to be evicted whenever a higher-priority pod gets scheduled or whenever the node has eviction pressure (is full). After pods are evicted (force-deleted) and the node is freed up, new pod schedules can take place. This way the disruption hits the lower-priority pods, not the highest-priority ones. And as soon as the temporary extra pod from the redeployment is gone, the evicted low-priority pod would fit back onto the node too.
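Standard Kubernetes scheduler preemption already behaves roughly this way once priorities are assigned; a sketch with a PriorityClass (the name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority        # illustrative name
value: 1000
globalDefault: false
description: "Workloads that may be preempted when the cluster is full."
---
# Pods would then reference it in their spec:
#   spec:
#     priorityClassName: low-priority
```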
And the last thing: if the user buys a certain amount of resources, those resources should get reserved using a ResourceQuota. This way we make sure there is always room left.
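A sketch of such a quota on a tenant namespace (the name, namespace, and quantities are illustrative). Note that a ResourceQuota caps what a namespace may request rather than reserving node capacity, so it addresses overcommit but not placement:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota          # illustrative name
  namespace: tenant-namespace # illustrative namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "4"
    limits.memory: 8Gi
```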
I am mentioning all this because a shell script as a cronjob doesn't really feel like the right "cloud-native" way of working. Kubernetes has so many powerful features, especially for these types of scheduling use-cases, that it is a waste not to do it the correct way.
The other thing is: a professional provider would not let its cluster fill up completely and would always keep a margin. I would suggest making some margin mandatory, and making capacity monitoring and management mandatory (guidelines?).
This issue should be an edge case, as node fill-ups should not happen. The focus of this issue should be prevention, not remediation in hindsight.
Thank you @sterburg for your observations, they all make total sense to me. I've turned this into the discussion post here
When a user updates their deployment, they may get the following message, which can be confusing:
This happens because K8s won't destroy an old pod instance until it ensures the new one has been created. Since there is no node available for deploying the new pod, it gets stuck in the "Running" & "Pending" state. Things will move on as soon as one of the nodes gets enough CPU, RAM & disk as requested by the deployment. This is how K8s works in order to prevent a service outage; however, the user might want to get a better message. Alternatively, a user could be granted an option such as
--force
which would destroy the previously running pod, i.e. that would probably be similar to the destroy & recreate method.