akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

provider says "insufficient capacity" sporadically #130

Closed andy108369 closed 11 months ago

andy108369 commented 1 year ago

The provider sometimes reports insufficient capacity (the deployment usually succeeds on a second attempt). I started seeing this today:

$ date; provider_info.sh provider.hurricane.akash.pub
Thu Sep 28 06:36:37 PM CEST 2023
type       cpu     gpu  ram                 ephemeral           persistent
used       49.5    1    143.5               388.5               500
pending    0       0    0                   0                   0
available  43.395  0    30.681856155395508  1420.2646561246365  713.9642942994833
node       43.395  0    30.681856155395508  1420.2646561246365  N/A
D[2023-09-28|16:32:54.261] cluster resources dump={"nodes":[{"name":"worker-01.hurricane2","allocatable":{"cpu":102000,"gpu":1,"memory":210936590336,"storage_ephemeral":1942146261054},"available":{"cpu":43395,"gpu":0,"memory":32944392192,"storage_ephemeral":1524997562430}}],"total_allocatable":{"cpu":102000,"gpu":1,"memory":210936590336,"storage_ephemeral":1942146261054,"storage":{"beta3":828738306048}},"total_available":{"cpu":43395,"gpu":0,"memory":32944392192,"storage_ephemeral":1524997562430,"storage":{"beta3":767944789872}}} module=provider-cluster cmp=provider cmp=service cmp=inventory-service
I[2023-09-28|16:32:58.882] order detected                               module=bidengine-service cmp=provider order=order/akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/13006204/1/1
I[2023-09-28|16:32:58.884] group fetched                                module=bidengine-order cmp=provider order=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/13006204/1/1
I[2023-09-28|16:32:58.884] requesting reservation                       module=bidengine-order cmp=provider order=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/13006204/1/1
D[2023-09-28|16:32:58.884] reservation requested                        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/13006204/1/1 resources[{resource:{id:1,cpu:{units:{val:2000}},memory:{size:{val:2147483648}},storage:[{name:default,size:{val:1073741824}}],gpu:{units:{val:0}},endpoints:[{sequence_number:0}]},count:1,price:{denom:uakt,amount:1000.000000000000000000}},{resource:{id:2,cpu:{units:{val:2000}},memory:{size:{val:8589934592}},storage:[{name:default,size:{val:1073741824}}],gpu:{units:{val:0}},endpoints:[{sequence_number:0}]},count:1,price:{denom:uakt,amount:1000.000000000000000000}}]=(MISSING)
I[2023-09-28|16:32:58.884] insufficient capacity for reservation        module=provider-cluster cmp=provider cmp=service cmp=inventory-service order=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/13006204/1/1
E[2023-09-28|16:32:58.884] reserving resources                          module=bidengine-order cmp=provider order=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/13006204/1/1 err="insufficient capacity"
I[2023-09-28|16:32:58.884] shutting down                                module=bidengine-order cmp=provider order=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/13006204/1/1
D[2023-09-28|16:33:01.525] cluster resources dump={"nodes":[{"name":"worker-01.hurricane2","allocatable":{"cpu":102000,"gpu":1,"memory":210936590336,"storage_ephemeral":1942146261054},"available":{"cpu":43395,"gpu":0,"memory":32944392192,"storage_ephemeral":1524997562430}}],"total_allocatable":{"cpu":102000,"gpu":1,"memory":210936590336,"storage_ephemeral":1942146261054,"storage":{"beta3":828694003712}},"total_available":{"cpu":43395,"gpu":0,"memory":32944392192,"storage_ephemeral":1524997562430,"storage":{"beta3":767900487536}}} module=provider-cluster cmp=provider cmp=service cmp=inventory-service
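The figures in these log lines can be cross-checked by hand: the reservation asks for two resource units, and their sum can be compared against `total_available` from the preceding resources dump. The following is a minimal sketch of that arithmetic, not the provider's actual inventory logic (which may apply per-node fitting or other rules not visible in the logs). The helper names `total_request` and `shortfalls` are hypothetical, and mapping the request's `default` storage to `storage_ephemeral` is an assumption.

```python
import json

# Trimmed copy of "total_available" from the 16:32:54 cluster resources dump.
total_available = json.loads(
    '{"cpu":43395,"gpu":0,"memory":32944392192,"storage_ephemeral":1524997562430}'
)

# The two resource units from the "reservation requested" line (count 1 each).
# Assumption: the "default" storage volume counts against ephemeral storage.
requested_units = [
    {"cpu": 2000, "gpu": 0, "memory": 2147483648, "storage_ephemeral": 1073741824},
    {"cpu": 2000, "gpu": 0, "memory": 8589934592, "storage_ephemeral": 1073741824},
]


def total_request(units):
    """Sum each resource across all requested units (hypothetical helper)."""
    totals = {}
    for unit in units:
        for key, value in unit.items():
            totals[key] = totals.get(key, 0) + value
    return totals


def shortfalls(request, available):
    """Return only the resources where the request exceeds what is available."""
    return {
        key: value - available.get(key, 0)
        for key, value in request.items()
        if value > available.get(key, 0)
    }


request = total_request(requested_units)
missing = shortfalls(request, total_available)

print("request:", request)
print("shortfalls:", missing or "none")
```

Under these assumptions the summed request (4000 millicores, 10 GiB of memory, 2 GiB of storage) is well below the reported totals, which is consistent with the "insufficient capacity" result being spurious rather than a real shortage.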

I saw the same issue earlier today on the same provider:

[https://rpc.akashnet.net:443][default][13001588--1]$ cat deploy.yaml
# Simple deployment.
---
version: "2.0"

services:
  app:
    image: bsord/tetris
    # command:
    #   - "sh"
    #   - "-c"
    # args:
    #   - sleep infinity
    expose:
      - port: 80
        as: 80
        to:
          - global: true
        #accept:
        #  - "tetris.yourdomain.com"

profiles:
  compute:
    app:
      resources:
        cpu:
          units: 1
        memory:
          size: 4Gi
        storage:
          size: 20Gi
  placement:
    akash:
      pricing:
        app:
          denom: uakt
          amount: 1000000

deployment:
  app:
    akash:
      profile: app
      count: 1

$ date; provider_info.sh provider.hurricane.akash.pub
Thu Sep 28 10:56:22 AM CEST 2023
type       cpu     gpu  ram                 ephemeral           persistent
used       55.5    1    156.5               392.5               500
pending    0       0    0                   0                   0
available  37.395  0    17.681856155395508  1416.2646561246365  728.6417311616242
node       37.395  0    17.681856155395508  1416.2646561246365  N/A
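The same comparison can be made for this earlier attempt by checking the `deploy.yaml` request against the `node` row of the table. This is a rough sketch only: `parse_table` is a hypothetical helper, and it assumes the `cpu` column is in cores, the `ram` and `ephemeral` columns are in GiB, and the SDL's unnamed storage volume is ephemeral.

```python
# provider_info.sh output from 10:56:22 CEST, copied from the issue.
TABLE = """\
type       cpu     gpu  ram                 ephemeral           persistent
used       55.5    1    156.5               392.5               500
pending    0       0    0                   0                   0
available  37.395  0    17.681856155395508  1416.2646561246365  728.6417311616242
node       37.395  0    17.681856155395508  1416.2646561246365  N/A
"""


def parse_table(text):
    """Parse the whitespace-separated table into {row: {column: float or None}}."""
    lines = text.strip().splitlines()
    columns = lines[0].split()[1:]  # skip the leading "type" label
    rows = {}
    for line in lines[1:]:
        name, *cells = line.split()
        rows[name] = {
            col: (None if cell == "N/A" else float(cell))
            for col, cell in zip(columns, cells)
        }
    return rows


# Request from deploy.yaml: 1 CPU unit, 4Gi memory, 20Gi storage, no GPU.
# Assumption: table units are cores and GiB, and the storage is ephemeral.
request = {"cpu": 1.0, "gpu": 0.0, "ram": 4.0, "ephemeral": 20.0}

rows = parse_table(TABLE)
node = rows["node"]
fits = all(node[key] is not None and request[key] <= node[key] for key in request)

print("fits on node:", fits)
```

With those unit assumptions the request fits on the single node with room to spare, so the table alone does not explain the rejection.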


andy108369 commented 11 months ago

I can't reproduce this after disabling unattended upgrades.

Unattended upgrades were likely the root cause of the issue; see https://github.com/akash-network/support/issues/131