akash-network / support

Akash Support and Issue Tracking

Provider Status Endpoint Shows Very High GPU Count if Node is Mis-Labeled #120

Closed: chainzero closed this issue 1 year ago

chainzero commented 1 year ago

Overview

Problem Summary

When a Kubernetes node's GPU capability label is set to false instead of true, the provider's status API endpoint erroneously reports a very large GPU count for that node.

Example label that would cause the issue:

kubectl label node node1 akash.network/capabilities.gpu.vendor.nvidia.model.a4000=false

Example provider status endpoint output when a node is labeled this way with a false value (note the gpu count of the first entry in the nodes array):

      "available": {
        "nodes": [
          {
            "cpu": 20425,
            "gpu": 18446744073709552000,
            "memory": 99048384512,
            "storage_ephemeral": 205218516186
          },
          {
            "cpu": 2825,
            "gpu": 2,
            "memory": 15607097344,
            "storage_ephemeral": 239578254554
          }
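The reported gpu value is almost exactly 2^64, which is what an unsigned 64-bit counter looks like after wrapping below zero and being rendered as a floating-point JSON number. The issue does not confirm this as the cause, so the following Go sketch is only an illustration under that assumption. The helpers `subtractGPUs` and `asJSONFloat` are hypothetical and are not taken from the provider code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// subtractGPUs is a hypothetical helper that mimics an unchecked
// "capacity minus allocated" computation on unsigned 64-bit counters.
// When allocated > capacity the result wraps modulo 2^64.
func subtractGPUs(capacity, allocated uint64) uint64 {
	return capacity - allocated
}

// asJSONFloat renders the counter the way a float64-based JSON
// encoder would, which rounds 2^64-1 up to 2^64.
func asJSONFloat(v uint64) string {
	out, err := json.Marshal(float64(v))
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	// Assumed scenario: the node now has 0 GPUs but 1 is still counted as in use.
	wrapped := subtractGPUs(0, 1)
	fmt.Println(wrapped)              // the raw wrapped uint64 value
	fmt.Println(asJSONFloat(wrapped)) // the same value after float64 rounding
}
```

If this assumption holds, a guard such as returning 0 when allocated exceeds capacity would keep the status output within a plausible range.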

Additional Details

Akash documentation suggests only ever setting GPU capability labels to true, for example:

kubectl label node node1 akash.network/capabilities.gpu.vendor.nvidia.model.a4000=true

However, users may believe that when a GPU is removed from a node the label should be updated to false, which is how this issue was discovered. If the label is instead removed from the node, which is what we instruct users to do to resolve the issue, the problem does not occur.
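One way to catch such labels before they reach the provider is to scan a node's labels for GPU capability keys whose value is anything other than true. The Go sketch below works on a plain label map. The prefix is inferred from the example label above, and `suspectGPULabels` is a hypothetical helper, not part of the Akash provider or the Kubernetes client libraries.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// gpuLabelPrefix is assumed from the example label in this issue.
const gpuLabelPrefix = "akash.network/capabilities.gpu."

// suspectGPULabels returns, in sorted order, the GPU capability label
// keys whose value is not exactly "true". Per the guidance above, such
// labels should be removed from the node rather than set to false.
func suspectGPULabels(labels map[string]string) []string {
	var suspect []string
	for key, value := range labels {
		if strings.HasPrefix(key, gpuLabelPrefix) && value != "true" {
			suspect = append(suspect, key)
		}
	}
	sort.Strings(suspect)
	return suspect
}

func main() {
	// Example label set for a node; values are illustrative.
	labels := map[string]string{
		"kubernetes.io/hostname": "node1",
		"akash.network/capabilities.gpu.vendor.nvidia.model.a4000": "false",
	}
	for _, key := range suspectGPULabels(labels) {
		fmt.Printf("label %s should be removed, not set to false\n", key)
	}
}
```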

chainzero commented 1 year ago

Closing this issue, as it has been deemed not to be caused by improper label use. Rather, the issue occurs when a GPU is removed from a running (active, powered-on) host.