Closing this issue, as it has been deemed not to be an issue with improper label use. Rather, the issue occurs when a GPU is removed from a running/active/powered-on host.
Overview
0.4.6
Problem Summary
When Kubernetes nodes are labeled with a GPU capability value of false instead of true, the provider's status API endpoint erroneously reports a very large count of GPU resources.
Example label that would cause the issue:
kubectl label node node1 akash.network/capabilities.gpu.vendor.nvidia.model.a4000=false
Example provider status endpoint output when a node is labeled in this manner with a false value (note the gpu count of the first index in the array):
Additional Details
Akash documentation suggests setting GPU capabilities only to true, such as:
kubectl label node node1 akash.network/capabilities.gpu.vendor.nvidia.model.a4000=true
However, users may believe that when a GPU is removed from a node, the label should be updated to false, which is how this issue was discovered. If the label is instead removed from the node, which is what we instruct users to do to resolve the issue, no problem occurs.
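The remedy described above (removing the label rather than setting it to false) can be sketched as follows. This is a minimal sketch and is not taken from the Akash documentation: the helper function names are hypothetical, and node1 and the a4000 label key are simply reused from the example above. It relies on two standard kubectl behaviors: an equality-based label selector (-l key=value) to list matching nodes, and a trailing dash after a label key to remove that label. The helpers only print the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Label key reused from the example above; adjust for other GPU models.
LABEL="akash.network/capabilities.gpu.vendor.nvidia.model.a4000"

# Hypothetical helper: print a command that lists nodes whose GPU label
# is set to false (kubectl equality-based label selector).
find_false_labels_cmd() {
  printf "kubectl get nodes -l '%s=false'\n" "$1"
}

# Hypothetical helper: print a command that removes the label from a node.
# The trailing "-" after the key tells kubectl to delete the label.
remove_label_cmd() {
  printf 'kubectl label node %s %s-\n' "$1" "$2"
}

find_false_labels_cmd "$LABEL"
remove_label_cmd node1 "$LABEL"
```

Once the printed commands look right, they can be run directly against the cluster in the current kubectl context.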