akash-network / support

Akash Support and Issue Tracking

Provider Stats Endpoint fails to account for Service Count in GPU deployments #147

Open andy108369 opened 10 months ago

andy108369 commented 10 months ago

Description: There is an issue in the provider stats endpoint concerning GPU utilization reporting, specifically when handling deployments requesting GPUs across service count >1. This problem is evident in provider version 0.4.7 and Akash Network version 0.26.2.

Issue Details: The current implementation of the provider stats endpoint does not factor the service count into its calculation for deployments that request GPUs. As a result, the total GPU usage and availability it displays is inaccurate.

Example Scenario: Consider a GPU deployment consisting of two services:

  1. First service with count: 14 and gpu: 2.
  2. Second service with count: 1 and gpu: 2.

Theoretically, the total GPU usage should be 30 (calculated as 14*2 + 1*2), but this is not reflected in the provider stats.
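
As a quick sanity check, the expected total can be reproduced with a one-line shell calculation (the per-service figures are the illustrative ones from the scenario above, not values read from the provider):

$ echo $(( 14*2 + 1*2 ))
30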

Observed Output: For the provider at provider.akash-ai.com (akash1c6rsz4f59nkus3s5qauxxh969j2mtkkn2clk2e), the stats endpoint incorrectly reports only 4 GPUs in use instead of the expected 30. The script output is as follows (derived from the :8443/status report shown below):

$ provider_info.sh provider.akash-ai.com
type       cpu      gpu  ram                 ephemeral          persistent
used       180      4    428                 3700               0
pending    0        0    0                   0                  0
available  564.8    2    1735.996597290039   3378.038669425994  0
node       171      0    448.06262588500977  869.5096673564985  N/A
node       170.78   0    447.92784881591797  869.5096673564985  N/A
node       171.495  0    447.97326850891113  869.5096673564985  N/A
node       51.525   2    392.0328540802002   769.5096673564985  N/A
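
For reference, the under-counted figure in the "used" row can be read straight out of the provider's status endpoint with curl and jq (this simply re-extracts the JSON shown at the end of this report; it is not meant to describe how provider_info.sh works internally):

$ curl -s -k https://provider.akash-ai.com:8443/status | jq '.cluster.inventory.active[].gpu'
4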

Expected Behavior: The provider stats endpoint should accurately represent the total number of GPUs in use, incorporating the 'service count' in its calculation for deployments with GPU requests.

Impact: This inaccurate reporting can lead to misunderstandings regarding resource availability and utilization, potentially affecting scheduling decisions and overall resource management on the Akash Network.


Additional info

root@node1:~# kubectl get deployment -A -o yaml | grep -Ei 'gpu|readyReplicas'
...
    readyReplicas: 1
                - key: akash.network/capabilities.gpu.vendor.nvidia.model.a100
              nvidia.com/gpu: "2"
              nvidia.com/gpu: "2"
    readyReplicas: 1
                - key: akash.network/capabilities.gpu.vendor.nvidia.model.a100
          image: REDACTED
              nvidia.com/gpu: "2"
              nvidia.com/gpu: "2"
    readyReplicas: 14
root@node1:~# 
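
One way to cross-check the real GPU usage independently of the provider stats is to sum the nvidia.com/gpu requests of the running pods directly from Kubernetes. A sketch, assuming the GPU requests are set on the pod containers as in the deployments above:

$ kubectl get pods -A -o json | jq '[.items[]
    | select(.status.phase == "Running")
    | .spec.containers[].resources.requests["nvidia.com/gpu"] // "0"
    | tonumber] | add'

Given the readyReplicas shown above (14 + 1 replicas, each requesting 2 GPUs), this should come out to 30 rather than the 4 reported by the endpoint.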
$ curl -s -k https://provider.akash-ai.com:8443/status | jq -r . 
{
  "cluster": {
    "leases": 1,
    "inventory": {
      "active": [
        {
          "cpu": 180000,
          "gpu": 4,
          "memory": 459561500672,
          "storage_ephemeral": 3972844748800
        }
      ],
      "available": {
        "nodes": [
          {
            "cpu": 171000,
            "gpu": 0,
            "memory": 481103581184,
            "storage_ephemeral": 933628896213
          },
          {
            "cpu": 170780,
            "gpu": 0,
            "memory": 480958865408,
            "storage_ephemeral": 933628896213
          },
          {
            "cpu": 171495,
            "gpu": 0,
            "memory": 481007634432,
            "storage_ephemeral": 933628896213
          },
          {
            "cpu": 51525,
            "gpu": 2,
            "memory": 420942071808,
            "storage_ephemeral": 826254713813
          }
        ]
      }
    }
  },
  "bidengine": {
    "orders": 0
  },
  "manifest": {
    "deployments": 0
  },
  "cluster_public_hostname": "provider.akash-ai.com",
  "address": "akash1c6rsz4f59nkus3s5qauxxh969j2mtkkn2clk2e"
}
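
For completeness, the GPU count the endpoint still advertises as available can be summed from the same payload; it matches the "available" row of the table above:

$ curl -s -k https://provider.akash-ai.com:8443/status | jq '[.cluster.inventory.available.nodes[].gpu] | add'
2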