G-Research / unicorn-history-server

A service to store and provide historical data for K8S clusters using the Yunikorn scheduler
Apache License 2.0

Add endpoint to get User and Groups resource usage #104

Open richscott opened 4 months ago

richscott commented 4 months ago

Add a handler (and prerequisite database layer structures/code, if they don't already exist) to replicate the yunikorn-core endpoints

to allow YHS users to get resource usage metrics for users and groups, at aggregate and individual levels.

sudiptob2 commented 4 months ago

Hi, I am interested in this issue.

sudiptob2 commented 3 months ago

Goal

We want to answer the question: "How much resource is being used by a user or group over a specific period of time?"

  1. User-specific resource usage
  2. Group-specific resource usage

Existing Resource Usage API in Yunikorn Core

  1. /ws/v1/partition/{partitionName}/usage/users
  2. /ws/v1/partition/{partitionName}/usage/user/{userName}
  3. /ws/v1/partition/{partitionName}/usage/groups
  4. /ws/v1/partition/{partitionName}/usage/group/{groupName}
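
For reference, a minimal Go sketch that builds these four paths for a given partition, user, and group. The helper name `usagePaths` is hypothetical and not part of yunikorn-core or YHS; only the path layout comes from the list above.

```go
package main

import (
	"fmt"
	"net/url"
)

// usagePaths returns the four usage endpoint paths listed above for one
// partition, user, and group. The function name is hypothetical.
func usagePaths(partition, user, group string) []string {
	base := "/ws/v1/partition/" + url.PathEscape(partition) + "/usage"
	return []string{
		base + "/users",
		base + "/user/" + url.PathEscape(user),
		base + "/groups",
		base + "/group/" + url.PathEscape(group),
	}
}

func main() {
	// "default", "user1" and "tester" are example values only.
	for _, p := range usagePaths("default", "user1", "tester") {
		fmt.Println(p)
	}
}
```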

These APIs return the resource usage of queues in a hierarchical response. A similar response will not be useful for our purpose because it does not account for historical resource usage. The APIs also report usage per queue, but multiple users can deploy into the same queue, and the queue creator might not be the user who deploys the application.

Sample response of the yunikorn-core resource usage endpoints

[
  {
    "userName": "user1",
    "groups": {
      "app2": "tester"
    },
    "queues": {
      "queuePath": "root",
      "resourceUsage": {...},
      "children": [
        {
          "queuePath": "root.default",
          "resourceUsage": {...},
          "children": [
            {
              "queuePath": "root.default.test",
              "resourceUsage": {
                "memory": 6000000000,
                "vcore": 6000
              },
              "children": [...]
            }
          ]
        }
      ]
    }
  }
]
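
To make the shape of this response concrete, here is a rough Go sketch that decodes it and walks the queue tree down to the leaf queues. The JSON field names come from the sample above; the Go type and function names (`QueueUsage`, `UserUsage`, `leafUsage`, `parseUsers`) are hypothetical, and the embedded sample replaces the elided `{...}` and `[...]` parts with empty placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// QueueUsage mirrors one node of the hierarchical "queues" tree.
type QueueUsage struct {
	QueuePath     string           `json:"queuePath"`
	ResourceUsage map[string]int64 `json:"resourceUsage"`
	Children      []QueueUsage     `json:"children"`
}

// UserUsage mirrors one element of the top-level array.
type UserUsage struct {
	UserName string            `json:"userName"`
	Groups   map[string]string `json:"groups"`
	Queues   QueueUsage        `json:"queues"`
}

// parseUsers decodes the endpoint response body.
func parseUsers(data []byte) ([]UserUsage, error) {
	var users []UserUsage
	err := json.Unmarshal(data, &users)
	return users, err
}

// leafUsage collects the usage of every leaf queue, keyed by queue path.
func leafUsage(q QueueUsage, out map[string]map[string]int64) {
	if len(q.Children) == 0 {
		out[q.QueuePath] = q.ResourceUsage
		return
	}
	for _, c := range q.Children {
		leafUsage(c, out)
	}
}

// sample is the response above with elided parts replaced by empty values.
const sample = `[{
  "userName": "user1",
  "groups": {"app2": "tester"},
  "queues": {
    "queuePath": "root", "resourceUsage": {},
    "children": [{
      "queuePath": "root.default", "resourceUsage": {},
      "children": [{
        "queuePath": "root.default.test",
        "resourceUsage": {"memory": 6000000000, "vcore": 6000},
        "children": []
      }]
    }]
  }
}]`

func main() {
	users, err := parseUsers([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, u := range users {
		leaves := make(map[string]map[string]int64)
		leafUsage(u.Queues, leaves)
		for path, usage := range leaves {
			fmt.Println(u.UserName, path, usage)
		}
	}
}
```

Note that nothing in this structure carries a timestamp or an application ID, which is the limitation discussed next.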

Problem with returning resource usage by queues

Scenario 1: Let's say user-a and user-b have both submitted jobs to the same queue. If we query the resource usage of that queue, we only get the queue's total resource usage, with no per-user breakdown (referring to the current DB).

Probable Solution

Resource usage for each application is stored in the application table, which also tracks user and group information. So we might be able to get the resource usage of a user or group by querying the application table.

Some useful columns in the application table are:

  1. user
  2. queue_name
  3. used_resource
  4. max_used_resource
  5. pending_resource
  6. state_log

So, we can simplify the response to the following format:

[
  {
    "userName": "user1",
    "groups": {
      "app2": "tester"
    },
    "applications": [
      {
        "queuePath": "root.default.test",
        "app_id": "app1",
        "maxUsedResource": {
          "memory": 6000000000,
          "vcore": 6000
        }
      }
    ]
  }
]
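
A rough Go sketch of how application rows could be grouped into this per-user format. It assumes the rows have already been read from the application table into memory; `AppRow`, `AppUsage`, `UserAppUsage`, and `groupByUser` are hypothetical names, not existing YHS types, and the `groups` field is left out for brevity. The example data reuses Scenario 1, where user-a and user-b share one queue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// AppRow is a hypothetical in-memory view of one application table row.
type AppRow struct {
	AppID           string
	User            string
	QueueName       string
	MaxUsedResource map[string]int64
}

// AppUsage is one entry of the "applications" list in the proposed response.
type AppUsage struct {
	QueuePath       string           `json:"queuePath"`
	AppID           string           `json:"app_id"`
	MaxUsedResource map[string]int64 `json:"maxUsedResource"`
}

// UserAppUsage is one element of the proposed response array.
type UserAppUsage struct {
	UserName     string     `json:"userName"`
	Applications []AppUsage `json:"applications"`
}

// groupByUser groups rows by user and returns users in sorted order.
func groupByUser(rows []AppRow) []UserAppUsage {
	byUser := make(map[string][]AppUsage)
	for _, r := range rows {
		byUser[r.User] = append(byUser[r.User], AppUsage{
			QueuePath:       r.QueueName,
			AppID:           r.AppID,
			MaxUsedResource: r.MaxUsedResource,
		})
	}
	names := make([]string, 0, len(byUser))
	for name := range byUser {
		names = append(names, name)
	}
	sort.Strings(names)
	out := make([]UserAppUsage, 0, len(names))
	for _, name := range names {
		out = append(out, UserAppUsage{UserName: name, Applications: byUser[name]})
	}
	return out
}

func main() {
	// Illustrative rows only: two users deploying into the same queue.
	rows := []AppRow{
		{AppID: "app1", User: "user-a", QueueName: "root.default.test",
			MaxUsedResource: map[string]int64{"memory": 6000000000, "vcore": 6000}},
		{AppID: "app2", User: "user-b", QueueName: "root.default.test",
			MaxUsedResource: map[string]int64{"memory": 2000000000, "vcore": 1000}},
	}
	b, err := json.MarshalIndent(groupByUser(rows), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Grouping by application rather than by queue keeps the per-user breakdown even when several users share a queue.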

Questions

  1. Can we report the resource usage of a user for a specific duration? Note that we don't store resource usage in a time-series format.