NVIDIA / nvkind


Autoscaling on a fractional custom metric with GPU #3

Closed chakpongchung closed 5 months ago

chakpongchung commented 5 months ago

Hi Kevin,

I ran into a potential bug when using nvkind with 8x A100 GPUs for autoscaling, and I wonder whether it is related to nvkind itself.

The potential bug could be related to the scaling algorithm described here: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
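
For example (with made-up numbers): if currentReplicas = 1, currentMetricValue = 180, and desiredMetricValue = 90, then

desiredReplicas = ceil[1 * (180 / 90)] = 2

and, as I understand the algorithm, a further scale-up past 2 then requires the metric the HPA sees (which, for per-pod metrics, is averaged across replicas) to still sit above 90 once two replicas are running.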

In my HPA experiment, desiredMetricValue is set to 90 and currentReplicas starts at 1.

Given a workload pattern hitting the service that keeps currentMetricValue above desiredMetricValue (90 in this case), I expect desiredReplicas to eventually reach the max replicas (say 8). However, what I observe is that desiredReplicas and currentReplicas increase to 2 and then stay at 2.

Is it possible that currentReplicas does not reflect the current state? My hypothesis comes from the observation that if I set desiredMetricValue to a small value like 1, then with currentMetricValue over 90 I do see desiredReplicas increase to the max replicas.
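
For reference, the HPA object I am using has roughly the following shape; the metric name and Deployment name below are placeholders rather than my exact config:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-service-hpa            # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-service              # placeholder
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: fractional_gpu_metric    # placeholder custom metric
      target:
        type: AverageValue
        averageValue: "90"             # the desiredMetricValue above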

I wish there were an easy way to share the script to reproduce the bug. If that would be helpful, I can try talking to someone.
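
In the meantime, the output of kubectl get hpa <name> --watch and kubectl describe hpa <name> (with <name> being my HPA object) captures the current metric value, the current and desired replica counts, and recent scaling events, so I can share that if it helps.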

klueska commented 5 months ago

Sorry, I don't quite follow your question. Where is this equation defined, and how is it related to nvkind?

chakpongchung commented 5 months ago

Thank you for your reply. The equation is defined in the HPA link: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

I updated the question with a link and more context. Basically, I am running HPA experiments with nvkind on a cluster with 8x A100 GPUs. Kindly let me know if this is still not enough, and I can try to elaborate more.

klueska commented 5 months ago

Is it stuck at 2 because it thinks there are no more GPUs available to hand out? How have you divided GPUs among worker nodes? How many GPUs is each replica consuming?

chakpongchung commented 5 months ago

nvidia-smi shows that 6 GPUs are idle in the cluster of 8x A100 when the current replicas are stuck at 2.

How have you divided GPUs among worker nodes? How many GPUs is each replica consuming?

Is there a way I can print the cluster info or config map to answer this?

klueska commented 5 months ago

As noted in the README, you can run the following to see how they are divided up:

./nvkind cluster print-gpus
klueska commented 5 months ago

It would also be important to see how you are launching your jobs, i.e. what the value of nvidia.com/gpu: <count> is for your workloads.

chakpongchung commented 5 months ago

nvkind cluster print-gpus

$ nvkind cluster print-gpus
[
    {
        "node": "nvkind-chm87-worker",
        "gpus": [
            {
                "Index": "0",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-b1754135-2e06-339e-3067-f2d365e64847"
            },
            {
                "Index": "1",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-d9f0d726-9781-3004-eae1-8bb9ede8e64b"
            },
            {
                "Index": "2",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-425356bf-c1be-0b28-16c6-3ff493d9cb2a"
            },
            {
                "Index": "3",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-1c892074-1dc3-0e14-7838-365b36003d39"
            },
            {
                "Index": "4",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-c5e7a5d1-393d-52e2-ea37-4f3667deae2b"
            },
            {
                "Index": "5",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-988fb4ac-6536-48f5-75a6-785a3aa0cb38"
            },
            {
                "Index": "6",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-b832b430-4f61-2c9c-638e-6f8c33f417d7"
            },
            {
                "Index": "7",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-3576ce9a-b900-8aec-829f-5d447920ef6f"
            }
        ]
    }
]

i.e. what the value of nvidia.com/gpu: <count> is for your workloads.

        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
klueska commented 5 months ago

I'm not familiar with HPA, but everything looks good from the perspective of nvkind. Is it not doing what's expected because you only have a single worker node? Would it make more sense to have one GPU per node? (There are instructions on how to do this in the nvkind README.)

chakpongchung commented 5 months ago

Is it not doing what's expected because you only have a single worker node?

The machine I am using has 8x A100 (on the same motherboard, I believe). I believe this is what you call a single node?

Does it make more sense to have one GPU per node

It does not seem like I can change this, as the cluster I am using is from Lambda Labs.

So it looks like the node spec I am using (a single node with multiple GPUs on it) falls outside the intended use cases of nvkind? Perhaps that is the reason why currentReplicas = 1 in this setup, as we only have a single node here? If that is the case, is it something we can change in nvkind so it can adapt to this node spec?

klueska commented 5 months ago

I meant running nvkind as:

./nvkind cluster create \
--name=evenly-distributed-1-by-8 \
--config-template=examples/equally-distributed-gpus.yaml \
--config-values=- \
<<EOF
numWorkers: 8
EOF

This will create 8 separate worker nodes in your kind cluster (each with 1 GPU), on top of your single physical node.
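
If the split worked, you should see 8 worker nodes (plus the control plane) and exactly one GPU listed per worker, e.g. via:

kubectl get nodes
./nvkind cluster print-gpus

The worker names will look like nvkind-<id>-worker, nvkind-<id>-worker2, and so on (the exact <id> differs per cluster).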

klueska commented 5 months ago

So it looks like the node spec I am using (a single node with multiple GPUs on it) falls outside the intended use cases of nvkind? Perhaps that is the reason why currentReplicas = 1 in this setup, as we only have a single node here? If that is the case, is it something we can change in nvkind so it can adapt to this node spec?

It's perfectly valid to run nvkind as you are doing. I just don't know enough about HPA to know if it supports this.

chakpongchung commented 5 months ago

Does the output here look good to you? It does not seem to increase the number of nodes; e.g. I only see one node:

"node": "nvkind-85kqx-worker"
ubuntu@138-2-2-161:$ nvkind cluster create --config-template=single-cluster.yaml --config-values=- \            
<<EOF
numWorkers: 8
EOF
Creating cluster "nvkind-85kqx" ...
 βœ“ Ensuring node image (kindest/node:v1.29.2) πŸ–Ό
 βœ“ Preparing nodes πŸ“¦ πŸ“¦  
 βœ“ Writing configuration πŸ“œ 
 βœ“ Starting control-plane πŸ•ΉοΈ 
 βœ“ Installing CNI πŸ”Œ 
 βœ“ Installing StorageClass πŸ’Ύ 
 βœ“ Joining worker nodes 🚜 
Set kubectl context to "kind-nvkind-85kqx"
You can now use your cluster with:

kubectl cluster-info --context kind-nvkind-85kqx

Have a nice day! πŸ‘‹
...
...
...
time="2024-06-12T21:20:46Z" level=info msg="It is recommended that containerd daemon be restarted."
ubuntu@138-2-2-161:~$ nvkind cluster print-gpus
[
    {
        "node": "nvkind-85kqx-worker",
        "gpus": [
            {
                "Index": "0",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-5fea1621-cbbd-4044-8f65-704e9137a7e0"
            },
            {
                "Index": "1",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-e9b5394f-f9c4-deda-ae60-95d1960b8f04"
            },
            {
                "Index": "2",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-483abd92-f950-4264-745d-d6de2c0085d7"
            },
            {
                "Index": "3",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-593f8fb0-b4fe-6ad3-ced4-ac96f9fc1d4d"
            },
            {
                "Index": "4",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-e2a71e73-895b-a4ae-7ee9-167e18af1063"
            },
            {
                "Index": "5",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-1690085b-79a9-eb18-775f-e84117a0b64f"
            },
            {
                "Index": "6",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-173f4dd3-58f4-a424-8c5a-acc7583ebcd2"
            },
            {
                "Index": "7",
                "Name": "NVIDIA A100-SXM4-40GB",
                "UUID": "GPU-26ac79a0-1a0a-0273-5e6b-fae463253058"
            }
        ]
    }
]
klueska commented 5 months ago

That's using a different config file than mine. If you have a custom one, you need to adapt it to include something similar to the one I referenced in order to generate multiple worker nodes.

chakpongchung commented 5 months ago

I have modified the single-cluster.yaml in my example to be the following:

nvkind cluster create --config-template=single-cluster.yaml --config-values=config_value.txt
{{- $gpus_per_worker := div numGPUs $.numWorkers }}
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
{{- if hasKey $ "name" }}
name: {{ $.name }}
{{- end }}
nodes:
- role: control-plane
  {{- if hasKey $ "image" }}
  image: {{ $.image }}
  {{- end }}
{{- range $worker := until $.numWorkers }}
- role: worker
  {{- if hasKey $ "image" }}
  image: {{ $.image }}
  {{- end }}

  {{- $gpu_beg_id := mul $worker $gpus_per_worker | int }}
  {{- $gpu_end_id := add $gpu_beg_id $gpus_per_worker | int }}
  {{- $gpus := untilStep $gpu_beg_id $gpu_end_id 1 }}
  extraMounts:
    # We inject all NVIDIA GPUs using the nvidia-container-runtime.
    # This requires `accept-nvidia-visible-devices-as-volume-mounts = true` be set
    # in `/etc/nvidia-container-runtime/config.toml`
    {{- range $gpu := $gpus }}
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
    - hostPath: /tmp/models/
      containerPath: /data/models/
    {{- end }}
  # this is for gateway service, make it exposed to public
  extraPortMappings:
  - containerPort: 30950
    hostPort: 4000
  # Reserved for prometheus & grafana ports. Update service with NodePort 30090 or 30030 if you need to debug the metrics
  - containerPort: 30090
    hostPort: 9090
  - containerPort: 30030
    hostPort: 3000
{{- end }}

It looks like the placement of the last {{- end }} is the culprit here, given the following error. Or should there be one extra node just for the extraPortMappings? The command I used:

nvkind cluster create --config-template=single-cluster.yaml --config-values=- \            
<<EOF
numWorkers: 8
EOF
ERROR: failed to create cluster: command "docker run --name nvkind-fbxhf-worker8 --hostname nvkind-fbxhf-worker8 --label io.x-k8s.kind.role=worker --privileged --security-opt seccomp=unconfined --security-opt apparmor=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro -e KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER --detach --tty --label io.x-k8s.kind.cluster=nvkind-fbxhf --net kind --restart=on-failure:1 --init=false --cgroupns=private --volume=/dev/null:/var/run/nvidia-container-devices/all --volume=/tmp/models/:/data/models/ --publish=0.0.0.0:4000:30950/TCP --publish=0.0.0.0:9090:30090/TCP --publish=0.0.0.0:3000:30030/TCP kindest/node:v1.29.2@sha256:51a1434a5397193442f0be2a297b488b6c919ce8a3931be0ce822606ea5ca245" failed with error: exit status 125
Command Output: 3ec27ac21ad67cea2a8f37ccf51691f51c684fef9f717c0dca8de0e01ebe7f9b
docker: Error response from daemon: driver failed programming external connectivity on endpoint nvkind-fbxhf-worker8 (f65a5e3794c3f876c9ba8ed9dc74f10cd22bbc4e8af17295c84a10bc77a5388c): Bind for 0.0.0.0:4000 failed: port is already allocated.
F0612 22:31:02.669203  188351 main.go:45] Error: creating cluster: executing command: exit status 1

What is the correct way to write it?

klueska commented 5 months ago

The error is because you are trying to start each worker node with the same host port. You need a different host port for each worker node, the same way you need a different GPU for each worker node. Also, this is wrong:

containerPath: /var/run/nvidia-container-devices/all

It needs to be:

containerPath: /var/run/nvidia-container-devices/{{ $gpu }}

so that each worker gets a different GPU.
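
Putting both fixes together, the per-worker part of your template would look something like the sketch below. Offsetting each hostPort by the worker index is just one way to avoid the collision; any scheme that gives every worker unique host ports works:

  extraMounts:
    {{- range $gpu := $gpus }}
    # one device mount per GPU assigned to this worker (instead of "all")
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/{{ $gpu }}
    {{- end }}
    # the models mount only needs to appear once per worker
    - hostPath: /tmp/models/
      containerPath: /data/models/
  extraPortMappings:
  # example offsets: unique host ports per worker so docker's --publish does not collide
  - containerPort: 30950
    hostPort: {{ add 4000 $worker }}
  - containerPort: 30090
    hostPort: {{ add 9090 $worker }}
  - containerPort: 30030
    hostPort: {{ add 3000 $worker }}

(Alternatively, since a NodePort service is reachable through any node, you could keep the extraPortMappings on just one of the workers.)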