akash-network / support

Akash Support and Issue Tracking

operator-inventory should wait for the nvdp to fully init before deciding on the GPU count #207

Open andy108369 opened 6 months ago

andy108369 commented 6 months ago

Currently, operator-inventory reports 0 GPUs upon first install or after a server reboot unless it is manually restarted. I think operator-inventory should wait for nvdp to fully init before deciding on the GPU count.

Damir and I have been deploying ~5 providers with provider v0.5.11 and have been working with them quite extensively this week.

We have noticed that operator-inventory almost always reports 0 GPUs upon first provider install or after a server reboot.

The workaround is to simply bounce it, e.g.:

kubectl rollout restart deployment/operator-inventory -n akash-services
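
A quick way to confirm whether the restart helped is to compare the allocatable GPU count per node against what the provider advertises (a rough sketch; assumes jq is installed and that nvdp exposes GPUs under the standard nvidia.com/gpu resource name):

kubectl get nodes -o json | jq -r '.items[] | .metadata.name + ": " + (.status.allocatable["nvidia.com/gpu"] // "0") + " GPU(s)"'

If this already shows the expected count but operator-inventory still reports 0, bouncing it as above picks the GPUs up.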

The point here is that operator-inventory should be more robust, i.e. if it doesn't detect the GPU upon first run or after a server reboot, it should recover on its own rather than wait for an admin to kick it.

And the "first install" case could be explained as the operator-inventory gets installed first, before we install nvdp (nvidia-device) plugin.

The reboot case is likely the same: the nvdp plugin can't init and detect the GPU in time, while operator-inventory has already initialized.
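
One shape such a workaround could take (a minimal sketch of the idea, not the operator's actual logic): poll until nvdp has advertised at least one nvidia.com/gpu, or a timeout expires, and only then settle on the GPU count. Expressed as a shell loop, with an arbitrary 5-second interval and ~10 minute timeout, assuming jq:

# wait up to ~10 minutes for nvdp to advertise at least one GPU
for i in $(seq 1 120); do
  total=$(kubectl get nodes -o json | jq '[.items[].status.allocatable["nvidia.com/gpu"] // "0" | tonumber] | add // 0')
  [ "$total" -gt 0 ] && break
  sleep 5
done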

andy108369 commented 5 months ago

It's possible that provider 0.5.12 fixed this, but that is yet to be confirmed.

chainzero commented 5 months ago

@andy108369 Will validate in the near future; not a critical matter.

TormenTeDx commented 5 months ago

It just happened to me.

Sometimes when you shut down a provider it can show wrong GPU values. Whenever you shut a node down for more than ~5 minutes, it may not get labeled again and it then shows 0 GPUs. A normal restart won't cause this; you have to keep the node down for about 5 minutes. After that the node isn't labeled properly, which causes the 0 GPU value. So after each shutdown I have to verify whether the node is actually labeled, and if it isn't, bouncing the inventory pod fixes it.
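
For reference, a rough way to check whether a node came back with its GPU labels (the exact akash.network label keys vary by provider version, so this just greps for anything GPU/NVIDIA related; replace node1 with your node name):

kubectl get node node1 --show-labels | tr ',' '\n' | grep -Ei 'gpu|nvidia'

If nothing shows up, bouncing the inventory pod as described above relabels it.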

Here are my versions:

root@node1:~# helm list -A
NAME                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                            APP VERSION
akash-hostname-operator akash-services          4               2024-04-16 14:30:27.443699966 +0000 UTC deployed        akash-hostname-operator-9.1.3    0.5.13     
akash-node              akash-services          2               2024-04-16 14:36:42.381124574 +0000 UTC deployed        akash-node-9.0.3                 0.32.3     
akash-provider          akash-services          16              2024-04-16 14:31:46.980977588 +0000 UTC deployed        provider-9.2.6                   0.5.13     
ingress-nginx           ingress-nginx           3               2024-03-05 12:54:08.500925969 +0000 UTC deployed        ingress-nginx-4.10.0             1.10.0     
inventory-operator      akash-services          5               2024-04-16 14:30:35.530259476 +0000 UTC deployed        akash-inventory-operator-9.1.3   0.5.13     
nvdp                    nvidia-device-plugin    3               2024-03-05 14:26:41.729594744 +0000 UTC deployed        nvidia-device-plugin-0.14.5      0.14.5   
andy108369 commented 3 months ago

Looks like this still happens (screenshot attached).

I think the issue is primarily related to k8s-device-plugin limitations:

This functionality is not production ready and includes a number of known issues including:

  • The device plugin may show as started before it is ready to allocate shared GPUs while waiting for the CUDA MPS control daemon to come online.

Source: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0

Still, I think operator-inventory should introduce some sort of workaround for that.
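
For instance (a hedged sketch: the namespace matches the nvdp install shown earlier in this thread, and the daemonset name assumes the chart's default naming for a release called nvdp), it could wait for, or re-check after, the device plugin daemonset reporting ready before trusting a 0-GPU reading:

kubectl -n nvidia-device-plugin rollout status daemonset/nvdp-nvidia-device-plugin --timeout=300s

Even that would not fully cover the MPS case NVIDIA describes above, where the plugin shows as started before it can allocate, so a delayed re-check of the allocatable count would still help.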

andy108369 commented 1 month ago

The issue still persists after node reboots, when nvdp hasn't fully initialized yet but operator-inventory starts querying it too early:

(screenshot attached)

provider 0.6.2, akash 0.36.0