Open andy108369 opened 6 months ago
It's possible that provider 0.5.12 fixed this, but that is yet to be confirmed.
@andy108369 will validate in the near future; not a critical matter.
It just happened to me.
Sometimes when you shut down the provider it can show wrong GPU values. Whenever you shut down a node for more than ~5 minutes, it may fail to re-label the node and then shows 0 GPUs. A normal restart won't cause this; the node has to stay down for about 5 minutes. The node then isn't labeled properly, which causes the 0 GPU value. So after each shutdown I have to verify that the node is actually labeled; if it isn't, simply bouncing the inventory pod fixes it.
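For reference, this is roughly the check-and-bounce routine (a sketch; it assumes the inventory operator runs as a deployment named operator-inventory in the akash-services namespace, matching the charts listed below):

```shell
# Verify the node was re-labeled after coming back up; missing Akash GPU
# labels on the node mean the provider will keep reporting 0 GPUs
kubectl get nodes --show-labels | grep -i akash

# If the labels are missing, bounce the inventory operator so it re-labels the node
kubectl -n akash-services rollout restart deployment/operator-inventory
```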
Here are my versions:
```
root@node1:~# helm list -A
NAME                     NAMESPACE             REVISION  UPDATED                                  STATUS    CHART                           APP VERSION
akash-hostname-operator  akash-services        4         2024-04-16 14:30:27.443699966 +0000 UTC  deployed  akash-hostname-operator-9.1.3   0.5.13
akash-node               akash-services        2         2024-04-16 14:36:42.381124574 +0000 UTC  deployed  akash-node-9.0.3                0.32.3
akash-provider           akash-services        16        2024-04-16 14:31:46.980977588 +0000 UTC  deployed  provider-9.2.6                  0.5.13
ingress-nginx            ingress-nginx         3         2024-03-05 12:54:08.500925969 +0000 UTC  deployed  ingress-nginx-4.10.0            1.10.0
inventory-operator       akash-services        5         2024-04-16 14:30:35.530259476 +0000 UTC  deployed  akash-inventory-operator-9.1.3  0.5.13
nvdp                     nvidia-device-plugin  3         2024-03-05 14:26:41.729594744 +0000 UTC  deployed  nvidia-device-plugin-0.14.5     0.14.5
```
Looks like it still happens.
I think the issue is primarily related to the k8s-device-plugin limitations:

> This functionality is not production ready and includes a number of known issues including:
> - The device plugin may show as started before it is ready to allocate shared GPUs while waiting for the CUDA MPS control daemon to come online.

Source: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0

Yet, I think operator-inventory could introduce some sort of workaround for that.
The issue still persists after node reboots, when nvdp hasn't fully initialized yet but operator-inventory starts querying it too early:

provider 0.6.2, akash 0.36.0

Currently, operator-inventory reports 0 GPUs upon first install or after a server reboot unless it is manually restarted. I think operator-inventory should wait for nvdp to fully initialize before deciding on the GPU count.
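As a rough illustration of that "wait for nvdp" idea, here is a shell equivalent of the polling the operator (or an install/boot script) could do: hold off until the kubelet advertises nvidia.com/gpu capacity, then let the inventory operator re-run discovery. The node name, timings, and the operator-inventory deployment name are assumptions, not confirmed values:

```shell
# Poll for up to ~5 minutes until nvdp has registered GPU capacity with the kubelet
NODE=node1
for i in $(seq 1 30); do
  gpus=$(kubectl get node "$NODE" -o jsonpath='{.status.capacity.nvidia\.com/gpu}')
  if [ -n "$gpus" ] && [ "$gpus" -gt 0 ]; then
    echo "nvdp ready: $gpus GPU(s) advertised"
    break
  fi
  echo "nvidia.com/gpu capacity not advertised yet, retrying ($i/30)..."
  sleep 10
done

# Re-run GPU discovery once the capacity shows up
kubectl -n akash-services rollout restart deployment/operator-inventory
```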
Damir and I have been deploying ~5 providers with provider v0.5.11 and have generally been working with the providers quite extensively over this week.
We have noticed that operator-inventory would almost always report 0 GPUs upon first provider install or after a server reboot.
The workaround is to simply bounce it, e.g.:
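Something like the following, assuming the deployment is named operator-inventory as in recent charts:

```shell
kubectl -n akash-services rollout restart deployment/operator-inventory
```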
The bottom line is that operator-inventory should be more robust, i.e. if it doesn't detect the GPUs upon first run or after a server reboot, it should not wait for an admin to kick it.
The "first install" case can be explained by ordering: operator-inventory gets installed first, before we install the nvdp (nvidia-device-plugin).
The reboot case is likely the same: the nvdp plugin most likely can't initialize and detect the GPUs in time, while operator-inventory has already been initialized.
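One way to sanity-check this ordering hypothesis after a reboot is to compare when each container last started (namespaces as in the helm output above); if the inventory pod came back up before the nvdp pod, the race is plausible:

```shell
# Print each pod's last container start time to see which came up first
for ns in akash-services nvidia-device-plugin; do
  kubectl -n "$ns" get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.running.startedAt}{"\n"}{end}'
done
```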