anilmurty opened this issue 8 months ago
Oct 24 sync:
We have access to a cluster that includes an NVIDIA L40 and an AMD MI210 GPU. @andy108369 is working on testing out setting up a provider with them.
Current status: the L40 works out of the box (as expected); the AMD GPU does not. Per @troian, we filter for "NVIDIA" GPUs in nodes and providers. Artur needs to work on removing this filter and setting up a testnet for Andrey to test with. Removing this filtering likely shouldn't need a network upgrade.
December 5th, 2023:
December 12th, 2023
December 19th, 2023
Next Steps:
Updates:
0.4.9-rc0 (both provider & client): AMD GPU MI210 deployment works! (evidence in private repo atm)

- [ ] GPU count when mixed GPU vendors (e.g. NVIDIA & AMD) are present on the same worker node. `kubectl describe node <node-name>` should only report the `nvidia.com/gpu` OR `amd.com/gpu` K8s node attribute; otherwise it will only and always see a single GPU, or no GPU at all / limbo (flapping between a 0/1 GPU count). You also cannot easily remove K8s node attributes such as `nvidia.com/gpu` / `amd.com/gpu` once set, as they get stuck in etcd (the K8s DB); the only way is to reinstantiate the node.
- [ ] security/helm-chart: see whether we can deploy the `amd-gpu-helm/amd-gpu` helm chart in its own namespace instead of `kube-system` for better security.
- [ ] `HIP_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES` support (similarly to how it was possible with NVIDIA GPUs via the `NVIDIA_VISIBLE_DEVICES=all` env variable, which was addressed here); refs; Update (Jan/08/2024): raised a question: https://github.com/RadeonOpenCompute/k8s-device-plugin/issues/45
- [ ] `rocm-smi` tool by default in the AMD GPU Pod (just like we get the `nvidia-smi` tool in an NVIDIA GPU Pod, which the NVIDIA device plugin does by mounting the necessary host paths, controlled by environment variables such as `NVIDIA_DRIVER_CAPABILITIES`; more examples/info here). Update (Jan/08/2024): raised a question: https://github.com/RadeonOpenCompute/k8s-device-plugin/issues/44
- [ ] Docs: "How to enable AMD GPU support in Akash Provider" (is in the private repo atm)
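The mixed-vendor symptom described above can be checked mechanically. A minimal sketch, using inlined sample resource lines from `kubectl describe node` output (in real usage you would pipe the actual command; the sample values are illustrative only):

```shell
# Flag a worker node that advertises both GPU vendor resources.
# The sample below stands in for `kubectl describe node <node-name>` output.
sample='nvidia.com/gpu:  1
amd.com/gpu:     1'
# Count how many distinct GPU vendor resource lines the node reports.
vendors=$(printf '%s\n' "$sample" | grep -cE '^(nvidia|amd)\.com/gpu')
if [ "$vendors" -gt 1 ]; then
  echo "WARNING: mixed GPU vendors on this node"
fi
```

A node in a healthy state should yield a count of 1 (a single vendor resource).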
- [x] security/helm-chart: see whether we can deploy that amd-gpu-helm/amd-gpu helm-chart in its own namespace instead of kube-system for better security;
This is possible - requires --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin
flags to be specified as follows:
helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0
Verification:
root@node1:~# helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0
NAME: my-amd-gpu
LAST DEPLOYED: Mon Jan 8 12:38:25 2024
NAMESPACE: amd-device-plugin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
amd-gpu-device-plugin-daemonset deployed in namespace 'amd-device-plugin'
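A follow-up sanity check could look like the commands below. This is a sketch that assumes kubectl access to the same cluster and reuses the namespace from the helm release above; `<node-name>` is a placeholder:

```shell
# Confirm the AMD device plugin DaemonSet is running in its dedicated namespace.
kubectl -n amd-device-plugin get daemonset
# Confirm the worker node now advertises the amd.com/gpu resource.
kubectl describe node <node-name> | grep amd.com/gpu
```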
January 16th, 2024:
Additional notes: we currently have a limitation (applies to both NVIDIA and AMD) where we (K8s) cannot allow mixing of GPU models on the same node. It is fine to mix models across nodes on the same provider, as long as each node only has GPUs of the same model.
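The "one GPU model per node" constraint above can be audited from an inventory listing. A minimal sketch, using an inlined sample of (node, GPU model) pairs; in practice these would be derived from node labels or provider inventory (all names here are hypothetical):

```shell
# Find nodes that violate the one-GPU-model-per-node constraint.
pairs='node1 a100
node1 h100
node2 mi210
node2 mi210'
# Deduplicate pairs, then flag any node with more than one distinct model.
violations=$(printf '%s\n' "$pairs" | sort -u |
  awk '{count[$1]++} END {for (n in count) if (count[n] > 1) print n}')
echo "nodes mixing GPU models: ${violations:-none}"
```

Here `node2` listing the same model twice is fine; only `node1`, which carries two distinct models, is flagged.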
January 23rd, 2024:
pushed the AMD GPU support doc, now available at https://docs.akash.network/other-resources/experimental/amd-gpu-support
Support for AMD GPUs on Akash Network. There may not be any significant work necessary, but the first step is to test with AMD GPU(s). This is very important because AMD is working on the MI250 chipset, which is expected to be a serious contender to NVIDIA's A100 and H100 chips. Here is a blog from MosaicML benchmarking and comparing its performance against NVIDIA's chips: https://www.mosaicml.com/blog/amd-mi250
It seems like the initial work is validating whether the Kubernetes device plugin for AMD can work for us (the way the NVIDIA one has): https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment
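The validation step could start with deploying that plugin's DaemonSet. A sketch based on the repo's README at the time; the manifest file name is an assumption, so verify the current path in the repo before use:

```shell
# Deploy the AMD GPU device plugin as a DaemonSet (manifest name taken from
# the RadeonOpenCompute/k8s-device-plugin README; confirm it still exists).
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
```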
Is this something that a community person can help with?