akash-network / support

Akash Support and Issue Tracking

AMD Support #142

Open anilmurty opened 8 months ago

anilmurty commented 8 months ago

Support for AMD GPUs on Akash Network. There may not be any significant work necessary, but the first step is to test with one or more AMD GPUs. This is very important because AMD is working on the MI250 chipset, which is expected to be a serious contender to Nvidia's A100 and H100 chips. Here is a blog post from MosaicML benchmarking it and comparing its performance with Nvidia's chips: https://www.mosaicml.com/blog/amd-mi250

It seems like the initial work is validating whether the Kubernetes device plugin for AMD can work for us (the way the Nvidia one has): https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment

Is this something that a community person can help with?

brewsterdrinkwater commented 8 months ago

Oct 24 sync:

anilmurty commented 8 months ago

We have access to a cluster that includes an Nvidia L40 and an AMD 210 GPU. @andy108369 is testing setting up a provider with them.

Current status: the L40 works out of the box (as expected); the AMD GPU does not. Per @troian, we filter on "Nvidia" GPUs in nodes and providers. Artur needs to work on removing this filter and setting up a testnet for Andrey to test with. Removing this filter likely shouldn't need a network upgrade.
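
To illustrate why a vendor filter would hide the AMD card, here is a minimal Python sketch. It is not the actual Akash node or provider code; the node dictionaries, the `filter_gpu_nodes` helper, and the vendor strings are all hypothetical and only mirror the behavior described above.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical sketch only: names and data shapes are assumptions,
# not taken from the Akash codebase.
def filter_gpu_nodes(
    nodes: List[Dict[str, str]],
    allowed_vendors: Optional[Iterable[str]] = ("nvidia",),
) -> List[Dict[str, str]]:
    """Return nodes whose GPU vendor is allowed; None disables the filter."""
    if allowed_vendors is None:
        return list(nodes)
    allowed = {vendor.lower() for vendor in allowed_vendors}
    return [node for node in nodes if node["vendor"].lower() in allowed]


# Toy inventory resembling the test cluster mentioned in this thread.
nodes = [
    {"name": "node1", "vendor": "Nvidia", "model": "L40"},
    {"name": "node2", "vendor": "AMD", "model": "210"},
]

nvidia_only = filter_gpu_nodes(nodes)                      # hard-coded vendor filter
no_filter = filter_gpu_nodes(nodes, allowed_vendors=None)  # filter removed

print([node["name"] for node in nvidia_only])
print([node["name"] for node in no_filter])
```

With the default filter only the Nvidia node survives; with the filter removed, both nodes are returned, which is the outcome the planned change is aiming for.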

brewsterdrinkwater commented 7 months ago

December 5th, 2023:

brewsterdrinkwater commented 6 months ago

December 12th, 2023

brewsterdrinkwater commented 6 months ago

December 19th, 2023

Next Steps:

andy108369 commented 6 months ago

Updates:

andy108369 commented 6 months ago

Test run results

Next steps:

andy108369 commented 6 months ago

  • [x] security/helm-chart: see whether we can deploy that amd-gpu-helm/amd-gpu helm-chart in its own namespace instead of kube-system for better security;

This is possible. It requires the flags --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin, specified as follows:

helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0

Verification:

root@node1:~# helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0
NAME: my-amd-gpu
LAST DEPLOYED: Mon Jan  8 12:38:25 2024
NAMESPACE: amd-device-plugin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
amd-gpu-device-plugin-daemonset deployed in namespace 'amd-device-plugin'

brewsterdrinkwater commented 5 months ago

January 16th, 2024:

anilmurty commented 5 months ago

Additional notes: We currently have a limitation (it applies to both Nvidia and AMD) where we (K8s) cannot allow mixing of GPU models on the same node. It is fine to mix models across the provider as long as each node only has GPUs of the same model.
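
The constraint can be expressed as a small check. This is a hedged Python sketch rather than anything from the provider code: the `nodes_mixing_models` helper, the node names, and the model labels (`gpu-a`, `gpu-b`) are hypothetical placeholders.

```python
from typing import Dict, List

# Hypothetical sketch: flags nodes that violate the "one GPU model per node" rule.
def nodes_mixing_models(gpus_by_node: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return {node: sorted distinct models} for nodes with more than one GPU model."""
    return {
        node: sorted(set(models))
        for node, models in gpus_by_node.items()
        if len(set(models)) > 1
    }


# A provider may mix models across nodes, but not within a single node.
provider = {
    "node1": ["gpu-a", "gpu-a"],  # fine: one model
    "node2": ["gpu-b"],           # fine: a different model, on its own node
    "node3": ["gpu-a", "gpu-b"],  # not allowed: two models on one node
}

offenders = nodes_mixing_models(provider)
print(offenders)
```

An empty result means every node carries a single GPU model, so the provider layout satisfies the limitation described above.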

brewsterdrinkwater commented 5 months ago

January 23rd:

andy108369 commented 5 months ago

Pushed the AMD GPU support doc; it is now available at https://docs.akash.network/other-resources/experimental/amd-gpu-support