AMD Support - Githubissues

anilmurty commented 8 months ago

Support for AMD GPUs on Akash Network. There may not be any significant work necessary but first step is to test with an AMD GPU(s). This is very important because AMD is working on the MI 250 chipset which is expected to be a serious contender to Nvidia A100 and H100 chips. Here is a blog from MosaicML benchmarking and comparing its performance with Nvidia's chips: https://www.mosaicml.com/blog/amd-mi250

It seems like the initial work is validating whether the kubernetes device plugin for AMD can work for us (the way the Nvidia one has) https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment

Is this something that a community person can help with?

brewsterdrinkwater commented 8 months ago

Oct 24 sync:

Want to add AMD support as a follow on to Nvidia.
Anil is getting access to AMD 210 very soon. Need to make sure AMD plugin works.
Artur mentioned that nothing needs to be done with the Network. Kubernetes device installation is something to work on.
Need to update SDL parser.
Need to update provider to accept AMD as provider.

anilmurty commented 8 months ago

We have access to a cluster that includes a Nvidia L40 and an AMD 210 GPU. @andy108369 is working on testing out setting up a provider with them.

Current status: L40 works out of the box (as expected), AMD does not. Per @troian , we filter on "Nvidia" GPUs in nodes and providers. Artur needs to work on removing this filter and setting up a testnet for Andrey to test with. Removing this filtering likely shouldn't need a network upgrade

brewsterdrinkwater commented 7 months ago

December 5th, 2023:

Filter is there right now. Client needs to be updated.
Artur looking to work on that this week.
will test releases when they come out; first on Sandbox.
DOES NOT need network upgrade.

brewsterdrinkwater commented 6 months ago

December 12th, 2023

SDL part is done.
Working on provider part next
This will be tested by core team over the next couple of days

brewsterdrinkwater commented 6 months ago

December 19th, 2023

RC for provider cut this morning.
Testing is going on right now for AMD GPU support.
WIll test on SDXL app, as well.

Next Steps:

Suggested Path Forward for Validation of AMD GPU Support (can discuss further in eng sync today)
Andy and Scott will spin up providers and test the build process and deployment process.
Documentation will be created after testing is complete.

andy108369 commented 6 months ago

Updates:

provider 0.4.9-rc0 doesn't seem to register the AMD MI 210; despite MI 210 working in a K8s Pod. Details https://github.com/ovrclk/engineering/issues/810#issuecomment-1864190645

andy108369 commented 6 months ago

Test run results

:green_circle: provider version 0.4.9-rc0 (both provider & client): AMD GPU MI210 deployment works! (evidence in private repo atm)
:red_circle: provider: issue with the aggregated GPU count when mixed GPU Vendors (e.g. NVIDIA & AMD) are present on the same worker node (kubectl describe node <node-name> should only report 'nvidia.com/gpuORamd.com/gpuK8s node attribute, otherwise it will only and always see a single GPU or no GPU / Limbo [flapping between0/1gpu count]; and you cannot easily remove K8s node attributes such as 'nvidia.com/gpu, amd.com/gpu as they get stuck in etcd [K8s DB] - the only way is to reinstantiate the node.)

Next steps:

[ ] akash-provider: address the akash-provider issue with the wrong GPU count when mixed GPU Vendors on a node are present;
[x] security/helm-chart: see whether we can deploy that amd-gpu-helm/amd-gpu helm-chart in its own namespace instead of kube-system for better security;
[ ] security/helm-chart: make sure one cannot request more AMD GPU than he should.. refs. HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES (similarly to how it was possible with NVIDIA GPU via NVIDIA_VISIBLE_DEVICES=all env variable, which was addressed here ) ; refs ; Update (Jan/08/2024): - raised a question https://github.com/RadeonOpenCompute/k8s-device-plugin/issues/45
[ ] usability/helm-chart: see if we can have rocm-smi tool by default in the AMD GPU Pod (just like we get nvidia-smi tool in NVIDIA GPU Pod - which is done by the nvidia device plugin by mounting the necessary host paths and is controlled by environment variables such as NVIDIA_DRIVER_CAPABILITIES - more examples/info here) ; Update (Jan/08/2024) - raised a question https://github.com/RadeonOpenCompute/k8s-device-plugin/issues/44
[x] docs: document AMD GPU Akash Provider enablement (preliminary test version of the doc Docs: How to enable AMD GPU support in Akash Provider is in the private repo atm)

andy108369 commented 6 months ago

[x] security/helm-chart: see whether we can deploy that amd-gpu-helm/amd-gpu helm-chart in its own namespace instead of kube-system for better security;

This is possible - requires --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin flags to be specified as follows:

helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0

Verification:

root@node1:~# helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0
NAME: my-amd-gpu
LAST DEPLOYED: Mon Jan  8 12:38:25 2024
NAMESPACE: amd-device-plugin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
amd-gpu-device-plugin-daemonset deployed in namespace 'amd-device-plugin'

brewsterdrinkwater commented 5 months ago

January 16th, 2024:

Need to update documentation.

anilmurty commented 5 months ago

Additional notes: We currently have a limitation (applies to both Nvidia and AMD) where we (K8s) cannot allow mixing of models on the same node. it is fine to mix models on the provider (accross) as long as each node only has GPUs of same model.

brewsterdrinkwater commented 5 months ago

January 23rd:

documentation being worked on this week.

andy108369 commented 5 months ago

pushed the AMD GPU support doc, now available at https://docs.akash.network/other-resources/experimental/amd-gpu-support

[x] I'll go through it once more once I get the access to the AMD GPU box.

akash-network / support

AMD Support #142

Test run results

Next steps: