shimomut opened this issue 2 weeks ago (status: Open)
Looks like the NVIDIA GPU device plugin is missing, or you didn't specify the number of devices in your YAML, like so:
resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
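If it helps, here is a rough sketch for checking a manifest for that limit before submitting it. It assumes the pod spec has already been parsed into a Python dict (for example with a YAML loader); the function name and the example spec are hypothetical and not part of the FSDP sample.

```python
def containers_without_gpu_limit(pod_spec):
    """Return the names of containers that set no nvidia.com/gpu limit."""
    missing = []
    for container in pod_spec.get("containers") or []:
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        # The count may appear as an int (1) or a string ("1") in a manifest.
        if int(limits.get("nvidia.com/gpu", 0)) < 1:
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical pod spec, already parsed from YAML into a dict.
example_spec = {
    "containers": [
        {"name": "trainer", "resources": {"limits": {"nvidia.com/gpu": 1}}},
        {"name": "sidecar"},
    ]
}

print(containers_without_gpu_limit(example_spec))  # ['sidecar']
```

A container without the limit is not necessarily a bug (a sidecar may not need a GPU), so treat the result as a list to review rather than an error.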
Reproduced the error with 2 x g5.8xlarge nodes. I believe the root cause is: "system has unsupported display driver / cuda driver combination"
Related to #475
@shimomut please add a link to the test case. There are multiple FSDP test cases in the repo.
This is the FSDP sample where I hit the issue: https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/README-EKS.md
Updated the original post as well.
Removing the CUDA compat package will likely resolve this.
When running the FSDP sample app on a HyperPod EKS cluster, I got this error.
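One way to see whether compat libraries are being picked up inside the container is to look at the library search path. This is only a diagnostic sketch under an assumption: that the compat libraries are exposed through `LD_LIBRARY_PATH` in a directory named `compat` (for example `/usr/local/cuda/compat`). The helper name is hypothetical, and it does not remove anything.

```python
import os


def compat_dirs(library_path):
    """Return search-path entries that have a path component named 'compat'."""
    entries = [p for p in library_path.split(":") if p]
    # Match on whole path components so e.g. '/opt/compatibility' is ignored.
    return [p for p in entries if "compat" in p.split("/")]


if __name__ == "__main__":
    # Inspect the current environment; an empty list means no match was found.
    print(compat_dirs(os.environ.get("LD_LIBRARY_PATH", "")))
```

If this prints a non-empty list, those entries are candidates to check against the host driver version before deciding whether to remove the package.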
Cluster spec: