aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

FSDP sample fails with CUDA initialization error on HyperPod EKS #467

Open shimomut opened 2 weeks ago

shimomut commented 2 weeks ago

When running the FSDP sample app on a HyperPod EKS cluster, I got the following error.

[W CUDAFunctions.cpp:108] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (function operator())
Traceback (most recent call last):
  File "/fsdp/train.py", line 281, in <module>
    main(args)
  File "/fsdp/train.py", line 144, in main
    dist.init_process_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1339, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Cluster spec:

sean-smith commented 2 weeks ago

Looks like the cluster is missing the NVIDIA GPU device plugin, or you didn't specify the number of devices in your YAML, like so:

resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
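
One way to check the second possibility is to read the GPU limit back out of the container spec. The Python sketch below assumes the manifest has already been parsed into a dict (for example with a YAML loader). `gpu_limit` and the sample `container` dict are hypothetical names used for illustration; they are not part of the sample repo.

```python
def gpu_limit(container: dict) -> int:
    """Return the nvidia.com/gpu limit of a container spec, or 0 if unset."""
    # Tolerate a missing or null "resources"/"limits" section.
    resources = container.get("resources") or {}
    limits = resources.get("limits") or {}
    return int(limits.get("nvidia.com/gpu", 0))


# Hypothetical container spec mirroring the YAML fragment above.
container = {"resources": {"limits": {"nvidia.com/gpu": 1}}}
print(gpu_limit(container))
```

A result of 0 would mean the pod never asked for a GPU, which is a different failure from a driver mismatch.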

iankouls-aws commented 2 weeks ago

Reproduced the error with 2 x g5.8xlarge nodes. I believe the root cause is: "system has unsupported display driver / cuda driver combination"
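
A pre-flight check run before `dist.init_process_group()` can make this kind of failure easier to read than the NCCL `ValueError`. The sketch below is only an illustration: `describe_cuda` and `preflight` are hypothetical helpers, and the only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.device_count()`. PyTorch is imported lazily so the snippet also runs where it is not installed.

```python
import importlib.util


def describe_cuda(available: bool, device_count: int) -> str:
    """Summarize whether the NCCL backend can be expected to initialize."""
    if available and device_count > 0:
        return f"ok: {device_count} GPU(s) visible"
    return "no usable GPU: check driver/CUDA versions and nvidia.com/gpu limits"


def preflight() -> str:
    """Query PyTorch for CUDA state, if PyTorch is installed."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # lazy import so the sketch also runs without PyTorch

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    return describe_cuda(available, count)
```

Printing `preflight()` at the top of the training script, before the process group is created, would show whether the container sees any GPU at all.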

mhuguesaws commented 1 week ago

Related to #475

mhuguesaws commented 1 week ago

@shimomut please add link to test case. There are multiple FSDP test cases in the repo.

shimomut commented 1 week ago

This is the FSDP sample where I hit the issue: https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/README-EKS.md

Updated the original post as well.

mhuguesaws commented 1 week ago

Removing the CUDA compat package will likely resolve this.
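
To see whether a compat library directory is on the loader path inside the container, a small check like the one below may help. It is a sketch that assumes the compat libraries live in a directory whose path contains `compat` (for example `/usr/local/cuda/compat`, which is an assumption about the image layout and is not stated in this thread). `find_compat_dirs` is a hypothetical helper.

```python
import os


def find_compat_dirs(ld_library_path: str) -> list[str]:
    """Return LD_LIBRARY_PATH entries whose path mentions 'compat'."""
    # LD_LIBRARY_PATH is colon-separated on Linux; skip empty entries.
    entries = [p for p in ld_library_path.split(":") if p]
    return [p for p in entries if "compat" in p.lower()]


# Inspect the current environment; prints [] when nothing matches.
print(find_compat_dirs(os.environ.get("LD_LIBRARY_PATH", "")))
```

An empty list does not rule the package out, since libraries can also be registered through the dynamic linker configuration rather than `LD_LIBRARY_PATH`.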