NCCL Error when running the Jax LWM-Chat-32K-Jax #44

Closed yfb-xieyu closed 4 months ago

yfb-xieyu commented 4 months ago

Environment

GPUs: 8x4090

Package Version

Error Message

I0223 22:58:05.579038 140312230876992] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
I0223 22:58:05.579842 140312230876992] Unable to initialize backend 'tpu': INTERNAL: Failed to open cannot open shared object file: No such file or directory
100%|██████████| 1/1 [00:09<00:00, 9.21s/it]
2024-02-23 23:00:08.992159: W external/xla/xla/service/gpu/runtime/] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.
2024-02-23 23:00:08.992208: W external/xla/xla/service/gpu/runtime/] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.
2024-02-23 23:00:08.992237: W external/xla/xla/service/gpu/runtime/] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.
2024-02-23 23:00:08.992261: W external/xla/xla/service/gpu/runtime/] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.
2024-02-23 23:00:08.992281: W external/xla/xla/service/gpu/runtime/] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.
2024-02-23 23:00:08.992348: W external/xla/xla/service/gpu/runtime/] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.
2024-02-23 23:00:08.992392: W external/xla/xla/service/gpu/runtime/] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.
2024-02-23 23:00:08.992430: W external/xla/xla/service/gpu/runtime/] Intercepted XLA runtime error: INTERNAL: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.
2024-02-23 23:00:08.992459: E external/xla/xla/pjrt/] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.
2024-02-23 23:00:08.992472: E external/xla/xla/pjrt/] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.
2024-02-23 23:00:08.992483: E external/xla/xla/pjrt/] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.
2024-02-23 23:00:08.992499: E external/xla/xla/pjrt/] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.
2024-02-23 23:00:08.992510: E external/xla/xla/pjrt/] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.
2024-02-23 23:00:08.992522: E external/xla/xla/pjrt/] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.
2024-02-23 23:00:08.992536: E external/xla/xla/pjrt/] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.
2024-02-23 23:00:08.992551: E external/xla/xla/pjrt/] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/users/yu01.xie/software/anaconda3/envs/lwm/lib/python3.10/", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/users/yu01.xie/software/anaconda3/envs/lwm/lib/python3.10/", line 86, in _run_code
    exec(code, run_globals)
  File "/home/users/yu01.xie/project/mllm/LWM-main/lwm/", line 254, in <module>
    run(main)
  File "/home/users/yu01.xie/software/anaconda3/envs/lwm/lib/python3.10/site-packages/absl/", line 308, in run
    _run_main(main, args)
  File "/home/users/yu01.xie/software/anaconda3/envs/lwm/lib/python3.10/site-packages/absl/", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/users/yu01.xie/project/mllm/LWM-main/lwm/", line 250, in main
    output = sampler(prompts, FLAGS.max_n_frames)[0]
  File "/home/users/yu01.xie/project/mllm/LWM-main/lwm/", line 230, in __call__
    output, self.sharded_rng = self._forward_generate(
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.reduce_scatter' failed: external/xla/xla/service/gpu/ NCCL operation ncclGetUniqueId(&id) failed: Unable to load NCCL library. Multi-GPU collectives will not work.. Last NCCL warning(error) log entry (may be unrelated) 'Unable to load NCCL library. Multi-GPU collectives will not work.'.; current tracing scope: reduce-scatter-start.4; current profiling annotation: XlaModule:#hlo_module=pjit_fn,program_id=50#.: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
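
The repeated "Unable to load NCCL library" message suggests the NCCL shared object could not be opened at run time. A minimal sketch for checking this from the same Python environment is below. The library names `libnccl.so.2` and `libnccl.so` are assumptions about what the loader looks for, and `find_loadable` is a hypothetical helper, not part of JAX or LWM.

```python
import ctypes
import ctypes.util


def find_loadable(names):
    """Return the first library name that the dynamic loader can open, else None."""
    for name in names:
        if not name:  # find_library() returns None when nothing is found
            continue
        try:
            ctypes.CDLL(name)
            return name
        except OSError:
            continue
    return None


# Candidate names are assumptions; adjust them for your system.
candidates = ["libnccl.so.2", "libnccl.so", ctypes.util.find_library("nccl")]
found = find_loadable(candidates)
if found:
    print(f"NCCL is loadable as: {found}")
else:
    print("NCCL could not be loaded; check LD_LIBRARY_PATH or the installed NCCL package")
```

If this prints that NCCL could not be loaded, the failure is likely in the environment rather than in the LWM code itself.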

yfb-xieyu commented 4 months ago

Sorry, the package version should be:

absl-py 2.1.0
aiohttp 3.9.3
aiosignal 1.3.1
appdirs 1.4.4
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
cachetools 5.3.2
certifi 2024.2.2
charset-normalizer 3.3.2
chex 0.1.82
click 8.1.7
cloudpickle 3.0.0
contextlib2 21.6.0
datasets 2.13.0
decorator 5.1.1
decord 0.6.0
dill 0.3.6
docker-pycreds 0.4.0
einops 0.7.0
etils 1.7.0
exceptiongroup 1.2.0
executing 2.0.1
filelock 3.13.1
flax 0.7.0
frozenlist 1.4.1
fsspec 2024.2.0
gcsfs 2024.2.0
gitdb 4.0.11
GitPython 3.1.42
google-api-core 2.17.1
google-auth 2.28.0
google-auth-oauthlib 1.2.0
google-cloud-core 2.4.1
google-cloud-storage 2.14.0
google-crc32c 1.5.0
google-resumable-media 2.7.0
googleapis-common-protos 1.62.0
huggingface-hub 0.20.3
idna 3.6
imageio 2.34.0
imageio-ffmpeg 0.4.9
importlib-resources 6.1.1
ipdb 0.13.13
ipython 8.21.0
jax 0.4.23
jaxlib 0.4.23+cuda11.cudnn86
jedi 0.19.1
markdown-it-py 3.0.0
matplotlib-inline 0.1.6
mdurl 0.1.2
ml-collections 0.1.1
ml-dtypes 0.3.2
msgpack 1.0.7
multidict 6.0.5
multiprocess 0.70.14
nest-asyncio 1.6.0
numpy 1.26.4
nvidia-cublas-cu11
nvidia-cuda-nvcc-cu11 11.8.89
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11
nvidia-cufft-cu11
nvidia-cusolver-cu11
nvidia-cusparse-cu11
oauthlib 3.2.2
opt-einsum 3.3.0
optax 0.1.7
orbax-checkpoint 0.5.3
packaging 23.2
pandas 2.2.0
parso 0.8.3
pexpect 4.9.0
pillow 10.2.0
pip 23.3.1
prompt-toolkit 3.0.43
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 15.0.0
pyasn1 0.5.1
pyasn1-modules 0.3.0
Pygments 2.17.2
python-dateutil 2.8.2
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
requests-oauthlib 1.3.1
rich 13.7.0
rsa 4.9
scipy 1.12.0
sentencepiece 0.2.0
sentry-sdk 1.40.5
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
stack-data 0.6.3
tensorstore 0.1.53
tiktoken 0.6.0
tokenizers 0.13.3
tomli 2.0.1
toolz 0.12.1
tqdm 4.66.2
traitlets 5.14.1
transformers 4.29.2
tux 0.0.2
typing_extensions 4.9.0
tzdata 2024.1
urllib3 2.2.1
wandb 0.16.3
wcwidth 0.2.13
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4
zipp 3.17.0
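
The list above contains several `nvidia-*-cu11` wheels but no NCCL entry, which may be related to the load failure. A small sketch for confirming this from Python is below. The distribution names `nvidia-nccl-cu11` and `nvidia-nccl-cu12` are assumptions about which pip packages could supply NCCL, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names below are assumptions; jaxlib is included for comparison.
for name in ("nvidia-nccl-cu11", "nvidia-nccl-cu12", "jaxlib"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If no NCCL distribution is reported, NCCL would have to come from a system-wide install visible to the dynamic loader.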