alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

[BUG]Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices. #452

Closed: zyc-bit closed this issue 2 years ago

zyc-bit commented 2 years ago

Hi, I'm not sure whether it's appropriate to file this as a bug, but it has been bothering me for a long time and I haven't been able to fix it.

I'm working on a cluster. Ray can see my GPU, but Alpa cannot. I followed the installation documentation (Install Alpa) and confirmed that I used --enable_cuda when compiling jax-alpa. When I run tests/test_install.py, errors are reported; see the error log attached below for more details.

System information and environment

To Reproduce: I ran

RAY_ADDRESS="10.140.1.112:6379" XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/cuda-11.2" srun -p caif_dev --gres=gpu:1 -w SH-IDC1-10-140-1-112 -n1 bash test_install.sh

and my test_install.sh is:

echo "ray being starting"
ray start --head --node-ip-address 10.140.0.112 --address='10.140.0.112:6379'
echo "succeed==========="
ray status
echo "now running python script"
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/cuda-11.2" python /mnt/cache/zhangyuchang/750/alpa/tests/test_install.py
ray stop

Log

phoenix-srun: Job 126515 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal

ray being starting
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-05-10 19:50:46,487 ERROR services.py:1474 -- Failed to start the dashboard: Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule

2022-05-10 19:50:46,487 ERROR services.py:1475 -- Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
Traceback (most recent call last):
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/_private/services.py", line 1451, in start_dashboard
    raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule

2022-05-10 19:50:20,155 WARN scripts.py:652 -- Specifying --address for external Redis address is deprecated. Please specify environment variable RAY_REDIS_ADDRESS=10.140.1.112:6379 instead.
2022-05-10 19:50:20,155 INFO scripts.py:659 -- Will use `10.140.1.112:6379` as external Redis server address(es). If the primary one is not reachable, we starts new one(s) with `--port` in local.
2022-05-10 19:50:20,155 INFO scripts.py:681 -- The primary external redis server `10.140.1.112:6379` is not reachable. Will starts new one(s) with `--port` in local.
2022-05-10 19:50:20,190 INFO scripts.py:697 -- Local node IP: 10.140.1.112
2022-05-10 19:50:47,423 SUCC scripts.py:739 -- --------------------
2022-05-10 19:50:47,423 SUCC scripts.py:740 -- Ray runtime started.
2022-05-10 19:50:47,423 SUCC scripts.py:741 -- --------------------
2022-05-10 19:50:47,423 INFO scripts.py:743 -- Next steps
2022-05-10 19:50:47,423 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2022-05-10 19:50:47,423 INFO scripts.py:749 --   ray start --address='10.140.1.112:6379'
2022-05-10 19:50:47,423 INFO scripts.py:752 -- Alternatively, use the following Python code:
2022-05-10 19:50:47,423 INFO scripts.py:754 -- import ray
2022-05-10 19:50:47,424 INFO scripts.py:767 -- ray.init(address='auto', _node_ip_address='10.140.1.112')
2022-05-10 19:50:47,424 INFO scripts.py:771 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-05-10 19:50:47,424 INFO scripts.py:775 -- connect to a remote cluster from your laptop directly, use the following
2022-05-10 19:50:47,424 INFO scripts.py:778 -- Python code:
2022-05-10 19:50:47,424 INFO scripts.py:780 -- import ray
2022-05-10 19:50:47,424 INFO scripts.py:786 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-05-10 19:50:47,424 INFO scripts.py:792 -- If connection fails, check your firewall settings and network configuration.
2022-05-10 19:50:47,424 INFO scripts.py:798 -- To terminate the Ray runtime, run
2022-05-10 19:50:47,424 INFO scripts.py:799 --   ray stop
succeed===========
Node status
---------------------------------------------------------------
Healthy:
 1 node_42a384b6d502cd18b6d052e98420df74116e83b73ef8146e44596910
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.0/1.0 GPU
 0.00/804.774 GiB memory
 0.00/186.265 GiB object_store_memory

Demands:
 (no resource demands)
now running python script
2022-05-10 19:53:29.234043: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: UNKNOWN ERROR (34)
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
EE
======================================================================
ERROR: test_1_shard_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 128, in <module>
    runner.run(suite())
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
      File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 80, in test_1_shard_parallel
    actual_state = parallel_train_step(state, batch)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 108, in ret_func
    global_config.memory_budget_per_device, *abstract_args)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/linear_util.py", line 272, in memoized_fun
    ans = call(fun, *args)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 176, in parallelize_callable
    memory_budget_per_device, *avals)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 127, in shard_parallel_callable
    *avals)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 235, in shard_parallel_internal_gradient_accumulation
    backend = xb.get_backend("gpu")
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/monkey_patch.py", line 41, in override_get_backend
    return default_get_backend(*args, **kwargs)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/lib/xla_bridge.py", line 314, in get_backend
    return _get_backend_uncached(platform)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/lib/xla_bridge.py", line 304, in _get_backend_uncached
    raise RuntimeError(f"Backend '{platform}' failed to initialize: "
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 80, in test_1_shard_parallel
    actual_state = parallel_train_step(state, batch)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 108, in ret_func
    global_config.memory_budget_per_device, *abstract_args)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 176, in parallelize_callable
    memory_budget_per_device, *avals)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 127, in shard_parallel_callable
    *avals)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 235, in shard_parallel_internal_gradient_accumulation
    backend = xb.get_backend("gpu")
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/monkey_patch.py", line 41, in override_get_backend
    return default_get_backend(*args, **kwargs)
RuntimeError: Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.

======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 86, in test_2_pipeline_parallel
    ray.init(address="auto")
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/worker.py", line 1072, in init
    connect_only=True,
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/node.py", line 177, in __init__
    self.validate_ip_port(self.address)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/node.py", line 332, in validate_ip_port
    _ = int(port)
ValueError: invalid literal for int() with base 10: '10.140.1.112'

----------------------------------------------------------------------
Ran 2 tests in 3.069s

FAILED (errors=2)

and my Environment Variables are:

CC=/mnt/cache/share/gcc/gcc-7.5.0/bin/gcc-7.5.0/bin/gcc
CONDA_DEFAULT_ENV=alpa_ray_7.5.0
CONDA_EXE=/mnt/cache/share/platform/env/miniconda3.7/bin/conda
CONDA_PREFIX=/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0
CONDA_PROMPT_MODIFIER='(alpa_ray_7.5.0) '
CONDA_PYTHON_EXE=/mnt/cache/share/platform/env/miniconda3.7/bin/python
CONDA_SHLVL=1
CPATH=/mnt/cache/share/cuda-11.2/targets/x86_64-linux/include/:
CUDACXX=/mnt/cache/share/cuda-11.2/bin/nvcc
CUDA_HOME=/mnt/cache/share/cuda-11.2
CUDA_PATH=/mnt/cache/share/cuda-11.2
CUDA_TOOLKIT_ROOT_DIR=/mnt/cache/share/cuda-11.2
CXX=/mnt/cache/share/gcc/gcc-7.5.0/bin/g++
HISTCONTROL=ignoredups
HISTSIZE=50000
HISTTIMEFORMAT='%F %T zhangyuchang '
HOME=/mnt/lustre/zhangyuchang
HOSTNAME=SH-IDC1-10-140-0-32
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/mnt/cache/share/cuda-11.2/lib64:/mnt/cache/share/cuda-11.2/extras/CUPTI/lib64:/mnt/cache/share/gcc/gcc-7.5.0/lib:/mnt/cache/share/gcc/gcc-7.5.0/lib64:/mnt/cache/share/gcc/gcc-7.5.0/include:/mnt/cache/share/gcc/gcc-7.5.0/bin:/mnt/cache/share/gcc/gmp-4.3.2/lib/:/mnt/cache/share/gcc/mpfr-2.4.2/lib/:/mnt/cache/share/gcc/mpc-0.8.1/lib/:/mnt/cache/share/cuda-11.2/targets/x86_64-linux/lib:/mnt/lustre/zhangyuchang/bin:/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0/lib/:/mnt/cache/share/platform/dep/binutils-2.27/lib:/mnt/cache/share/platform/dep/openmpi-4.0.5-cuda11.0/lib:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/lib64:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/lib64/:/mnt/cache/share/platform/env/miniconda3.6/lib:/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0/lib/:/mnt/cache/share/platform/dep/binutils-2.27/lib:/mnt/cache/share/platform/dep/openmpi-4.0.5-cuda11.0/lib:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/lib64:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/lib64/:/mnt/cache/share/platform/env/miniconda3.6/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64/
LESS=-R
LESSOPEN='||/usr/bin/lesspipe.sh %s'
LIBRARY_PATH=/mnt/cache/share/cuda-11.2/lib64:
LOADEDMODULES=''
LOGNAME=zhangyuchang
LSCOLORS=Gxfxcxdxbxegedabagacad
LS_COLORS='rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:'
MAIL=/var/spool/mail/zhangyuchang
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
MODULESHOME=/usr/share/Modules
NCCL_INSTALL_PATH=/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0
NCCL_SOCKET_IFNAME=eth0
PAGER=less
PATH=/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/bin:/mnt/cache/share/cuda-11.2/bin:/mnt/cache/share/gcc/gcc-7.5.0/bin/:/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray/bin:/mnt/cache/share/platform/env/miniconda3.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/mnt/lustre/zhangyuchang/bin:/mnt/lustre/zhangyuchang/bin
PWD=/mnt/cache/zhangyuchang/750/alpa
QT_GRAPHICSSYSTEM_CHECKED=1
SHELL=/usr/local/bash/bin/bash
SHLVL=3
SSH_CLIENT='10.201.32.68 45569 22'
SSH_CONNECTION='10.201.36.3 52001 10.140.0.32 22'
SSH_TTY=/dev/pts/267
TERM=screen
TF_PATH=/mnt/cache/zhangyuchang/750/tensorflow-alpa
TMUX=/tmp/tmux-200000422/default,194688,3
TMUX_PANE=%3
USER=zhangyuchang
XDG_RUNTIME_DIR=/run/user/200000422
XDG_SESSION_ID=680243
ZSH=/mnt/lustre/zhangyuchang/.oh-my-zsh
_=export
zhisbug commented 2 years ago

@zyc-bit Are you using slurm?

@TarzanZhao did you see similar issue when you tried to install Alpa on the slurm cluster?

zyc-bit commented 2 years ago

@zhisbug Thanks for the reply. Yes, I'm using Slurm. I'm connected remotely to a Slurm cluster and don't have sudo rights.

zyc-bit commented 2 years ago

I set TF_CPP_MIN_LOG_LEVEL=0; here is more information:

2022-05-16 10:15:08.771068: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:174] XLA service 0x55e610b5f390 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-05-16 10:15:08.771098: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:182]   StreamExecutor device (0): Interpreter, <undefined>
2022-05-16 10:15:08.825745: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:176] TfrtCpuClient created.
2022-05-16 10:15:11.336151: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: UNKNOWN ERROR (34)
2022-05-16 10:15:11.336788: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: SH-IDC1-10-140-1-112
2022-05-16 10:15:11.337039: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: SH-IDC1-10-140-1-112
2022-05-16 10:15:11.346278: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-05-16 10:15:11.347436: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.32.3
2022-05-16 10:15:11.546337: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
zhisbug commented 2 years ago

It seems this is an XLA + Slurm issue: XLA has trouble loading the CUDA dynamic libraries under Slurm. To confirm, could you try running a JAX/XLA program without Alpa and see whether it works in your environment?
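
For example, a JAX-only check like the following (a minimal sketch; the file name jax_gpu_check.py is just a placeholder, and it assumes the CUDA-enabled jaxlib you built for Alpa), run under the same srun allocation, would tell us whether XLA alone can see the GPU:

# jax_gpu_check.py -- minimal JAX-only GPU visibility check (no Alpa).
import jax

# Prints "gpu" if the CUDA backend initialized, otherwise "cpu".
print("default backend:", jax.default_backend())

# Lists the devices XLA can see; this raises RuntimeError if the gpu
# backend failed to initialize, reproducing the error in test_install.py.
print("gpu devices:", jax.devices("gpu"))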

TarzanZhao commented 2 years ago

I searched my personal notes and have not encountered this error before.

zyc-bit commented 2 years ago

@zhisbug I ran a simple JAX program, and it reported:

$ TF_CPP_MIN_LOG_LEVEL=0 srun -p caif_dev --gres=gpu:1 -n1 python jaxtest.py
phoenix-srun: Job 595975 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal

2022-05-16 14:38:13.011917: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-05-16 14:38:30.079013: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-05-16 14:38:52.076289: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:174] XLA service 0x55866d932fe0 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-05-16 14:38:52.076320: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:182]   StreamExecutor device (0): Interpreter, <undefined>
2022-05-16 14:38:52.116679: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:176] TfrtCpuClient created.
2022-05-16 14:38:52.420837: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: UNKNOWN ERROR (34)
2022-05-16 14:38:52.421601: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: SH-IDC1-10-140-1-1
2022-05-16 14:38:52.421819: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: SH-IDC1-10-140-1-1
2022-05-16 14:38:52.427862: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-05-16 14:38:52.428990: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.32.3
2022-05-16 14:38:52.431424: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

So the problem really has nothing to do with Alpa. It looks like XLA has trouble loading the CUDA dynamic libraries under Slurm. But as the LD_LIBRARY_PATH above shows, I've already added every path I can think of. Any suggestions on the paths? Or could the installed jax be the CPU version?

zhisbug commented 2 years ago

I think your JAX/JAXLIB versions are correct (as long as you follow our installation guide).

When you request a job, Slurm might have trouble finding the right CUDA path. Do you know the administrator who manages that Slurm cluster? Every Slurm setup installs CUDA differently.

A second way to debug: instead of asking Slurm to run the job, request an interactive bash session and try launching manually there; that may help you locate the correct CUDA paths.
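
Since the XLA log above says it was unable to find libcuda.so, a small probe like this, run inside the interactive session, can tell you whether the driver library is loadable from your environment at all (a sketch; the names below are the usual driver library SONAMEs, adjust if your cluster differs):

# cuda_probe.py -- check whether the CUDA driver library can be loaded
# from the current environment (run inside the interactive srun session).
import ctypes
import os

print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))

for name in ("libcuda.so.1", "libcuda.so"):
    try:
        lib = ctypes.CDLL(name)
    except OSError as err:
        print(f"{name}: not loadable ({err})")
        continue
    # cuInit(0) is the same driver call XLA makes; a return value of 0
    # means CUDA_SUCCESS.
    print(f"{name}: loaded, cuInit(0) returned", lib.cuInit(0))
    break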

zyc-bit commented 2 years ago

@zhisbug Thanks for the reply. I followed your installation guide, so the JAX version should be correct. (By the way, I don't see jaxlib in my conda list. Should it be installed as well?) Our Slurm cluster's CUDA path is /mnt/cache/share/cuda-11.2/. I googled which subpaths should be added to LD_LIBRARY_PATH, but it didn't work; maybe I should search more. I will also try the second approach you mentioned above. Once I have solved the problem, I will report back under this issue. Thank you again for taking the time to answer my questions.

zhisbug commented 2 years ago

When you do step 3 of this guide (https://alpa-projects.github.io/install/from_source.html#install-from-source), jaxlib is compiled and installed.

If you are testing with a JAX-only environment (without Alpa), make sure to follow the instructions here (https://github.com/google/jax#pip-installation-gpu-cuda) to install jaxlib.

jaxlib is required to run JAX, with or without Alpa; it is the backend of JAX. And you need the CUDA version of jaxlib.
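
A quick way to check which jax/jaxlib you ended up with and whether the CUDA backend is active (a minimal sketch, assuming both packages import cleanly in your environment):

# Sanity-check the installed jax/jaxlib and the active backend.
import jax
import jaxlib

print("jax    :", jax.__version__)
print("jaxlib :", jaxlib.__version__)
# "gpu" means a CUDA-enabled jaxlib initialized successfully; "cpu" means
# either a CPU-only jaxlib or a failed CUDA initialization.
print("backend:", jax.default_backend())
print("devices:", jax.devices())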

No problem. Feel free to keep us updated on your exploration.

zyc-bit commented 2 years ago

@TarzanZhao Hi, sorry to bother you. Could you tell me the versions of CUDA, cuDNN, NCCL, and perhaps OpenMPI and other tools? (Please include as many as possible.) It would help me tell my Slurm cluster administrator which versions to install. I plan to ask the administrator to install CUDA 11.2 and cuDNN 8.1. Are there any other version constraints I should pay attention to? Looking forward to your reply.

zhuohan123 commented 2 years ago

@zyc-bit In my AWS environment I use CUDA 11.1 and cuDNN 8.1. Also, please make sure your GPU driver version is >= 460.xx.xx. Those are all the library version constraints I can think of.
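
If it helps, the driver version can be read directly on the compute node with nvidia-smi (a sketch; run it under srun so it executes on the GPU node):

# driver_check.py -- print the GPU driver version reported by nvidia-smi.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
# Expect a value >= 460.xx.xx for CUDA 11.x.
print("driver version(s):", out.stdout.strip())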

zyc-bit commented 2 years ago

@zhuohan123 Thanks a lot for replying. This really helped me. Thank you again.

@zhisbug I solved the "no visible GPU" problem I mentioned above. The cause is that when compiling jaxlib, Slurm users like me must compile under an srun command, that is, on a node with a GPU. (I previously thought compiling on a CPU-only node would be enough, but on Slurm it seems you must compile with a GPU visible.)

The solution above may serve as a reference for other Slurm users.

Now I've run into a new problem, which may be Ray-related. (Please forgive me for asking it here; after googling, I couldn't find a relevant Ray answer.) If this question doesn't fit under this issue, please let me know, and if it is not caused by Alpa I will ask in the Ray project or community. (screenshots attached: ray_wrong, ray_wrong2)

zhisbug commented 2 years ago

One possibility: your srun command is not requesting enough CPU threads for Ray to work.
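
One way to check is to query the resources Ray actually detected from Python (a minimal sketch, assuming the head node started by your test_install.sh is still running):

# ray_resources.py -- show the CPU/GPU resources Ray detected.
import ray

ray.init(address="auto")
print("total resources    :", ray.cluster_resources())
print("available resources:", ray.available_resources())
ray.shutdown()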

zyc-bit commented 2 years ago

@zhisbug Thank you for your reply, and I apologize for my late response. According to our Slurm cluster setup, once a task is dispatched to a compute node it can use all of the node's CPUs, so the number of CPU threads shouldn't be the problem. But I will keep troubleshooting in this direction. It could be an issue with the still-active process starting with (raylet) in the screenshot. What kind of process is that? Is it a Ray process? How do I find it?

zyc-bit commented 2 years ago

I found the solution. The error was caused by the cluster's proxy settings, so Slurm cluster users should check their own proxy configuration before using Alpa on their cluster (a quick check is sketched below). Maybe you could remind Slurm users of this in your docs. I'm finally going to start using Alpa; if other problems come up, I'll open another issue. So this issue can be closed. Thank you all. @zhisbug @zhuohan123 @TarzanZhao
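
For other Slurm users hitting this, a quick way to spot a stray proxy configuration before starting Ray (a sketch; the variable names below are the common ones, your cluster may use others):

# proxy_check.py -- list proxy-related environment variables that can
# interfere with Ray's node-to-node connections on a cluster.
import os

for var in ("http_proxy", "https_proxy", "all_proxy", "no_proxy",
            "HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY", "NO_PROXY"):
    value = os.environ.get(var)
    if value:
        print(f"{var}={value}")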