@zyc-bit Are you using slurm?
@TarzanZhao did you see similar issue when you tried to install Alpa on the slurm cluster?
@zhisbug Thanks for the reply. Yes, I'm using Slurm. I'm remotely connected to a Slurm cluster and don't have sudo rights.
I set TF_CPP_MIN_LOG_LEVEL=0; more information is below:
2022-05-16 10:15:08.771068: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:174] XLA service 0x55e610b5f390 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-05-16 10:15:08.771098: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:182] StreamExecutor device (0): Interpreter, <undefined>
2022-05-16 10:15:08.825745: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:176] TfrtCpuClient created.
2022-05-16 10:15:11.336151: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: UNKNOWN ERROR (34)
2022-05-16 10:15:11.336788: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: SH-IDC1-10-140-1-112
2022-05-16 10:15:11.337039: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: SH-IDC1-10-140-1-112
2022-05-16 10:15:11.346278: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-05-16 10:15:11.347436: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.32.3
2022-05-16 10:15:11.546337: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
It seems this is an XLA + Slurm issue: XLA has trouble loading the CUDA dynamic libs under Slurm. To confirm that, could you try running some JAX/XLA program without Alpa and see if it works in your env?
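For example, a minimal JAX-only check along these lines (just a sketch, not the exact script used later in this thread) would show whether JAX alone can see the GPU:

```python
# minimal_jax_check.py -- hypothetical JAX-only sanity check (not the jaxtest.py
# from this thread); run it under srun the same way you would run Alpa.
import jax
import jax.numpy as jnp

# Should print "gpu" and a list of GPU devices if CUDA is visible to XLA.
print("backend:", jax.default_backend())
print("devices:", jax.devices())

# Small computation to confirm the backend actually executes.
x = jnp.ones((1024, 1024))
print("checksum:", float(jnp.sum(x @ x)))
```

If this also falls back to CPU with the same "failed call to cuInit" message, the problem is in the Slurm/CUDA environment rather than in Alpa.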
I searched my personal notes and have not run into this error before.
@zhisbug I ran a simple JAX program, and it reported:
$ TF_CPP_MIN_LOG_LEVEL=0 srun -p caif_dev --gres=gpu:1 -n1 python jaxtest.py
phoenix-srun: Job 595975 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal
2022-05-16 14:38:13.011917: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-05-16 14:38:30.079013: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-05-16 14:38:52.076289: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:174] XLA service 0x55866d932fe0 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-05-16 14:38:52.076320: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:182] StreamExecutor device (0): Interpreter, <undefined>
2022-05-16 14:38:52.116679: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:176] TfrtCpuClient created.
2022-05-16 14:38:52.420837: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: UNKNOWN ERROR (34)
2022-05-16 14:38:52.421601: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: SH-IDC1-10-140-1-1
2022-05-16 14:38:52.421819: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: SH-IDC1-10-140-1-1
2022-05-16 14:38:52.427862: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-05-16 14:38:52.428990: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.32.3
2022-05-16 14:38:52.431424: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
So the problem really seems to have nothing to do with Alpa. It looks like XLA has trouble loading the CUDA dynamic libs under Slurm. But as for the LD_LIBRARY_PATH I mentioned above, I've already added all the paths I can think of. Any suggestions on the path? Or maybe the jax I installed is the CPU version?
I think your JAX/JAXLIB versions are correct (as long as you followed our installation guide).
When you request a job, Slurm might have trouble finding the right CUDA path. Do you know the administrator who manages that Slurm cluster? Each Slurm setup has its own way of installing CUDA.
A second way to debug: instead of asking Slurm to run the job, could you request an interactive bash session, launch the program manually there, and see if that helps you locate the correct CUDA paths?
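For example, inside such an interactive session you could run a rough check (a sketch only) of whether the CUDA driver library that XLA complains about is loadable at all:

```python
# check_libcuda.py -- rough, hypothetical check of whether libcuda.so.1 can be
# loaded and initialized from the current environment.
import ctypes

try:
    libcuda = ctypes.CDLL("libcuda.so.1")  # the driver library XLA failed to find
except OSError as err:
    print("could not load libcuda.so.1:", err)
else:
    rc = libcuda.cuInit(0)  # returns 0 (CUDA_SUCCESS) if the driver initializes
    print("cuInit returned:", rc)
```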
@zhisbug Thanks for the reply.
I followed your installation guide, so the JAX version should be correct. (By the way, I didn't see a jaxlib in my conda list. Should it be installed as well?)
Our Slurm cluster's CUDA path is /mnt/cache/share/cuda-11.2/. I googled which subpath should be added to LD_LIBRARY_PATH, although what I tried didn't work. Maybe I should search some more.
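For reference, a small sketch for checking whether that toolkit path is actually visible to the process (it assumes the usual lib64 layout under /mnt/cache/share/cuda-11.2/, which may differ on your cluster):

```python
import glob
import os

cuda_prefix = "/mnt/cache/share/cuda-11.2"    # site-specific path mentioned above
lib_dir = os.path.join(cuda_prefix, "lib64")  # assumed toolkit library layout

ld_entries = os.environ.get("LD_LIBRARY_PATH", "").split(":")
print("lib64 on LD_LIBRARY_PATH:", lib_dir in ld_entries)
print("libcudart candidates:", glob.glob(os.path.join(lib_dir, "libcudart.so*")))
```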
I will also try the second approach you mentioned above. Once I have solved the problem, I will report back under this issue.
Thank you again for taking the time to answer my questions.
When you do step 3 of this guide, https://alpa-projects.github.io/install/from_source.html#install-from-source, jaxlib is compiled and installed.
If you are testing with a JAX-only env (without Alpa), make sure to follow https://github.com/google/jax#pip-installation-gpu-cuda to install jaxlib.
jaxlib is required to run JAX, with or without Alpa; it is the backend of JAX. And you need the CUDA version of jaxlib.
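To double-check which build is actually installed, something like the following sketch can be run in the same environment:

```python
import jax
import jaxlib

print("jax version:   ", jax.__version__)
print("jaxlib version:", jaxlib.__version__)
# "gpu" means a CUDA-enabled jaxlib is installed and a GPU is visible;
# "cpu" suggests a CPU-only jaxlib or a CUDA loading problem.
print("default backend:", jax.default_backend())
```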
No problem. Feel free to keep us updated on your exploration.
@TarzanZhao Hi, sorry to bother you. Could you tell me the versions of CUDA, cuDNN, NCCL, and maybe OpenMPI and any other relevant tools? (Please include as many as possible.) It will help me tell my Slurm cluster administrator which versions to install. I plan to ask the administrator to install CUDA 11.2 and cuDNN 8.1. Are there any other version constraints I should pay attention to? Looking forward to your reply.
@zyc-bit In my AWS environment, I use CUDA 11.1 and cuDNN 8.1. Also, please make sure your GPU driver version is >= 460.xx.xx. Those are all the library version constraints I can think of.
@zhuohan123 Thanks a lot for replying. This really helped me. Thank you again.
@zhisbug I solved the "can't find GPU" problem I mentioned above. The reason is that when compiling jaxlib, users like me on Slurm must compile under an srun command, which means compiling on a node with a GPU. (I previously thought compiling with just the CPU would be enough, but it seems that on Slurm you must have a GPU available when compiling.)
This may serve as a reference for other Slurm users.
Now I've run into a new problem, which may be Ray-related. (Please forgive me for asking it here; after googling, I didn't find a relevant Ray answer.) If this question doesn't fit under this issue, please let me know. If it is not a problem caused by Alpa, I will ask in the Ray project or community.
One possibility: it seems that your srun is not requesting a sufficient number of CPU threads for Ray to work.
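One way to see what Ray actually detected (a sketch, assuming a Ray cluster has already been started inside the Slurm allocation, e.g. with ray start --head):

```python
import ray

# Connect to the Ray cluster already running inside the allocation.
ray.init(address="auto")

print("cluster resources:  ", ray.cluster_resources())    # total CPUs/GPUs Ray sees
print("available resources:", ray.available_resources())  # currently unclaimed
ray.shutdown()
```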
@zhisbug Thank you for your reply, and I apologize for my late response. According to our Slurm cluster setup, once a task is sent to a compute node, it can use all the CPUs of that node, so it shouldn't be a problem with the number of CPU threads. But I will keep troubleshooting in this direction. It could be an issue with the still-active process starting with (raylet) shown in the picture. What kind of process is that? Is it a Ray process? How do I find it?
I found the solution. The error was caused by the cluster's proxy settings, so Slurm cluster users should check their proxy configuration before using Alpa on their cluster. Maybe you can add a reminder for Slurm users in your docs. I'm finally going to start using Alpa; other problems may come up, and I'll open another issue if they do. So this issue can be closed. Thank you all. @zhisbug @zhuohan123 @TarzanZhao
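For other Slurm users hitting this, a minimal sketch of the kind of proxy check that was needed here (which variables actually matter is entirely site-specific):

```python
import os

# Proxy settings that can interfere with Ray's node-to-node connections on a cluster.
for var in ("http_proxy", "https_proxy", "no_proxy",
            "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")
```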
Hi, I don't know if it's appropriate to file this as a bug, but it's been bugging me for a long time and I have no way to fix it.
I'm operating on a cluster. Ray saw my GPU, but Alpa didn't. I followed the installation documentation (Install Alpa), and I confirmed I used --enable_cuda when I compiled jax-alpa. When running tests/test_install.py, errors are reported; see the error log attached below for more details.

System information and environment
(conda list and pip list show the Alpa version is 0.0.0)

To Reproduce
I ran:

and my test_install.sh is:

Log

and my Environment Variables are: