Closed: AnuragKr closed this issue 2 years ago.
[1,0]:Switch to serial execution due to lack of horovod module.
Can you check whether import horovod.tensorflow works?
Your Horovod may not be built against TensorFlow. Please refer to Horovod's documentation.
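For example, these two quick checks (generic Horovod commands, nothing DeePMD-specific) should both succeed if Horovod was built with TensorFlow support:

horovodrun --check-build
python -c "import horovod.tensorflow as hvd; hvd.init(); print('horovod size:', hvd.size())"

The first should list TensorFlow under Available Frameworks; the second should import and initialize without an error.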
I checked import horovod.tensorflow and it works. I followed all the steps mentioned in the documentation, but I am still getting the same error. I am doing all of this in a virtual environment; I hope that is not an issue.
Output of horovodrun --check-build --
Horovod v0.24.3:
Available Frameworks: [X] TensorFlow [X] PyTorch [ ] MXNet
Available Controllers: [X] MPI [ ] Gloo
Available Tensor Operations: [X] NCCL [ ] DDL [ ] CCL [X] MPI [ ] Gloo
That's weird. Could you add raise after the following line? It can help to debug what the error is here.
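For context, the fallback in question is the usual try/except around the Horovod import; a minimal illustrative sketch (not the exact deepmd-kit source) of where the raise would go:

try:
    import horovod.tensorflow as HVD
except ImportError:
    # this is where "Switch to serial execution due to lack of horovod module." is logged;
    # a bare `raise` here re-raises the swallowed exception so the real import error is visible
    raise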
As I mentioned above, I was doing it in a virtual environment. I have now installed Horovod again globally, and it works in the virtual environment as well. But now I am getting a new error.
Command -- CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json
Output --
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Bootstrap : Using enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Failed to open libibverbs.so[.1]
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO NET/Socket : Using [0]enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Bootstrap : Using enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hp-HP-Z8-G4-Workstation:198126:198130 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:963 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Failed to open libibverbs.so[.1]
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO NET/Socket : Using [0]enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Using network Socket
hp-HP-Z8-G4-Workstation:198127:198131 [1] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:963 -> 1
It looks like your virtual environment does not have the NVIDIA driver installed?
The NVIDIA driver is installed. This error comes whenever I try to run deepmd-kit with more than one process.
This error may come from NCCL, see https://github.com/NVIDIA/nccl/issues/658. Does the solution mentioned in this issue work for you?
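As a generic diagnostic (not specific to that issue): the "CUDA driver is a stub library" warning usually means the stub libcuda.so from the CUDA toolkit's stubs/ directory is being loaded instead of the real driver library. Two quick checks:

echo $LD_LIBRARY_PATH          # should not contain a .../cuda/lib64/stubs entry
ldconfig -p | grep libcuda.so  # should resolve to the driver's libcuda, e.g. under /usr/lib/x86_64-linux-gnu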
The solution given by benmenadue -- I am unable to understand it. If you can help me out, what changes do I have to make?
System -- NCCL 2.12.12, workstation with 2 GPUs, CUDA 11.7. Steps I have done --
@njzjz I tried the link you mentioned, and I was able to run the nccl-tests via cudart:
(tensorflow) anurag1@hp-HP-Z8-G4-Workstation:/nccl-tests$ NCCL_DEBUG=WARN LD_LIBRARY_PATH=~/.local/nccl/lib/ ./src/build-shared/all_gather_perf
nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
Using devices
Rank 0 Pid 775733 on hp-HP-Z8-G4-Workstation device 0 [0x15] NVIDIA GeForce RTX 2080 Ti
NCCL version 2.12.12+cuda11.7
                                              out-of-place                       in-place
       size         count      type     time   algbw   busbw  error       time     algbw   busbw  error
        (B)    (elements)               (us)  (GB/s)  (GB/s)               (us)    (GB/s)  (GB/s)
   33554432       8388608     float    125.7  266.96    0.00  0e+00       0.90  37470.05    0.00  0e+00
Out of bounds values : 0 OK
Avg bus bandwidth    : 0
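(A two-GPU run of the same test would exercise the failing configuration outside DeePMD-kit; -g is the nccl-tests option for the number of GPUs per thread, shown here only as a suggestion:)

NCCL_DEBUG=WARN LD_LIBRARY_PATH=~/.local/nccl/lib/ ./src/build-shared/all_gather_perf -g 2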
But the error still persists.
Command -- CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json
Error stack trace --
hp-HP-Z8-G4-Workstation:779166:779172 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:963 -> 1
Traceback (most recent call last):
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
    return fn(*args)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
  [[{{node HorovodBroadcast_layer_0_type_1_bias_Adam_1_0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in
Detected at node 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0' defined at (most recent call last):
File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in
Original stack trace for 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0':
File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in
hp-HP-Z8-G4-Workstation:779167:779171 [1] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
hp-HP-Z8-G4-Workstation:779167:779171 [1] NCCL INFO init.cc:1084 -> 4
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[52430,1],1]
Exit code: 1
Did you compile NCCL by yourself?
Yes
I suggest you try our conda package to see whether the error comes from the compilation or runtime environments.
conda create -n deepmd horovod nccl cudatoolkit=11.6 -c https://conda.deepmodeling.com
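Then activate the environment and rerun the same command inside it, e.g.:

conda activate deepmd
CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json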
In https://github.com/NVIDIA/nccl/issues/658, sclarkson suggested removing nccl/src/enhcompat.cc. You may give it a try.
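A rough sketch of that workaround, assuming NCCL is rebuilt from source (clone location and runtime library path are illustrative):

git clone https://github.com/NVIDIA/nccl.git
cd nccl
rm src/enhcompat.cc                                     # the change suggested in NVIDIA/nccl#658
make -j src.build                                       # rebuild; libraries end up in ./build/lib
export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH  # use the rebuilt libnccl at runtime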
@njzjz I tried using conda, but the error is still the same.
Output --
[0] DEEPMD rank:0 INFO built training
[0] DEEPMD rank:0 INFO initialize model from scratch
[0] DEEPMD rank:0 INFO broadcast global variables to other tasks
[1] DEEPMD rank:1 INFO built training
[1] DEEPMD rank:1 INFO receive global variables from task#0
[1] Traceback (most recent call last):
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
[1] return fn(*args)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
[1] return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
[1] return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[1] [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[1]
[1] During handling of the above exception, another exception occurred:
[1]
[1] Traceback (most recent call last):
[1] File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in
Regarding NCCL, there is no src folder and no enhcompat.cc, as I have installed 2.12.12.
I think I need to change the system; something is wrong with the system or there is a corrupt CUDA installation.
I have the same problem.
Training with 1 GPU is fine. Training with 2 GPUs with horovodrun or mpirun results in this error:
[1]     return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[1]     [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
I did a clean re-installation of Ubuntu 22.04 and installed only the DeePMD-kit 2.1.3 / CUDA 11.6 conda environment, without any other packages. I do not think it is a package conflict problem on my side.
https://github.com/horovod/horovod/issues/3625#issuecomment-1228884495 could resolve this issue temporarily. The original error should be tracked in the upstream repository.
For conda users: a new NCCL package has been uploaded to our conda channel.
@njzjz Pardon me for asking questions on this issue after a long time. Could you please tell me in which file I need to make the change CUDARTLIB="cuda"? I couldn't find any Makefile that contains this line, and I followed the installation from source. For the conda version -- could you please provide the conda channel link where the NCCL package was uploaded, or are you referring to this one: conda install -c conda-forge nccl?
@AnuragKr The variable can be assigned on the make command line: make CUDARTLIB="cuda".
The conda channel is https://conda.deepmodeling.com
@njzjz Thanks for the prompt response. When I try to run it I get the following error -- I think I am missing some steps, since make requires a Makefile as an input file but I don't have a Makefile in the deepmd directory. Please let me know how to make the above changes.
For the conda version -- the above link redirects to the official website's installation page. From where should I download the NCCL package, or will this command work: conda install -c conda-forge nccl?
The Makefile belongs to NCCL, i.e. https://github.com/NVIDIA/nccl/blob/master/Makefile
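In other words, a minimal sketch, run from inside an NCCL source checkout (not the deepmd directory):

make -j src.build CUDARTLIB="cuda"

Then point LD_LIBRARY_PATH at the resulting build/lib, as with the earlier rebuild sketch.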
conda: conda install nccl -c https://conda.deepmodeling.com
Bug summary
There is a problem with parallel training: every time, it falls back to serial execution mode. I have installed all the packages correctly as per the documentation.
DeePMD-kit Version
2.1.1
TensorFlow Version
2.9.1
How did you download the software?
Built from source
Input Files, Running Commands, Error Log, etc.
Command -- CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train --mpi-log=workers input.json
Output --
[1,0]:WARNING:tensorflow:From /home/anurag/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,0]:Instructions for updating:
[1,0]:non-resource variables are not supported in the long term
[1,1]:WARNING:tensorflow:From /home/anurag/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,1]:Instructions for updating:
[1,1]:non-resource variables are not supported in the long term
[1,0]:Switch to serial execution due to lack of horovod module.
[1,1]:Switch to serial execution due to lack of horovod module.
[1,0]:DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:DEEPMD INFO    training data with min nbor dist: 0.8854385688525511
[1,1]:DEEPMD INFO    training data with max nbor size: [38, 72]
[1,1]:DEEPMD INFO    (DeePMD-kit ASCII art banner)
[1,1]:DEEPMD INFO    Please read and cite:
[1,1]:DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,1]:DEEPMD INFO    installed to:         /tmp/pip-req-build-pjks4pue/_skbuild/linux-x86_64-3.8/cmake-install
[1,1]:DEEPMD INFO    source :              v2.1.1
[1,1]:DEEPMD INFO    source brach:         master
[1,1]:DEEPMD INFO    source commit:        https://github.com/deepmodeling/deepmd-kit/commit/c4f0cec0e20bab38579a3a29f1106cbee4a8ecf9
[1,1]:DEEPMD INFO    source commit at:     2022-04-16 11:11:16 +0800
[1,1]:DEEPMD INFO    build float prec:     double
[1,1]:DEEPMD INFO    build with tf inc:    /tmp/pip-build-env-dfkmanfm/normal/lib/python3.8/site-packages/tensorflow/include
[1,1]:DEEPMD INFO    build with tf lib:
[1,1]:DEEPMD INFO    ---Summary of the training---------------------------------------
[1,1]:DEEPMD INFO    running on:           hp-HP-Z8-G4-Workstation
[1,1]:DEEPMD INFO    computing device:     gpu:0
[1,1]:DEEPMD INFO    CUDA_VISIBLE_DEVICES: 0,1
[1,1]:DEEPMD INFO    Count of visible GPU: 2
[1,1]:DEEPMD INFO    num_intra_threads:    6
[1,1]:DEEPMD INFO    num_inter_threads:    5
[1,1]:DEEPMD INFO    -----------------------------------------------------------------
Steps to Reproduce
GPU configuration (nvidia-smi) --
Mon Jun 20 14:14:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:15:00.0 Off |                  N/A |
| 30%   34C    P8    19W / 250W |     10MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| 30%   40C    P8    17W / 250W |    192MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1047      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      1536      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1047      G   /usr/lib/xorg/Xorg                 35MiB |
|    1   N/A  N/A      1536      G   /usr/lib/xorg/Xorg                113MiB |
|    1   N/A  N/A      1666      G   /usr/bin/gnome-shell               11MiB |
|    1   N/A  N/A      2009      G   ...mviewer/tv_bin/TeamViewer       12MiB |
+-----------------------------------------------------------------------------+
Further Information, Files, and Links
No response