deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] _Parallel_training_using_horovodrun_not_working #1774

Closed AnuragKr closed 2 years ago

AnuragKr commented 2 years ago

Bug summary

Parallel training is not working: every run falls back to serial execution mode. I have installed all the packages correctly as described in the documentation.

DeePMD-kit Version

2.1.1

TensorFlow Version

2.9.1

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Command -- CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train --mpi-log=workers input.json

Output --
[1,0]:WARNING:tensorflow:From /home/anurag/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,0]:Instructions for updating:
[1,0]:non-resource variables are not supported in the long term
[1,1]:WARNING:tensorflow:From /home/anurag/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,1]:Instructions for updating:
[1,1]:non-resource variables are not supported in the long term
[1,0]:Switch to serial execution due to lack of horovod module.
[1,1]:Switch to serial execution due to lack of horovod module.
[1,0]:DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:DEEPMD INFO training data with min nbor dist: 0.8854385688525511
[1,1]:DEEPMD INFO training data with max nbor size: [38, 72]
[1,1]:DEEPMD INFO [DeePMD-kit ASCII-art banner]
[1,1]:DEEPMD INFO Please read and cite:
[1,1]:DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,1]:DEEPMD INFO installed to: /tmp/pip-req-build-pjks4pue/_skbuild/linux-x86_64-3.8/cmake-install
[1,1]:DEEPMD INFO source : v2.1.1
[1,1]:DEEPMD INFO source brach: master
[1,1]:DEEPMD INFO source commit: https://github.com/deepmodeling/deepmd-kit/commit/c4f0cec0e20bab38579a3a29f1106cbee4a8ecf9
[1,1]:DEEPMD INFO source commit at: 2022-04-16 11:11:16 +0800
[1,1]:DEEPMD INFO build float prec: double
[1,1]:DEEPMD INFO build with tf inc: /tmp/pip-build-env-dfkmanfm/normal/lib/python3.8/site-packages/tensorflow/include
[1,1]:DEEPMD INFO build with tf lib:
[1,1]:DEEPMD INFO ---Summary of the training---------------------------------------
[1,1]:DEEPMD INFO running on: hp-HP-Z8-G4-Workstation
[1,1]:DEEPMD INFO computing device: gpu:0
[1,1]:DEEPMD INFO CUDA_VISIBLE_DEVICES: 0,1
[1,1]:DEEPMD INFO Count of visible GPU: 2
[1,1]:DEEPMD INFO num_intra_threads: 6
[1,1]:DEEPMD INFO num_inter_threads: 5
[1,1]:DEEPMD INFO -----------------------------------------------------------------

Steps to Reproduce

  1. Go to the directory deepmd-kit/examples/water/se_e2_a
  2. Run the command CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train --mpi-log=workers input.json

GPU Configuration

Mon Jun 20 14:14:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:15:00.0 Off |                  N/A |
| 30%   34C    P8    19W / 250W |     10MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| 30%   40C    P8    17W / 250W |    192MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1047      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      1536      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1047      G   /usr/lib/xorg/Xorg                 35MiB |
|    1   N/A  N/A      1536      G   /usr/lib/xorg/Xorg                113MiB |
|    1   N/A  N/A      1666      G   /usr/bin/gnome-shell               11MiB |
|    1   N/A  N/A      2009      G   ...mviewer/tv_bin/TeamViewer       12MiB |
+-----------------------------------------------------------------------------+

Further Information, Files, and Links

No response

njzjz commented 2 years ago

[1,0]:Switch to serial execution due to lack of horovod module.

Can you check whether import horovod.tensorflow works?

Your Horovod may not have been built against TensorFlow. Please refer to Horovod's documentation.
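One way to run that check is sketched below. It reports whether the interpreter in use can locate horovod.tensorflow and from which file, which also shows whether a virtual environment or a global install is being picked up. The helper name locate_module is hypothetical and not part of DeePMD-kit or Horovod.

```python
import importlib.util
import sys


def locate_module(name):
    """Report whether `name` can be found and which file would be imported.

    Only the parent package is imported; the leaf module itself is located
    without being executed.
    """
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # The parent package (e.g. "horovod") is missing or failed to import.
        spec = None
    return {
        "module": name,
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),
        "interpreter": sys.executable,
    }


if __name__ == "__main__":
    # "horovod.tensorflow" is the module named in the comment above.
    print(locate_module("horovod.tensorflow"))
```

Finding the module is a weaker check than importing it: a Horovod build without TensorFlow support can still be located on disk but fail when imported, so a plain import in the same interpreter remains the definitive test.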

AnuragKr commented 2 years ago

I checked import horovod.tensorflow and it works. I followed all the steps mentioned in the documentation, but I am still getting the same error. I am doing all of this in a virtual environment; I hope that is not an issue.

horovodrun --check-build output --

Horovod v0.24.3:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [ ] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [ ] Gloo

njzjz commented 2 years ago

That's weird. Could you add raise after the following line? It would help to debug what the error is here.

https://github.com/deepmodeling/deepmd-kit/blob/c4f0cec0e20bab38579a3a29f1106cbee4a8ecf9/deepmd/train/run_options.py#L183
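The code at that line is not reproduced in this thread, so the following is only a generic sketch of the pattern being suggested, assuming the fallback to serial mode happens in an except ImportError branch. The function load_optional and its debug flag are hypothetical; the point is that a bare raise inside the except block surfaces the original traceback instead of hiding it behind a fallback message.

```python
import importlib


def load_optional(name, debug=False):
    """Import an optional module, returning None when the import fails.

    With debug=True the original exception is re-raised, so the full
    traceback shows why the import failed.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        if debug:
            raise  # bare raise keeps the original traceback intact
        return None


if __name__ == "__main__":
    # Silent fallback: prints None if Horovod cannot be imported.
    print(load_optional("horovod.tensorflow"))
```

Calling load_optional("horovod.tensorflow", debug=True) in the failing environment would then show the underlying ImportError rather than a silent switch to serial execution.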

AnuragKr commented 2 years ago

As I mentioned above, I was doing this in a virtual environment. I have now installed Horovod again globally, and it works in the virtual environment as well. But now I am getting a new error.

Command -- CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json

Output --
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Bootstrap : Using enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Failed to open libibverbs.so[.1]
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO NET/Socket : Using [0]enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Bootstrap : Using enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

hp-HP-Z8-G4-Workstation:198126:198130 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:963 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Failed to open libibverbs.so[.1]
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO NET/Socket : Using [0]enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Using network Socket

hp-HP-Z8-G4-Workstation:198126:198130 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:963 -> 1

hp-HP-Z8-G4-Workstation:198127:198131 [1] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:963 -> 1

njzjz commented 2 years ago

It looks like the NVIDIA driver is not installed in your virtual environment?

AnuragKr commented 2 years ago

The NVIDIA driver is installed. This error comes up whenever I try to run deepmd-kit with more than 1 process.

njzjz commented 2 years ago

This error may come from NCCL, see https://github.com/NVIDIA/nccl/issues/658. Does the solution mentioned in this issue work for you?
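The message 'CUDA driver is a stub library' is the CUDA error reported when the libcuda that got loaded is a link-time stub rather than the real driver library. The CUDA toolkit ships such a stub in a directory named stubs, so one quick check is whether a directory like that is on the library search path. The sketch below assumes that this is worth ruling out here; nothing in this thread confirms it is the cause, and the helper name find_stub_dirs is hypothetical.

```python
import os


def find_stub_dirs(search_path):
    """Return the entries of a library search path that contain a 'stubs' component."""
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue
        # Split the normalized path into components and look for "stubs".
        parts = os.path.normpath(entry).split(os.sep)
        if "stubs" in parts:
            hits.append(entry)
    return hits


if __name__ == "__main__":
    for d in find_stub_dirs(os.environ.get("LD_LIBRARY_PATH", "")):
        print("possible CUDA stub directory on LD_LIBRARY_PATH:", d)
```

An empty result does not rule out a stub being loaded some other way (for example through an rpath baked into a library at build time); it only checks the environment variable.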

AnuragKr commented 2 years ago

I was unable to understand the solution given by benmenadue. Could you help me work out what changes I have to make?

System --
NCCL - 2.12.12
Workstation with 2 GPUs
CUDA - 11.7

Steps I have done --

  1. anurag1@hp-HP-Z8-G4-Workstation:~/.local/nccl$ objdump -p lib/libnccl.so.2.12.12 | grep NEEDED
     NEEDED libpthread.so.0
     NEEDED librt.so.1
     NEEDED libdl.so.2
     NEEDED libstdc++.so.6
     NEEDED libm.so.6
     NEEDED libgcc_s.so.1
     NEEDED libc.so.6
     NEEDED ld-linux-x86-64.so.2
     It doesn't require libcudart.
  2. I ran nccl-test as described in the linked nccl-test instructions and it worked, but when I tried to run nccl-test with CUDART as mentioned in the above link, I got -- ./build/all_gather_perf: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory. To resolve this I gave the path to libcudart.so.11.0 explicitly, but it still did not work, so I needed to copy that file to lib64.

AnuragKr commented 2 years ago

@njzjz I tried the link you mentioned, and I was able to run that nccl-test via cudart:

(tensorflow) anurag1@hp-HP-Z8-G4-Workstation:/nccl-tests$ NCCL_DEBUG=WARN LD_LIBRARY_PATH=~/.local/nccl/lib/ ./src/build-shared/all_gather_perf
nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
Using devices
Rank 0 Pid 775733 on hp-HP-Z8-G4-Workstation device 0 [0x15] NVIDIA GeForce RTX 2080 Ti
NCCL version 2.12.12+cuda11.7

                                           out-of-place                       in-place          
   size         count      type     time   algbw   busbw  error     time   algbw   busbw  error
    (B)    (elements)               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
33554432       8388608     float    125.7  266.96    0.00  0e+00     0.90  37470.05    0.00  0e+00

Out of bounds values : 0 OK
Avg bus bandwidth : 0

But the error still persists.

Command -- CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json

Error Stack Trace --

hp-HP-Z8-G4-Workstation:779166:779172 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:963 -> 1
Traceback (most recent call last):
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
    return fn(*args)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
  [[{{node HorovodBroadcast_layer_0_type_1_bias_Adam_1_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
    train_dp(**dict_args)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
    model.train(train_data, valid_data)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
    self._init_session()
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 435, in _init_session
    run_sess(self.sess, bcast_op)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/utils/sess.py", line 21, in run_sess
    return sess.run(*args, **kwargs)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0' defined at (most recent call last):
  File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
    train_dp(**dict_args)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
    model.train(train_data, valid_data)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
    self._init_session()
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 430, in _init_session
    bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 339, in broadcast_global_variables
    return broadcast_variables(_global_variables(), root_rank)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
    return broadcast_group(variables, root_rank, process_set)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
    return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
    return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
    return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
  File "<string>", line 515, in horovod_broadcast
Node: 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0'
ncclCommInitRank failed: unhandled cuda error
  [[{{node HorovodBroadcast_layer_0_type_1_bias_Adam_1_0}}]]

Original stack trace for 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0':
  File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
    train_dp(**dict_args)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
    model.train(train_data, valid_data)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
    self._init_session()
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 430, in _init_session
    bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 339, in broadcast_global_variables
    return broadcast_variables(_global_variables(), root_rank)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
    return broadcast_group(variables, root_rank, process_set)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
    return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
    return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
    return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
  File "<string>", line 515, in horovod_broadcast
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
    ret = Operation(
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

hp-HP-Z8-G4-Workstation:779167:779171 [1] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
hp-HP-Z8-G4-Workstation:779167:779171 [1] NCCL INFO init.cc:1084 -> 4

hp-HP-Z8-G4-Workstation:779166:779172 [0] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:1084 -> 4

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[52430,1],1]
Exit code: 1
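In output this long, the NCCL WARN lines are the ones that name the underlying failure; the Python tracebacks only report that ncclCommInitRank failed. A small filter like the one below can pull them out of a saved log. It is a sketch that assumes one log entry per line, and the function name nccl_warnings is hypothetical.

```python
def nccl_warnings(log_text):
    """Return the distinct NCCL WARN messages in log_text, in first-seen order."""
    marker = "NCCL WARN "
    seen = []
    for line in log_text.splitlines():
        # partition() yields an empty `found` when the marker is absent.
        _, found, message = line.partition(marker)
        message = message.strip()
        if found and message and message not in seen:
            seen.append(message)
    return seen


if __name__ == "__main__":
    sample = (
        "host:1:2 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'\n"
        "host:1:2 [0] NCCL INFO init.cc:913 -> 1\n"
        "host:3:4 [1] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL\n"
    )
    for message in nccl_warnings(sample):
        print(message)
```

Applied to the output above, the first message reported would be the stub-library failure; the later "comm argument is NULL" warnings come after communicator initialization has already failed.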

njzjz commented 2 years ago

Did you compile NCCL by yourself?

AnuragKr commented 2 years ago

Yes

njzjz commented 2 years ago

I suggest you try our conda package to see whether the error comes from the compilation or runtime environments.

conda create -n deepmd horovod nccl cudatoolkit=11.6 -c https://conda.deepmodeling.com

In https://github.com/NVIDIA/nccl/issues/658, sclarkson suggested removing nccl/src/enhcompat.cc. You could give that a try.

AnuragKr commented 2 years ago

@njzjz I tried using conda, but the error is still the same.

Output --
[0] DEEPMD rank:0 INFO built training
[0] DEEPMD rank:0 INFO initialize model from scratch
[0] DEEPMD rank:0 INFO broadcast global variables to other tasks
[1] DEEPMD rank:1 INFO built training
[1] DEEPMD rank:1 INFO receive global variables from task#0
[1] Traceback (most recent call last):
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
[1]     return fn(*args)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
[1]     return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
[1]     return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[1]   [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[1]
[1]
[1] During handling of the above exception, another exception occurred:
[1]
[1] Traceback (most recent call last):
[1]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1]     sys.exit(main())
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1]     train_dp(**dict_args)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1]     _do_work(jdata, run_opt, is_compress)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1]     model.train(train_data, valid_data)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1]     self._init_session()
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 445, in _init_session
[1]     run_sess(self.sess, bcast_op)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/sess.py", line 21, in run_sess
[1]     return sess.run(*args, **kwargs)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 967, in run
[1]     result = self._run(None, fetches, feed_dict, options_ptr,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[1]     results = self._do_run(handle, final_targets, final_fetches,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
[1]     return self._do_call(_run_fn, feeds, fetches, targets, options,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
[1]     raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
[1] tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
[1]
[1] Detected at node 'HorovodBroadcast_filter_type_0_matrix_3_0_0' defined at (most recent call last):
[1]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1]     sys.exit(main())
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1]     train_dp(**dict_args)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1]     _do_work(jdata, run_opt, is_compress)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1]     model.train(train_data, valid_data)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1]     self._init_session()
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[1]     bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[1]     return broadcast_variables(_global_variables(), root_rank)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[1]     return broadcast_group(variables, root_rank, process_set)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1]     return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1]     return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[1]     return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[1]   File "<string>", line 515, in horovod_broadcast
[1] Node: 'HorovodBroadcast_filter_type_0_matrix_3_0_0'
[1] ncclCommInitRank failed: unhandled cuda error
[1]   [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[1]
[1] Original stack trace for 'HorovodBroadcast_filter_type_0_matrix_3_0_0':
[1]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1]     sys.exit(main())
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1]     train_dp(**dict_args)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1]     _do_work(jdata, run_opt, is_compress)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1]     model.train(train_data, valid_data)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1]     self._init_session()
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[1]     bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[1]     return broadcast_variables(_global_variables(), root_rank)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[1]     return broadcast_group(variables, root_rank, process_set)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1]     return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1]     return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[1]     return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[1]   File "<string>", line 515, in horovod_broadcast
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
[1]     op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
[1]     ret = Operation(
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
[1]     self._traceback = tf_stack.extract_stack_for_node(self._c_op)
[1]
[0] Traceback (most recent call last):
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
[0]     return fn(*args)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
[0]     return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
[0]     return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[0] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[0]   [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[0]
[0]
[0] During handling of the above exception, another exception occurred:
[0]
[0] Traceback (most recent call last):
[0]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0]     sys.exit(main())
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0]     train_dp(**dict_args)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0]     _do_work(jdata, run_opt, is_compress)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0]     model.train(train_data, valid_data)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0]     self._init_session()
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 445, in _init_session
[0]     run_sess(self.sess, bcast_op)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/sess.py", line 21, in run_sess
[0]     return sess.run(*args, **kwargs)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 967, in run
[0]     result = self._run(None, fetches, feed_dict, options_ptr,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[0]     results = self._do_run(handle, final_targets, final_fetches,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
[0]     return self._do_call(_run_fn, feeds, fetches, targets, options,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
[0]     raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
[0] tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
[0]
[0] Detected at node 'HorovodBroadcast_filter_type_0_matrix_3_0_0' defined at (most recent call last):
[0]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0]     sys.exit(main())
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0]     train_dp(**dict_args)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0]     _do_work(jdata, run_opt, is_compress)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0]     model.train(train_data, valid_data)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0]     self._init_session()
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[0]     bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[0]     return broadcast_variables(_global_variables(), root_rank)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[0]     return broadcast_group(variables, root_rank, process_set)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[0]     return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[0]     return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[0]     return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[0]   File "<string>", line 515, in horovod_broadcast
[0] Node: 'HorovodBroadcast_filter_type_0_matrix_3_0_0'
[0] ncclCommInitRank failed: unhandled cuda error
[0]   [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[0]
[0] Original stack trace for 'HorovodBroadcast_filter_type_0_matrix_3_0_0':
[0]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0]     sys.exit(main())
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0]     train_dp(**dict_args)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0]     _do_work(jdata, run_opt, is_compress)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0]     model.train(train_data, valid_data)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0]     self._init_session()
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[0]     bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[0]     return broadcast_variables(_global_variables(), root_rank)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[0]     return broadcast_group(variables, root_rank, process_set)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[0]     return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[0]     return tf.group(*[var.assign(broadcast(var, root_rank, process_set=process_set))
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[0]     return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[0]   File "<string>", line 515, in horovod_broadcast
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
[0]     op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
[0]     ret = Operation(
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
[0]     self._traceback = tf_stack.extract_stack_for_node(self._c_op)

Regarding NCCL: there is no src folder and no enhcompat.cc, since I have installed version 2.12.12.

I think I need to change systems: either something is wrong with this system or the CUDA installation is corrupt.

Lewis-YL commented 2 years ago

I have the same problem.

Training with 1 GPU is fine. Training with 2 GPUs with horovodrun or mpirun results in this error:

[1] return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, [1] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error [1] [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]][1]

I did a clean reinstallation of Ubuntu 22.04 and installed only the deepmd 2.1.3 CUDA 11.6 conda environment, without any other packages, so I do not think it is a package conflict problem on my side.
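
When comparing reports like the two above, it can help to record what each environment actually contains before launching horovodrun. The following is only a minimal sketch: the helper names collect_versions and horovod_build_flags are hypothetical, and the distribution names deepmd-kit, tensorflow and horovod are assumed to match how the packages were installed.

```python
import os
from importlib import metadata


def collect_versions(packages=("deepmd-kit", "tensorflow", "horovod")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def horovod_build_flags():
    """Report which backends Horovod was built with, or None if it cannot be imported."""
    try:
        import horovod.tensorflow as hvd
    except Exception:  # module missing, or a shared library failed to load
        return None
    return {
        "mpi": bool(hvd.mpi_built()),
        "gloo": bool(hvd.gloo_built()),
        "nccl": bool(hvd.nccl_built()),
    }


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("versions:", collect_versions())
    print("horovod build flags:", horovod_build_flags())
```

A result of None for horovod would be consistent with the "lack of horovod module" message in the first report, while a successful import with NCCL enabled points toward the NCCL/CUDA side discussed below.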

njzjz commented 2 years ago

https://github.com/horovod/horovod/issues/3625#issuecomment-1228884495 could resolve this issue temporarily. The original error should be tracked in the upstream repository.

For conda users: a new NCCL package has been uploaded to our conda channel.

AnuragKr commented 2 years ago

@njzjz Pardon me for coming back to this issue after such a long time. Could you please tell me in which file I need to make the change CUDARTLIB="cuda"? I installed from source and couldn't find any Makefile that contains this line. For the conda version: could you please provide the link to the conda channel where the NCCL package was uploaded, or are you referring to this one: conda install -c conda-forge nccl

njzjz commented 2 years ago

@AnuragKr The variable can be assigned on the make command line: make CUDARTLIB="cuda".

The conda channel is https://conda.deepmodeling.com

AnuragKr commented 2 years ago

@njzjz Thanks for the prompt response. When I try to run it, I get the following error -- make_error. I think I am missing some steps: make requires a Makefile as input, but I don't have a Makefile in the deepmd directory. Please let me know how to make the above change.

For the conda version -- the above link redirects to the installation page of the official website. Where can I download the NCCL package from, or will the command conda install -c conda-forge nccl work?

njzjz commented 2 years ago

The Makefile is the one for NCCL, i.e. https://github.com/NVIDIA/nccl/blob/master/Makefile

conda: conda install nccl -c https://conda.deepmodeling.com
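
After rebuilding or reinstalling NCCL, one way to see which NCCL build the dynamic loader picks up is to call ncclGetVersion through ctypes. This is a hedged sketch, not part of DeePMD-kit or Horovod: the function names are hypothetical, and the library found by ctypes.util.find_library is an assumption that may differ from the one Horovod was actually linked against (for example inside a conda environment).

```python
import ctypes
import ctypes.util


def decode_nccl_version(code):
    """Split NCCL's integer version code into (major, minor, patch)."""
    # NCCL up to 2.8 encodes versions as major*1000 + minor*100 + patch;
    # 2.9 and later use major*10000 + minor*100 + patch.
    if code < 10000:
        return code // 1000, (code % 1000) // 100, code % 100
    return code // 10000, (code % 10000) // 100, code % 100


def loaded_nccl_version(lib_name="nccl"):
    """Return the version of the NCCL library the loader finds, or None."""
    path = ctypes.util.find_library(lib_name)
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        get_version = lib.ncclGetVersion
    except (OSError, AttributeError):
        return None
    get_version.argtypes = [ctypes.POINTER(ctypes.c_int)]
    get_version.restype = ctypes.c_int
    version = ctypes.c_int(0)
    if get_version(ctypes.byref(version)) != 0:  # ncclSuccess is 0
        return None
    return decode_nccl_version(version.value)


if __name__ == "__main__":
    print("NCCL found by the dynamic loader:", loaded_nccl_version())
```

For the 2.12.12 release mentioned earlier in the thread, the version code is 21212, which decodes to (2, 12, 12).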