baidu-research / tensorflow-allreduce


mpirun -np 4 python mpi_ops_test.py failed #9

Open abhishekcs10 opened 7 years ago

abhishekcs10 commented 7 years ago

NOTE: Only file GitHub issues for bugs and feature requests. All other topics will be closed.

For general support from the community, see StackOverflow. To make bugs and feature requests easier to find and organize, we close issues that are deemed out of scope for GitHub Issues and point people to StackOverflow.

For bugs or installation issues, please provide the following information. The more information you provide, the more easily we will be able to offer help and advice.

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

No threads available

Environment info

Operating System: Ubuntu 16.04.1 LTS

Installed version of CUDA and cuDNN: CUDA 8.0 and cuDNN 5.1.10 (output of ls -l /path/to/cuda/lib/libcud* attached below):

-rw-r--r-- 1 root root 559800 Jan 26 17:10 /usr/local/cuda-8.0/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root     16 Jan 26 17:13 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root     19 Jan 26 17:13 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61
-rw-r--r-- 1 root root 476024 Jan 26 17:10 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
-rw-r--r-- 1 root root 966166 Jan 26 17:10 /usr/local/cuda-8.0/lib64/libcudart_static.a

If installed from binary pip package, provide:

  1. A link to the pip package you installed:
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)":

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so.5.1.10 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
0.12.1

If installed from source, provide:

  1. The commit hash (git rev-parse HEAD)
  2. The output of bazel version

Build label: 0.4.3-2017-01-24 (@6fc5c53)
Build target: bazel-out/local-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jan 24 20:34:16 2017 (1485290056)
Build timestamp: 1485290056
Build timestamp as int: 1485290056

If possible, provide a minimal reproducible example (we usually don't have time to read hundreds of lines of your code).

Running mpi_ops_test.py gives the following error:

mpirun -np 4 python mpi_ops_test.py

FailedPreconditionError (see above for traceback): MPI has not been initialized; use tf.contrib.mpi.Session. [[Node: MPISize = MPISize[_device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

ERROR: test_mpi_allreduce_error (__main__.MPITests)
Test that the allreduce raises an error if different ranks try to

Traceback (most recent call last):
  File "mpi_ops_test.py", line 162, in test_mpi_allreduce_error
    rank = session.run(mpi.rank())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
FailedPreconditionError: MPI has not been initialized; use tf.contrib.mpi.Session.
  [[Node: MPIRank = MPIRank[_device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'MPIRank', defined at:
  File "mpi_ops_test.py", line 301, in <module>
    tf.test.main()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/test.py", line 91, in main
    return _googletest.main()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/googletest.py", line 84, in main
    benchmark.benchmarks_main(true_main=g_main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/benchmark.py", line 323, in benchmarks_main
    true_main()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/googletest.py", line 58, in g_main
    return unittest_main(*args, **kwargs)
  File "/usr/lib/python2.7/unittest/main.py", line 95, in __init__
    self.runTests()
  File "/usr/lib/python2.7/unittest/main.py", line 232, in runTests
    self.result = testRunner.run(self.test)
  File "/usr/lib/python2.7/unittest/runner.py", line 151, in run
    test(result)
    return self.run(*args, **kwds)
  File "/usr/lib/python2.7/unittest/suite.py", line 108, in run
    test(result)
  File "/usr/lib/python2.7/unittest/case.py", line 393, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python2.7/unittest/case.py", line 329, in run
    testMethod()
  File "mpi_ops_test.py", line 81, in test_mpi_size
    size = session.run(mpi.size())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/mpi/mpi_ops.py", line 68, in size
    return MPI_LIB.mpi_size(name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): MPI has not been initialized; use tf.contrib.mpi.Session. [[Node: MPISize = MPISize[_device="/job:localhost/replica:0/task:0/cpu:0"]()]]
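For reference, the error hint itself names the likely fix: the MPI ops have to be evaluated inside tf.contrib.mpi.Session, which initializes MPI, rather than inside a plain tf.Session. A minimal sketch under that assumption, using the mpi.rank() and mpi.size() ops that appear in the traceback (the script name rank_check.py is hypothetical):

    import tensorflow as tf
    from tensorflow.contrib import mpi

    # Graph-mode ops from tensorflow/contrib/mpi/mpi_ops.py (per the traceback).
    rank_op = mpi.rank()
    size_op = mpi.size()

    # mpi.Session is the session the error message asks for; it initializes
    # MPI before running ops, which a plain tf.Session does not.
    with mpi.Session() as session:
        print("rank %d of %d" % (session.run(rank_op), session.run(size_op)))

Launched the same way as the failing test, e.g. mpirun -np 4 python rank_check.py, each of the four processes should print its own rank.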

abhishekcs10 commented 7 years ago

Can somebody please explain where MPI allreduce is called while running allreduce-test.py?
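For illustration, a minimal sketch of the call path in this fork, assuming the allreduce op is exposed as mpi.allreduce in tensorflow/contrib/mpi (an assumption based on mpi_ops.py; allreduce-test.py may wrap it differently). The underlying MPI communication runs when the session evaluates the op, not when the graph is built:

    import tensorflow as tf
    from tensorflow.contrib import mpi

    # Each rank contributes its own tensor; the allreduce op (assumed name
    # mpi.allreduce) returns the elementwise sum across all MPI ranks.
    local = tf.constant([1.0, 2.0, 3.0])
    summed = mpi.allreduce(local)

    # The MPI allreduce is actually invoked here, inside session.run.
    with mpi.Session() as session:
        print(session.run(summed))  # with 4 ranks: [4. 8. 12.]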

chengdianxuezi commented 6 years ago

Have you run the distributed MPI demo successfully?