Closed roclark closed 6 years ago
Hi @roclark, could you please copy and paste here the command line that you use to run it? Also, could you please tell us which version of TF you use (it's 1.4; did you get it with pip install ...?). Thanks.
PS Also, please keep in mind that we've released a new version whose command line arguments are not entirely compatible with the previous version (but that's not the issue here).
Ok, confirming this. Again a CPU issue that I thought I had fixed. Will commit an update this weekend.
It's fixed now. Turns out my previous fix only partially solved the problem.
Thanks again for the help @sergey-serebryakov! For the sake of completeness, here is the command that I was running:
python python/dlbs/experimenter.py run -Pexp.framework='"tensorflow"' -Pexp.phase='"training"' -Vexp.model='["resnet50"]' -Vexp.device_batch='"16"' -Pexp.log_file='"./tensorflow/lustre/${exp.model}/cpu.log"' -Pexp.device='"cpu"'
As for TensorFlow, yes, I installed using Pip:
$ pip show tensorflow
Name: tensorflow
Version: 1.4.1
Summary: TensorFlow helps the tensors flow
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /usr/lib64/python2.7/site-packages
Requires: enum34, protobuf, six, wheel, backports.weakref, numpy, tensorflow-tensorboard, mock
I saw that you created a substantial update a week or so ago that changes a lot of the command line arguments as you mention. I'm working on being able to use the new updates in a Docker-less environment at the moment (in parallel I'm also working on creating a Dockerfile that connects to various POSIX filesystems with specific client software - once running, this is what I intend to use) and sorting out some kinks. Do the new updates still support bare metal deployments? I admit that I haven't dived too deep yet, but I didn't see a way to run without Docker using the latest updates.
Thanks for the quick updates! They are all much appreciated!
Unfortunately it looks like I took a step backwards:
BenchmarkCNN::__init__ time=0.116825 ms
TensorFlow: 1.4
Model: resnet50
Mode: training
Batch size: 16 global
16 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
Use NCCL: True
==========
__exp.model_title__="ResNet50"
Generating model
Adding preprocessing for resnet50
Reshaped input to (16, 2048) with self.top_size = 2048
Added final fully connected layer with logits shape (16, 1001)
Adding sparse softmax cross entropy with logits with logits shape (16, 1001) and labels shape (16,)
Traceback (most recent call last):
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1472, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1468, in main
bench.run()
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1000, in run
self._benchmark_cnn()
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1091, in _benchmark_cnn
start_standard_services=FLAGS.summary_verbosity > 0) as sess:
File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
start_standard_services=start_standard_services)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels:
<no registered kernels>
[[Node: NcclAllReduce_160 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c160", _device="/device:GPU:0"](v0/tower_0/gradients/AddN)]]
Caused by op u'NcclAllReduce_160', defined at:
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1472, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1468, in main
bench.run()
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1000, in run
self._benchmark_cnn()
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1040, in _benchmark_cnn
(enqueue_ops, fetches) = self._build_model()
File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1267, in _build_model
self.variable_mgr.preprocess_device_grads(device_grads))
File "/workspace/tf_cnn_benchmarks/variable_mgr.py", line 468, in preprocess_device_grads
device_grads, self.benchmark_cnn.devices)
File "/workspace/tf_cnn_benchmarks/variable_mgr.py", line 661, in sum_gradients_all_reduce
new_tower_grads.append(sum_grad_and_var_all_reduce(grad_and_vars, devices))
File "/workspace/tf_cnn_benchmarks/variable_mgr.py", line 649, in sum_grad_and_var_all_reduce
summed_grads = nccl.all_sum(scaled_grads)
File "/usr/lib/python2.7/site-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
return _apply_all_reduce('sum', tensors)
File "/usr/lib/python2.7/site-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 208, in _apply_all_reduce
shared_name=shared_name))
File "/usr/lib/python2.7/site-packages/tensorflow/contrib/nccl/ops/gen_nccl_ops.py", line 54, in nccl_all_reduce
num_devices=num_devices, shared_name=shared_name, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels:
<no registered kernels>
[[Node: NcclAllReduce_160 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c160", _device="/device:GPU:0"](v0/tower_0/gradients/AddN)]]
__results.end_time__= "2018-01-22:20:20:46:075"
__results.proc_pid__= 50804
Is the NVIDIA Collective Communications Library required even for CPU-only environments?
Hi @roclark.
Recent changes affected parameter names. General parameters are described here. In particular, exp.device is now exp.device_type. So please try running this:
python python/dlbs/experimenter.py run -Pexp.framework='"tensorflow"' -Pexp.phase='"training"' -Vexp.model='["resnet50"]' -Vexp.device_batch='"16"' -Pexp.log_file='"./tensorflow/lustre/${exp.model}/cpu.log"' -Pexp.device_type='"cpu"' -Pexp.docker=false
Also, I added -Pexp.docker=false to make sure it's using bare metal TensorFlow.
What happens in your case, I think, is the following. Your parameter exp.device is not taken into account. By default, the device type is computed from the GPUs provided by the user: if that value is an empty string, the device type is CPU, otherwise it is GPU. The default value for GPU devices is '0' (use GPU #0). So you basically ran TF in GPU mode.
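To illustrate that default, here is a minimal sketch of the logic as described in this comment. It is not the actual DLBS code: the function name `resolve_device_type` and the parameter key `exp.gpus` are assumptions made for the example, and it assumes that an explicitly set `exp.device_type` overrides the computed default.

```python
def resolve_device_type(params):
    """Hypothetical helper: pick 'cpu' or 'gpu' the way the comment describes."""
    # An explicitly provided exp.device_type is assumed to override the default.
    if 'exp.device_type' in params:
        return params['exp.device_type']
    # Assumption: 'exp.gpus' stands in for the GPU list parameter.
    # Its default is '0', meaning "use GPU #0".
    gpus = params.get('exp.gpus', '0')
    # Empty GPU list -> CPU, anything else -> GPU.
    return 'cpu' if gpus.strip() == '' else 'gpu'


# The old parameter name is simply ignored, so the run falls back to GPU mode.
old_style = resolve_device_type({'exp.device': 'cpu'})
# The new parameter name is honoured.
new_style = resolve_device_type({'exp.device_type': 'cpu'})
print(old_style, new_style)
```

Under these assumptions, passing only the old `exp.device` name leaves the GPU default of '0' in place, which would explain why the graph was built with `NcclAllReduce` ops placed on `/device:GPU:0` even though only a CPU was registered.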
I am trying to add more informative messages; this work is still in progress. Sorry for the confusion. The tutorial folder should contain up-to-date examples.
Yeah, I forgot to double check all of the parameters again, so I missed that. Everything is back to normal now and I can run ResNet* models using the latest update. Thanks again!
Ok, that's good. Thanks for figuring out all these bugs!
Hello again! I'm having difficulties running ResNet models at the moment. No matter which model I use, I always get a ValueError saying that the dimensions must be equal, but I can't quite track down where the discrepancy is coming from, other than that some of the convolutions are producing results of different dimensions. Not sure if you guys have run into this before. Here is the end of the output of my log: