HewlettPackard / dlcookbook-dlbs

Deep Learning Benchmarking Suite
https://www.hpe.com/software/dl-cookbook
Apache License 2.0

Issue running ResNet models #2

Closed: roclark closed this issue 6 years ago

roclark commented 6 years ago

Hello again! I'm having difficulties running ResNet models at the moment. No matter which ResNet variant I use, I always get a ValueError saying the dimensions must be equal, but I can't quite track down where the discrepancy comes from, other than that some of the convolutions produce outputs of different dimensions. Not sure if you have run into this before. Here is the end of my log output:

BenchmarkCNN::__init__ time=0.061035 ms
TensorFlow:  1.4
Model:       resnet50
Mode:        training
Batch size:  16 global
             16 per device
Devices:     ['/cpu:0']
Data format: NHWC
Optimizer:   sgd
Variables:   replicated
Use NCCL:    False
==========
__exp.model_title__="ResNet50"
Generating model
Adding preprocessing for resnet50
Traceback (most recent call last):
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1454, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1450, in main
    bench.run()
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 995, in run
    self._benchmark_cnn()
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1035, in _benchmark_cnn
    (enqueue_ops, fetches) = self._build_model()
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1208, in _build_model
    gpu_grad_stage_ops)
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1358, in add_forward_pass_and_gradients
    self.model_conf.add_inference(network)
  File "/root/dlbs/python/tf_cnn_benchmarks/resnet_model.py", line 78, in add_inference
    dim_match=False, bottle_neck=bottle_neck)
  File "/root/dlbs/python/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 527, in residual_unit
    self.top_layer = tf.nn.relu(shortcut + res)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 183, in add
    "Add", x=x, y=y, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2958, in create_op
    set_shapes_for_outputs(ret)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2209, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2159, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 627, in call_cpp_shape_fn
    require_shape_fn)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 691, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Dimensions must be equal, but are 54 and 52 for 'v0/tower_0/add' (op: 'Add') with input shapes: [16,54,55,256], [16,52,55,256].
__results.end_time__= "2018-01-19:16:21:20:504"
__results.proc_pid__= 5009
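
For what it's worth, the 54 vs. 52 gap in the error looks like it could come from a padding difference between the shortcut and the residual branch of one residual unit; both branches have to produce identical shapes before the shortcut + res add. Here is a rough sketch of the shape arithmetic (my guess at the cause, not verified against resnet_model.py):

# Rough sketch of TensorFlow's convolution output-size rules, showing how
# the two branches of a residual unit could arrive at 54 vs. 52 rows.
# The padding choices below are assumptions, not taken from the repo code.
import math

def conv_out_size(in_size, kernel, stride, padding):
    if padding == 'SAME':
        return int(math.ceil(float(in_size) / stride))
    if padding == 'VALID':
        return int(math.ceil(float(in_size - kernel + 1) / stride))
    raise ValueError('unknown padding: %s' % padding)

h = 54  # hypothetical height entering the unit
print(conv_out_size(h, kernel=1, stride=1, padding='SAME'))   # 54 (shortcut branch)
print(conv_out_size(h, kernel=3, stride=1, padding='VALID'))  # 52 (residual branch)
# Adding [16, 54, 55, 256] to [16, 52, 55, 256] then fails exactly as in the traceback.
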
sergey-serebryakov commented 6 years ago

Hi @roclark, could you please copy/paste the command line you used to run it? Also, could you provide information on which version of TF you are using (the log says 1.4; did you install it with pip install ...?)? Thanks.

PS: Please also keep in mind that we've released a new version whose command line arguments are not entirely compatible with the previous version (but that's not the issue here).

sergey-serebryakov commented 6 years ago

OK, confirming this. It's another CPU issue that I thought I had fixed. I will commit an update this weekend.

sergey-serebryakov commented 6 years ago

It's fixed now. It turns out my previous fix only partially solved the problem.

roclark commented 6 years ago

Thanks again for the help @sergey-serebryakov! For the sake of completeness, here is the command I was running:

python python/dlbs/experimenter.py run -Pexp.framework='"tensorflow"' -Pexp.phase='"training"' -Vexp.model='["resnet50"]' -Vexp.device_batch='"16"' -Pexp.log_file='"./tensorflow/lustre/${exp.model}/cpu.log"' -Pexp.device='"cpu"'

As for TensorFlow, yes, I installed it using pip:

$ pip show tensorflow
Name: tensorflow
Version: 1.4.1
Summary: TensorFlow helps the tensors flow
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /usr/lib64/python2.7/site-packages
Requires: enum34, protobuf, six, wheel, backports.weakref, numpy, tensorflow-tensorboard, mock

I saw that you pushed a substantial update a week or so ago that changes a lot of the command line arguments, as you mention. At the moment I'm working on using the new updates in a Docker-less environment and sorting out some kinks (in parallel I'm also creating a Dockerfile that connects to various POSIX filesystems with specific client software; once that is running, it is what I intend to use). Do the new updates still support bare-metal deployments? I admit I haven't dived too deep yet, but I didn't see a way to run without Docker using the latest updates.

Thanks for the quick updates! They are all much appreciated!

roclark commented 6 years ago

Unfortunately it looks like I took a step backwards:

BenchmarkCNN::__init__ time=0.116825 ms
TensorFlow:  1.4
Model:       resnet50
Mode:        training
Batch size:  16 global
             16 per device
Devices:     ['/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   replicated
Use NCCL:    True
==========
__exp.model_title__="ResNet50"
Generating model
Adding preprocessing for resnet50
Reshaped input to (16, 2048) with self.top_size = 2048
Added final fully connected layer with logits shape (16, 1001)
Adding sparse softmax cross entropy with logits with logits shape (16, 1001) and labels shape (16,)
Traceback (most recent call last):
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1472, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1468, in main
    bench.run()
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1000, in run
    self._benchmark_cnn()
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1091, in _benchmark_cnn
    start_standard_services=FLAGS.summary_verbosity > 0) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs.  Registered devices: [CPU], Registered kernels:
  <no registered kernels>

         [[Node: NcclAllReduce_160 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c160", _device="/device:GPU:0"](v0/tower_0/gradients/AddN)]]

Caused by op u'NcclAllReduce_160', defined at:
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1472, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1468, in main
    bench.run()
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1000, in run
    self._benchmark_cnn()
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1040, in _benchmark_cnn
    (enqueue_ops, fetches) = self._build_model()
  File "/workspace/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1267, in _build_model
    self.variable_mgr.preprocess_device_grads(device_grads))
  File "/workspace/tf_cnn_benchmarks/variable_mgr.py", line 468, in preprocess_device_grads
    device_grads, self.benchmark_cnn.devices)
  File "/workspace/tf_cnn_benchmarks/variable_mgr.py", line 661, in sum_gradients_all_reduce
    new_tower_grads.append(sum_grad_and_var_all_reduce(grad_and_vars, devices))
  File "/workspace/tf_cnn_benchmarks/variable_mgr.py", line 649, in sum_grad_and_var_all_reduce
    summed_grads = nccl.all_sum(scaled_grads)
  File "/usr/lib/python2.7/site-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
    return _apply_all_reduce('sum', tensors)
  File "/usr/lib/python2.7/site-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 208, in _apply_all_reduce
    shared_name=shared_name))
  File "/usr/lib/python2.7/site-packages/tensorflow/contrib/nccl/ops/gen_nccl_ops.py", line 54, in nccl_all_reduce
    num_devices=num_devices, shared_name=shared_name, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NcclAllReduce' with these attrs.  Registered devices: [CPU], Registered kernels:
  <no registered kernels>

         [[Node: NcclAllReduce_160 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c160", _device="/device:GPU:0"](v0/tower_0/gradients/AddN)]]

__results.end_time__= "2018-01-22:20:20:46:075"
__results.proc_pid__= 50804

Is the NVIDIA Collective Communications Library required even for CPU-only environments?

sergey-serebryakov commented 6 years ago

Hi @roclark. Recent changes affected parameter names. General parameters are described here. In particular, exp.device is now exp.device_type. So please try running this:

python python/dlbs/experimenter.py run -Pexp.framework='"tensorflow"' -Pexp.phase='"training"' -Vexp.model='["resnet50"]' -Vexp.device_batch='"16"' -Pexp.log_file='"./tensorflow/lustre/${exp.model}/cpu.log"' -Pexp.device_type='"cpu"' -Pexp.docker=false

Also, I added -Pexp.docker=false to make sure it uses the bare-metal TensorFlow.

I think what happens in your case is the following: your exp.device parameter is simply not taken into account. By default, the device type is computed from the list of GPUs provided by the user: if it is an empty string, the device type is CPU, otherwise GPU. The default value for the GPU list is '0' (use GPU #0), so you basically ran TF in GPU mode; see the sketch below.
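
To illustrate, the default selection roughly works like this (an illustrative sketch only, not the actual DLBS implementation):

# Illustrative sketch of the default device-type logic described above;
# parameter handling in the real experimenter code is more involved.
def default_device_type(gpus='0'):
    # The GPU list defaults to '0' (use GPU #0); an empty string means CPU-only.
    return 'cpu' if gpus.strip() == '' else 'gpu'

print(default_device_type())    # 'gpu' -> NCCL all-reduce path, hence the error above
print(default_device_type(''))  # 'cpu'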

I am trying to add more informative messages; this work is still in progress. Sorry for the confusion. The tutorial folder should contain up-to-date examples.

roclark commented 6 years ago

Yeah, I forgot to double-check all of the parameters, so I missed that. Everything is back to normal now and I can run the ResNet* models with the latest update. Thanks again!

sergey-serebryakov commented 6 years ago

Ok, that's good. Thanks for figuring out all these bugs!