DeepRegNet / DeepReg

Medical image registration using deep learning
Apache License 2.0

Docs: Support CUDA 10.0 with tf2.0.0 #179

Closed mathpluscode closed 3 years ago

mathpluscode commented 4 years ago

Issue description

We currently use tf2.2, whose GPU build requires CUDA 10.1 (see https://www.tensorflow.org/install/source). Conda can handle some NVIDIA-related packages (cuDNN etc.), but it cannot change the CUDA version.

So if a machine only has CUDA 10.0, the user might not want to upgrade it, because that may require a GPU driver update, which is risky.

I believe nothing in the code currently requires a TensorFlow version above 2.0, so it would be nice to provide a version that works with tf2.0.

@NMontanaBrown @YipengHu

mathpluscode commented 4 years ago

Actually, we just need to change the requirement in setup.py to tensorflow>=2.0.

mathpluscode commented 4 years ago

Actually, after testing on DGX and PT, tf2.0.0 has a bug. It couldn't run:

train --gpu "" --config_path deepreg/config/unpaired_labeled_ddf.yaml --log_dir test

This might be related to the difference between tf.shape(x) and x.shape:

Epoch 1/2
2020-07-18 18:51:41.096040: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
1/3 [=========>....................] - ETA: 41s - loss: 0.8558 - loss/regularization: 0.0000e+00 - loss/weighted_regularization: 0.0000e+00 - loss/image_dissimilarity: -0.6766 - loss/weighted_image_dissimilarity: -0.0677 - loss/label_dissimilarity: 0.9235 - loss/weighted_label_dissimilarity: 0.9235 - metric/dice_binary: 0.2295 - metric/dice_float: 0.2295 - metric/tre: 3.2545 - metric/foreground_la
2/3 [===================>..........] - ETA: 10s - loss: 0.8929 - loss/regularization: 6.9410e-09 - loss/weighted_regularization: 3.4705e-09 - loss/image_dissimilarity: -0.6864 - loss/weighted_image_dissimilarity: -0.0686 - loss/label_dissimilarity: 0.9615 - loss/weighted_label_dissimilarity: 0.9615 - metric/dice_binary: 0.3647 - metric/dice_float: 0.1147 - metric/tre: 2.5286 - metric/foreground_la
3/3 [==============================] - 25s 8s/step - loss: 0.9015 - loss/regularization: 1.7366e-08 - loss/weighted_regularization: 8.6828e-09 - loss/image_dissimilarity: -0.7147 - loss/weighted_image_dissimilarity: -0.0715 - loss/label_dissimilarity: 0.9730 - loss/weighted_label_dissimilarity: 0.9730 - metric/dice_binary: 0.5765 - metric/dice_float: 0.0820 - metric/tre: 1.6857 - metric/foreground_label: 0.0044 - metric/foreground_pred: 0.0126
Traceback (most recent call last):
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/bin/train", line 33, in <module>
    sys.exit(load_entry_point('deepreg', 'console_scripts', 'train')())
  File "/home/yunguan/Git/DeepReg/deepreg/train.py", line 210, in main
    args.gpu, args.config_path, args.gpu_allow_growth, args.ckpt_path, args.log_dir
  File "/home/yunguan/Git/DeepReg/deepreg/train.py", line 146, in train
    callbacks=[tensorboard_callback, checkpoint_callback],
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 370, in fit
    total_epochs=1)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
    *args, **kwds))
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
    per_replica_function, args=(model, x, y, sample_weights))
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 661, in _call_for_each_replica
    fn, args, kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
    coord.join(threads)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 879, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 327, in test_on_batch
    output_loss_metrics=model._output_loss_metrics)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 354, in test_on_batch
    output_loss_metrics=output_loss_metrics))
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 127, in _model_loss
    outs = model(inputs, **kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 847, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 708, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 860, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 847, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/home/yunguan/Git/DeepReg/deepreg/model/layer.py", line 359, in call
    vol=inputs[1], loc=grid_warped
  File "/home/yunguan/Git/DeepReg/deepreg/model/layer_util.py", line 278, in resample
    tf.reshape(tf.range(batch_size), [batch_size] + [1] * len(loc_shape)),
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/ops/math_ops.py", line 1422, in range
    limit = ops.convert_to_tensor(limit, name="limit")
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1184, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1242, in convert_to_tensor_v2
    as_ref=False)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1296, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 286, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
    allow_broadcast=True)
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 265, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_util.py", line 437, in make_tensor_proto
    raise ValueError("None values not supported.")
ValueError: None values not supported.
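
The ValueError above fits the tf.shape(x) versus x.shape hypothesis. When a function is traced with an unknown batch dimension, x.shape[0] is the static size, which is None, and tf.range(None) cannot be converted to a tensor. tf.shape(x)[0] is evaluated at run time instead. Below is a minimal sketch, assuming TensorFlow 2.x. The function `batch_indices` and its input signature are hypothetical illustrations, not DeepReg code.

```python
import tensorflow as tf


# Hypothetical signature: the batch dimension is left unknown on purpose.
@tf.function(
    input_signature=[tf.TensorSpec(shape=[None, 16, 16, 16], dtype=tf.float32)]
)
def batch_indices(vol):
    # vol.shape[0] is the static batch size. It is None while tracing, so
    # tf.range(vol.shape[0]) would raise "None values not supported."
    # tf.shape(vol)[0] is a scalar int32 tensor resolved at run time.
    batch_size = tf.shape(vol)[0]
    return tf.range(batch_size)


print(batch_indices(tf.zeros([2, 16, 16, 16])).numpy())  # [0 1]
```

The same traced function then works for any batch size, because nothing in the graph depends on a Python-level integer for the batch dimension.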

But after the tf.shape fix, the code still doesn't work with two GPUs under tf2.0.0:

2020-07-18 21:37:29.271976: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2020-07-18 21:37:29.273022: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.0
1/1 [==============================] - 28s 28s/step
2020-07-18 21:37:29.647086: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 0 kernel records, 185 memcpy records.
Traceback (most recent call last):
  File "/home/yunguan/miniconda3/envs/deepreg/bin/train", line 33, in <module>
    sys.exit(load_entry_point('deepreg', 'console_scripts', 'train')())
  File "/home/yunguan/Git/DeepReg/deepreg/train.py", line 210, in main
    args.gpu, args.config_path, args.gpu_allow_growth, args.ckpt_path, args.log_dir
  File "/home/yunguan/Git/DeepReg/deepreg/train.py", line 146, in train
    callbacks=[tensorboard_callback, checkpoint_callback],
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:   indices[3,15,15,15] = [3, 13, 13, 15] does not index into param shape [2,16,16,16]
         [[{{node PartitionedCall/GatherNd_1}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]]
         [[div_no_nan_9/ReadVariableOp_1/_20]]
  (1) Invalid argument:   indices[3,15,15,15] = [3, 13, 13, 15] does not index into param shape [2,16,16,16]
         [[{{node PartitionedCall/GatherNd_1}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]]
0 successful operations.
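
To read the InvalidArgumentError above: GatherNd requires every index to satisfy 0 <= index < size on each axis of the parameter tensor. The short, framework-free sketch below checks the reported index against the reported shape. The helper name `first_out_of_range_axis` is hypothetical and only serves the illustration.

```python
def first_out_of_range_axis(index, shape):
    """Return the first axis where index falls outside shape, or None if valid."""
    for axis, (i, size) in enumerate(zip(index, shape)):
        if not 0 <= i < size:
            return axis
    return None


# Values copied from the error message.
index = [3, 13, 13, 15]
param_shape = [2, 16, 16, 16]
print(first_out_of_range_axis(index, param_shape))  # 0
```

Only axis 0, the batch axis, is out of range: the index asks for sample 3 of a tensor holding 2 samples. One possible reading, which this thread does not confirm, is that the batch indices were built from the global batch size while each of the two replicas holds only part of the batch.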

mathpluscode commented 4 years ago

Sorry, it's still WIP.

YipengHu commented 4 years ago

> Sorry it's still in WIP

Add [WIP] in front of the PR. Sorry, I thought that was easy :)

mathpluscode commented 4 years ago

> > Sorry it's still in WIP
>
> Add [WIP] in front of the PR. Sorry, I thought that was easy :)

Yes! Sorry, I should have done this.

mathpluscode commented 4 years ago

Let's revisit this after the release. There is too much work to do at the moment. @YipengHu @NMontanaBrown

mathpluscode commented 3 years ago

Since different TF versions depend on different versions of the CUDA toolkit and cuDNN, it may be difficult to support old versions. I therefore suggest closing this issue. @YipengHu @NMontanaBrown

YipengHu commented 3 years ago

I'm closing this ticket.