Closed: mathpluscode closed this issue 3 years ago
Actually, we just need to change the requirement in setup.py to tensorflow>=2.0.
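As a rough illustration of what a `tensorflow>=2.0` requirement accepts, here is a minimal sketch that compares only the major and minor version numbers. The helper name `meets_minimum` is hypothetical, and the full specifier rules that pip applies (PEP 440) are more involved than this.

```python
def meets_minimum(version, minimum=(2, 0)):
    """Return True if the 'major.minor' part of `version` is at least `minimum`."""
    # Only the first two dot-separated components are compared.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum


for candidate in ("1.15.0", "2.0.0", "2.2.0"):
    print(candidate, meets_minimum(candidate))
```

At run time, the same check could be applied to `tf.__version__` to report which TensorFlow a given environment actually resolved to.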
Actually, after testing on DGX and PT, tf2.0.0 has a bug. This command fails to run:
train --gpu "" --config_path deepreg/config/unpaired_labeled_ddf.yaml --log_dir test
This might be related to the difference between tf.shape(x) and x.shape.
Epoch 1/2
2020-07-18 18:51:41.096040: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
1/3 [=========>....................] - ETA: 41s - loss: 0.8558 - loss/regularization: 0.0000e+00 - loss/weighted_regularization: 0.0000e+00 - loss/image_dissimilarity: -0.6766 - loss/weighted_image_dissimilarity: -0.0677 - loss/label_dissimilarity: 0.9235 - loss/weighted_label_dissimilarity: 0.9235 - metric/dice_binary: 0.2295 - metric/dice_float: 0.2295 - metric/tre: 3.2545 - metric/foreground_la
2/3 [===================>..........] - ETA: 10s - loss: 0.8929 - loss/regularization: 6.9410e-09 - loss/weighted_regularization: 3.4705e-09 - loss/image_dissimilarity: -0.6864 - loss/weighted_image_dissimilarity: -0.0686 - loss/label_dissimilarity: 0.9615 - loss/weighted_label_dissimilarity: 0.9615 - metric/dice_binary: 0.3647 - metric/dice_float: 0.1147 - metric/tre: 2.5286 - metric/foreground_la
3/3 [==============================] - 25s 8s/step - loss: 0.9015 - loss/regularization: 1.7366e-08 - loss/weighted_regularization: 8.6828e-09 - loss/image_dissimilarity: -0.7147 - loss/weighted_image_dissimilarity: -0.0715 - loss/label_dissimilarity: 0.9730 - loss/weighted_label_dissimilarity: 0.9730 - metric/dice_binary: 0.5765 - metric/dice_float: 0.0820 - metric/tre: 1.6857 - metric/foreground_label: 0.0044 - metric/foreground_pred: 0.0126
Traceback (most recent call last):
File "/home/yunguan/miniconda3/envs/deepreg_tf2/bin/train", line 33, in <module>
sys.exit(load_entry_point('deepreg', 'console_scripts', 'train')())
File "/home/yunguan/Git/DeepReg/deepreg/train.py", line 210, in main
args.gpu, args.config_path, args.gpu_allow_growth, args.ckpt_path, args.log_dir
File "/home/yunguan/Git/DeepReg/deepreg/train.py", line 146, in train
callbacks=[tensorboard_callback, checkpoint_callback],
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 370, in fit
total_epochs=1)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
distributed_function(input_fn))
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
result = self._call(*args, **kwds)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
self._initialize(args, kwds, add_initializers_to=initializer_map)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
*args, **kwds))
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
return weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
per_replica_function, args=(model, x, y, sample_weights))
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 661, in _call_for_each_replica
fn, args, kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
coord.join(threads)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 879, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
return func(*args, **kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 327, in test_on_batch
output_loss_metrics=model._output_loss_metrics)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 354, in test_on_batch
output_loss_metrics=output_loss_metrics))
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 127, in _model_loss
outs = model(inputs, **kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 847, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 708, in call
convert_kwargs_to_constants=base_layer_utils.call_context().saving)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 860, in _run_internal_graph
output_tensors = layer(computed_tensors, **kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 847, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
return func(*args, **kwargs)
File "/home/yunguan/Git/DeepReg/deepreg/model/layer.py", line 359, in call
vol=inputs[1], loc=grid_warped
File "/home/yunguan/Git/DeepReg/deepreg/model/layer_util.py", line 278, in resample
tf.reshape(tf.range(batch_size), [batch_size] + [1] * len(loc_shape)),
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/ops/math_ops.py", line 1422, in range
limit = ops.convert_to_tensor(limit, name="limit")
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1184, in convert_to_tensor
return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1242, in convert_to_tensor_v2
as_ref=False)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1296, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 286, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
allow_broadcast=True)
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 265, in _constant_impl
allow_broadcast=allow_broadcast))
File "/home/yunguan/miniconda3/envs/deepreg_tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_util.py", line 437, in make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
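The last DeepReg frame in the traceback passes `batch_size` to `tf.range`, and the error says a `None` reached the tensor conversion. That fits the `tf.shape(x)` versus `x.shape` idea: the static shape can have an unknown batch dimension while a graph is being traced, whereas `tf.shape` returns a tensor that is resolved at run time. Below is a minimal sketch of the difference. The function name `batch_index_column` and the `[None, 16, 16, 16]` signature are assumptions made for illustration and are not taken from `layer_util.py`.

```python
import tensorflow as tf

# Assumed input signature: unknown batch size, 16x16x16 volume.
spec = tf.TensorSpec(shape=[None, 16, 16, 16], dtype=tf.float32)

# The static batch dimension is unknown, so it is reported as None.
print(spec.shape[0])


@tf.function(input_signature=[spec])
def batch_index_column(vol):
    # vol.shape[0] would be None while tracing, and tf.range cannot convert None.
    # tf.shape(vol)[0] is a scalar tensor holding the batch size at run time.
    batch_size = tf.shape(vol)[0]
    return tf.reshape(tf.range(batch_size), [batch_size, 1, 1, 1])


out = batch_index_column(tf.zeros([2, 16, 16, 16]))
print(out.shape)
```

Using the dynamic shape keeps the same traced function valid for any batch size, at the cost of losing static shape information for that dimension.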
But after the tf.shape fix, the code still doesn't work with two GPUs under tf2.0.0:
2020-07-18 21:37:29.271976: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2020-07-18 21:37:29.273022: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.0
1/1 [==============================] - 28s 28s/step
2020-07-18 21:37:29.647086: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 0 kernel records, 185 memcpy records.
Traceback (most recent call last):
File "/home/yunguan/miniconda3/envs/deepreg/bin/train", line 33, in <module>
sys.exit(load_entry_point('deepreg', 'console_scripts', 'train')())
File "/home/yunguan/Git/DeepReg/deepreg/train.py", line 210, in main
args.gpu, args.config_path, args.gpu_allow_growth, args.ckpt_path, args.log_dir
File "/home/yunguan/Git/DeepReg/deepreg/train.py", line 146, in train
callbacks=[tensorboard_callback, checkpoint_callback],
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
total_epochs=epochs)
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
distributed_function(input_fn))
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
result = self._call(*args, **kwds)
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
return self._stateless_fn(*args, **kwds)
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
self.captured_inputs)
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
ctx=ctx)
File "/home/yunguan/miniconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: indices[3,15,15,15] = [3, 13, 13, 15] does not index into param shape [2,16,16,16]
[[{{node PartitionedCall/GatherNd_1}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
[[div_no_nan_9/ReadVariableOp_1/_20]]
(1) Invalid argument: indices[3,15,15,15] = [3, 13, 13, 15] does not index into param shape [2,16,16,16]
[[{{node PartitionedCall/GatherNd_1}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
0 successful operations.
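One way to read the error above, offered as an interpretation rather than a confirmed diagnosis: the index tensor addresses batch entry 3, but the volume being gathered from holds only 2 entries. That is what would happen if batch indices were built for a global batch of 4 while each of the two GPUs received half of it. The sketch below is plain Python with no TensorFlow. The function name and the assumed global batch size of 4 are illustrative and do not come from the DeepReg code.

```python
def out_of_range_batch_indices(index_batch_size, param_batch_size):
    """List the batch indices that a gather over `param_batch_size` volumes cannot resolve."""
    return [b for b in range(index_batch_size) if b >= param_batch_size]


global_batch_size = 4  # assumption: implied by the index position [3, 15, 15, 15]
num_replicas = 2       # two GPUs, as stated in the report
per_replica_batch_size = global_batch_size // num_replicas

# Indices built for the global batch, gathered from a per-replica volume.
print(out_of_range_batch_indices(global_batch_size, per_replica_batch_size))
# Indices built from the per-replica batch size stay in range.
print(out_of_range_batch_indices(per_replica_batch_size, per_replica_batch_size))
```

If this reading is right, any batch size used to build gather indices would need to come from the tensor each replica actually receives rather than from the configured global value.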
Sorry, it's still WIP.
Add [WIP] in front of the PR. Sorry - thought that was easy :)
Yes! Sorry, I should have done this.
Let's consider this later, after the release. There is too much work to do at the moment. @YipengHu @NMontanaBrown
As different TF versions have different dependencies on the CUDA toolkit and cuDNN, it might be difficult to support old versions. I therefore suggest closing this issue. @YipengHu @NMontanaBrown
I'm closing this ticket.
Issue description
Currently, we are using tf2.2, which needs CUDA 10.1 for GPU support (see https://www.tensorflow.org/install/source). Although conda can take care of some NVIDIA-related packages (cuDNN etc.), it will not be able to change the CUDA version.
Therefore, if a machine only has CUDA 10.0, the user might not want to upgrade it, as that may require a GPU driver update, which is risky.
I believe no function currently requires a TensorFlow version above 2.0, so it would be nice to provide a version that works with tf2.0.
@NMontanaBrown @YipengHu
Type of Issue