AmpX-AI / tft-speedup

Speeding up Google's Temporal Fusion Transformer

Error occurs in Colab #3

Closed greatwhiz closed 3 years ago

greatwhiz commented 3 years ago

I am running it in a free Colab with a T4 GPU. The preinstalled TF version is 2.5, and I also tried downgrading to 2.4.1. Both versions fail with the same error.
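For context, a quick way to confirm the runtime described above (stock Colab image with TF 2.5 and a T4 GPU) is a check along these lines; this is just a generic sanity check, not part of the repository:

```python
import tensorflow as tf

print(tf.__version__)                          # e.g. '2.5.0' on the Colab image at the time
print(tf.config.list_physical_devices("GPU"))  # should list the Tesla T4 if the GPU runtime is enabled
```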

The error is as follows:

Cloning into 'tft-speedup'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 16 (delta 0), reused 13 (delta 0), pack-reused 0
Unpacking objects: 100% (16/16), done.
/content/tft-speedup/tft-speedup/tft-speedup
2021-05-25 22:13:33.712671: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-25 22:13:35.803273: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-25 22:13:35.805745: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-25 22:13:35.865853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:35.866998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-05-25 22:13:35.867045: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-25 22:13:35.893599: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-25 22:13:35.893684: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-25 22:13:35.995240: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-25 22:13:36.094845: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-25 22:13:36.299122: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-25 22:13:36.390671: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-05-25 22:13:36.391146: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-25 22:13:36.391286: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:36.391943: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:36.392481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
example.py:135: RuntimeWarning: divide by zero encountered in remainder
  for s_dependency, dependent, i in zip(
2021-05-25 22:13:36.457705: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-25 22:13:36.457860: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-25 22:13:36.458009: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:36.458660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-05-25 22:13:36.458703: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-25 22:13:36.458752: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-25 22:13:36.458776: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-25 22:13:36.458795: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-25 22:13:36.458816: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-25 22:13:36.458849: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-25 22:13:36.458867: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-05-25 22:13:36.458886: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-25 22:13:36.458961: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:36.459589: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:36.460124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-05-25 22:13:36.460187: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-25 22:13:37.208926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-25 22:13:37.208994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-05-25 22:13:37.209010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-05-25 22:13:37.209205: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:37.209931: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:37.210502: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-25 22:13:37.211023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13968 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1677, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1198, in concat_v2
    values, axis, name=name, ctx=_ctx)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1228, in concat_v2_eager_fallback
    _attr_T, values = _execute.args_to_matching_eager(list(values), ctx, [])
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 274, in args_to_matching_eager
    t, dtype, preferred_dtype=default_dtype, ctx=ctx)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1540, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py", line 339, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py", line 265, in constant
    allow_broadcast=True)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py", line 276, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py", line 301, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py", line 98, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/keras_tensor.py", line 274, in __array__
    'Cannot convert a symbolic Keras input/output to a numpy array. '
TypeError: Cannot convert a symbolic Keras input/output to a numpy array. This error may indicate that you're trying to pass a symbolic value to a NumPy call, which is not supported. Or, you may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1853, in _create_c_op
    c_op = pywrap_tf_session.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Fill dimensions must be >= 0 for '{{node ones}} = Fill[T=DT_FLOAT, index_type=DT_INT32](tf.concat_6/concat, ones/Const)' with input shapes: [3], [] and with input tensors computed as partial shapes: input[0] = [?,100,5].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 262, in <module>
    run_simple_experiment()
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "example.py", line 35, in run_simple_experiment
    results_v = simple_experiment("vectorized")
  File "example.py", line 219, in simple_experiment
    model = tft_model.get_model_vectorized(model_capable_vectorize=True, single_sequence=True)
  File "/content/tft-speedup/tft-speedup/tft-speedup/tft_model.py", line 904, in get_model_vectorized
    historical_windowed, future_windowed, static_emb, batch_dimensions=2
  File "/content/tft-speedup/tft-speedup/tft-speedup/tft_model.py", line 638, in build_base_tft_graph
    get_lstm(return_state=False), batch_dimensions, historical_features
  File "/content/tft-speedup/tft-speedup/tft-speedup/tf_utils.py", line 34, in timedistributed_over_more_batch_dimensions
    seq_squashed, batch_shape_orig = squash_batch_dimensions(seq, batch_dims)
  File "/content/tft-speedup/tft-speedup/tft-speedup/tf_utils.py", line 86, in squash_batch_dimensions
    new_shape = tf.concat([[-1], retain_shape], axis=-1)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 205, in wrapper
    result = dispatch(wrapper, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 122, in dispatch
    result = dispatcher.handle(op, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/layers/core.py", line 1450, in handle
    return TFOpLambda(op)(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 952, in __call__
    input_list)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1091, in _functional_construction_call
    inputs, input_masks, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 822, in _keras_tensor_symbolic_call
    return self._infer_output_signature(inputs, args, kwargs, input_masks)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 869, in _infer_output_signature
    keras_tensor.keras_tensor_from_tensor, outputs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py", line 659, in map_structure
    structure[0], [func(x) for x in entries],
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py", line 659, in <listcomp>
    structure[0], [func(x) for x in entries],
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/keras_tensor.py", line 606, in keras_tensor_from_tensor
    out = keras_tensor_cls.from_tensor(tensor)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/keras_tensor.py", line 193, in from_tensor
    inferred_value = array_ops.ones(shape=tensor).shape
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py", line 3132, in ones
    output = fill(shape, constant(one, dtype=dtype), name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py", line 239, in fill
    result = gen_array_ops.fill(dims, value, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 3358, in fill
    "Fill", dims=dims, value=value, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 592, in _create_op_internal
    compute_device)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 2016, in __init__
    control_input_ops, op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1856, in _create_c_op
    raise ValueError(str(e))
ValueError: Fill dimensions must be >= 0 for '{{node ones}} = Fill[T=DT_FLOAT, index_type=DT_INT32](tf.concat_6/concat, ones/Const)' with input shapes: [3], [] and with input tensors computed as partial shapes: input[0] = [?,100,5].

greatwhiz commented 3 years ago

It went through after downgrading to 2.3.2. However, is it possible to fix the code so it works with the latest TensorFlow?
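For anyone reproducing this workaround, pinning the TensorFlow version in a Colab cell (the exact command used by the reporter isn't shown, so this is just the usual pattern) looks like:

```python
# Run in a Colab cell, then restart the runtime so the pinned version is loaded.
!pip install tensorflow==2.3.2
```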

holi-ampx commented 3 years ago

Thanks! We have updated the code to make it work on TensorFlow 2.5. (Please tell us if the problem persists.)