Closed trillionmonster closed 3 years ago
@trillionmonster @usimarit I'm working on the TPU support on this branch: (monatis/tpu). Much of the work has already been completed and I hope to make a pull request with a Colab within a few days.
On the second issue, you don't need to create tfrecords beforehand with the newest code. When you run a train_*.py script, it checks whether the tfrecords already exist, and if they do it returns True immediately, just as with the local filesystem. If it tries to recreate them even though they were already created beforehand, you may need to check read access to that GCS bucket --please note that read permission is separate and independent from write access on GCS.
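If it helps, a quick way to sanity-check read access from the same runtime is to list the bucket with tf.io.gfile (the bucket path below is just a placeholder):

    import tensorflow as tf

    # Placeholder path -- replace with your own GCS location.
    TFRECORDS_DIR = "gs://your-bucket/tfrecords/"

    # listdir needs read permission on the bucket and will raise an error without it;
    # exists may simply return False if the object is not readable/visible.
    print(tf.io.gfile.exists(TFRECORDS_DIR))
    print(tf.io.gfile.listdir(TFRECORDS_DIR))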
I cloned and installed it, but it doesn't work:
2021-01-07 06:24:42.559152: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-01-07 06:24:45.764710: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 06:24:45.765828: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-07 06:24:45.774893: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-07 06:24:45.774941: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (3472e8d374ea): /proc/driver/nvidia/version does not exist
2021-01-07 06:24:45.777737: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Cannot import RNNT loss in warprnnt. Falls back to RNNT in TensorFlow
Note: The RNNT in Tensorflow is not supported for CPU yet
Loading max lengths from /root/train.max_lengths.txt ...
Loading max lengths from /root/eval.max_lengths.txt ...
2021-01-07 06:24:47.491447: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 06:24:47.499031: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.25.169.170:8470}
2021-01-07 06:24:47.499082: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:40883}
2021-01-07 06:24:47.516967: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.25.169.170:8470}
2021-01-07 06:24:47.517028: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:40883}
2021-01-07 06:24:47.517496: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:40883
All TPUs: [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')]
………………………………
TFRecords're already existed: train
TFRecords're already existed: eval
[Train] | | 1/? [00:59<00:00, 59.98s/batch]Traceback (most recent call last):
File "train_tpu_subword_conformer.py", line 146, in
@trillionmonster Forward and backward passes are actually working, but it throws an error in the _check_* methods of the BaseRunner class. It is possible that conditionals or other statements in these methods are not supported on TPU. If that is the case, we may also need to refactor this class with pure TF ops. I'll deal with it later this week, probably at the weekend.
@monatis I'm trying to fix it, too 😄
@monatis You can try commenting out the _check* lines and copying their contents to the end of each epoch, so that log_interval, save_interval and eval_interval are ignored and we save and validate the model after each epoch.
If it runs smoothly after the first epoch, then the problem surely lies in the _check* functions, and we will have to keep a copy of self.steps in a plain Python variable, for example self._steps, and use that value in the condition checks instead of reading directly from the tf.Variable self.steps. The tf.Variable self.steps is needed for storing checkpoints, so we cannot remove it.
And I neither have GCS nor want to use your GCS, so I can only offer ideas to solve the problems :smile:
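A rough sketch of that idea (this is not the actual BaseRunner code; the config attribute name and the print placeholder are just illustrative):

    import tensorflow as tf

    class BaseRunner:
        def __init__(self, config):
            self.config = config
            self.steps = tf.Variable(0, dtype=tf.int64, trainable=False)  # kept for checkpointing
            self._steps = 0  # plain Python mirror used only for interval checks

        def _update_steps(self):
            self.steps.assign_add(1)  # checkpointed value
            self._steps += 1          # Python-side copy

        def _check_log_interval(self):
            # Pure Python condition: nothing here reads a tf.Variable,
            # so the check stays outside the TPU-compiled graph.
            if self._steps % self.config.log_interval_steps == 0:
                print(f"step {self._steps}: write logs/metrics here")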
@monatis Maybe the dataset is too large or the GPU on Colab is too slow; it takes 220 hours on GPU. I think training on TPU is necessary 😄
[Train]: 0% 0/103240 [00:00<?, ?batch/s]
2021-01-08 05:13:19.107514: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-08 05:13:19.107919: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2200000000 Hz
2021-01-08 05:13:59.745275: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-01-08 05:14:10.661856: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 30 of 300
2021-01-08 05:14:21.133863: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 54 of 300
2021-01-08 05:14:30.778680: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 83 of 300
2021-01-08 05:14:40.190762: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 110 of 300
2021-01-08 05:14:50.421657: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 141 of 300
2021-01-08 05:15:00.217172: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 164 of 300
2021-01-08 05:15:10.108075: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 191 of 300
2021-01-08 05:15:21.182262: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 223 of 300
2021-01-08 05:15:31.099524: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 251 of 300
2021-01-08 05:15:40.172470: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 280 of 300
2021-01-08 05:15:45.776024: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:230] Shuffle buffer filled.
2021-01-08 05:15:53.641993: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[Train] [Epoch 1/20]: 0% 45/103240 [06:50<220:52:12, 7.71s/batch, transducer_loss=848.9718]^C
@usimarit @monatis I changed the code in base_runner.py:
def _end_epoch(self):
    self.save_checkpoint()
    self.save_model_weights()
    self._write_to_tensorboard(self.train_metrics, self.steps, stage="train")
    for metric in self.train_metrics.keys():
        self.train_metrics[metric].reset_states()
    self._eval_epoch()

def _train_epoch(self):
    """Train model one epoch."""
    train_iterator = iter(self.train_data_loader)
    train_steps = 0
    while True:
        try:
            self._train_function(train_iterator)  # Run train step
        except StopIteration:
            break
        except tf.errors.OutOfRangeError:
            break
        except Exception as e:
            raise e
        # # Update steps
        self.steps.assign_add(1)
        self.train_progbar.update(1)
        train_steps += 1
        # # Run save checkpoint
        # self._check_save_interval()
        # # Print epoch info
        self.train_progbar.set_description_str(
            f"[Train] [Epoch {self.epochs}/{self.config.num_epochs}]")
        # # Print train info to progress bar
        # self._print_train_metrics(self.train_progbar)
        # # Run logging
        # self._check_log_interval()
        # # Run evaluation
        # self._check_eval_interval()
    self._end_epoch()
    self.train_steps_per_epoch = train_steps
    self.train_progbar.total = self.total_train_steps
    self.train_progbar.refresh()
and the error changed:
Total params: 12,406,941
Trainable params: 12,402,333
Non-trainable params: 4,608
________________________________________________________________________________________________________________________
TFRecords're already existed: train
TFRecords're already existed: eval
[Train] [Epoch 1/20] | | 6854/? [02:20<00:00, 92.38batch/s]Traceback (most recent call last):
File "train_tpu_conformer.py", line 134, in <module>
conformer_trainer.fit(train_dataset, eval_dataset, train_bs=args.bs, eval_bs=args.bs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 317, in fit
self.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 192, in run
self._train_epoch()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 213, in _train_epoch
raise e
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 207, in _train_epoch
self._train_function(train_iterator) # Run train step
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 862, in _call
results = self._stateful_fn(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.CancelledError: 9 root error(s) found.
(0) Cancelled: Iterator was cancelled
[[node IteratorGetNext_6 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py:247) ]]
(1) Cancelled: Function was cancelled before it was started
(2) Cancelled: Function was cancelled before it was started
(3) Cancelled: Function was cancelled before it was started
(4) Cancelled: Function was cancelled before it was started
(5) Cancelled: Function was cancelled before it was started
(6) Cancelled: Function was cancelled before it was started
(7) Cancelled: Function was cancelled before it was started
(8) Cancelled: Function was cancelled before it was started
0 successful operations.
0 derived errors ignored.
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors. [Op:__inference__train_function_133151]
Errors may have originated from an input operation.
Input Source operations connected to node IteratorGetNext_6:
iterator_14 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py:207)
Function call stack:
_train_function -> _train_function -> _train_function -> _train_function -> _train_function -> _train_function -> _train_function -> _train_function -> _train_function
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/tpu_strategy.py", line 738, in async_wait
context.async_wait()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2330, in async_wait
context().sync_executors()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 645, in sync_executors
pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.CancelledError: 9 root error(s) found.
(0) Cancelled: {{function_node __inference__train_function_133151}} Iterator was cancelled
[[{{node IteratorGetNext_6}}]]
(1) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(2) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(3) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(4) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(5) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(6) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(7) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(8) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
0 successful operations.
0 derived errors ignored.
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
@trillionmonster Can you also try catching tf.errors.CancelledError and breaking out of the while loop there?
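For example, the exception handling in the _train_epoch loop posted above could be extended like this (a sketch against that snippet, not the exact repo code):

    while True:
        try:
            self._train_function(train_iterator)  # Run train step
        except (StopIteration, tf.errors.OutOfRangeError):
            break
        except tf.errors.CancelledError:
            # The TPU input iterator was cancelled; end the epoch cleanly
            # instead of re-raising and killing the whole run.
            break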
@usimarit I'll have a look at it at the weekend and implement a TPUBaseRunner if necessary. TPU training yields a significant speedup until it stops with the error @trillionmonster posted in the last comment, so it can contribute a lot to the research once fully debugged.
@monatis Yeah, you can do either of the following:
1. Create a TPUBaseTrainer that inherits directly from BaseTrainer, or
2. Change BaseTrainer to support both GPU and TPU.
The second way is recommended, since with the first way we would also have to create extra classes such as TPUTransducerTrainer and TPUTransducerTrainerGA inheriting from TPUBaseTrainer to use TPU for Transducer models.
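A minimal sketch of the second option, where a single BaseTrainer just picks the right tf.distribute strategy (the constructor arguments here are hypothetical, not the current API):

    import tensorflow as tf

    def make_strategy(tpu_address=None):
        # With a TPU address (or "" on Colab) use TPUStrategy, otherwise fall back to GPUs/CPU.
        if tpu_address is not None:
            resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
            tf.config.experimental_connect_to_cluster(resolver)
            tf.tpu.experimental.initialize_tpu_system(resolver)
            return tf.distribute.TPUStrategy(resolver)
        return tf.distribute.MirroredStrategy()

    class BaseTrainer:
        def __init__(self, config, tpu_address=None):
            self.config = config
            self.strategy = make_strategy(tpu_address)
            # model, optimizer and datasets are then built under self.strategy.scope()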
I gave it a try on the tpudev branch, but it still doesn't work.
Still this error:
"""(0) Cancelled: Iterator was cancelled"""
@monatis @trillionmonster I've been testing on TPU; it seems like the TPU on Colab cannot use tf.train.Checkpoint and has some other problems.
tf.train.Checkpoint seems to work, but the dataset iterator gets randomly cancelled; no idea why.
I tried different batch sizes and numbers of TFRecord shards, but unfortunately iteration is cancelled randomly after thousands of steps even in the basic setup on Colab, and I couldn't figure out why yet. Similar bugs have been reported recently. I think our best chance will be to try it out on Cloud TPU --I'm in the process of obtaining TFRC credits for this.
@monatis @trillionmonster I'm planning to deprecate the custom training loop and support tf.keras.Model.fit with some custom callbacks :smile: Maybe it supports TPU better without needing a bunch of customization, so we can focus more on SOTA models. Have you guys tried tf.keras.Model.fit on TPU yet?
@usimarit Never thought of doing so :D It may be easier to convert the RNNT loss to a plain old Keras .fit()-compatible API than to struggle with all the caveats of a custom training loop on TPUs. Definitely worth giving it a try 🚀
@monatis @usimarit I'm trying to use Huggingface's code to train a transformer-based model with a mask layer. It needs a lot of changes.
@usimarit @trillionmonster I finally managed to train it for 20 epochs with the custom training loop on this Colab. But I still need to implement TensorBoard logging and evaluation at the end of each epoch.
Implementation considerations are as follows:
- iter(train_data_loader) didn't work on TPU, so I iterated over the instance of DistributedDataset directly (see train_dl in the notebook).
- DistributedDataset didn't work eagerly, either, so I decorated the train_epoch function with tf.function.
- tf.train.Checkpoint.save() caused problems both inside and outside a tf.function-decorated function. I was able to save a checkpoint only with tf.train.Checkpoint.write() inside a tf.function-decorated function, which is a lower-level API. I will re-examine this.
- tf.keras.Model.save_weights() apparently fails to write latest.h5 to GCS, so I wrote to the local filesystem and then copied it to GCS.
- stdout when training on TPU. Note: for TensorBoard logging on TPU, we need to use tf.config.set_soft_device_placement(True).
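Putting those workarounds together, the training loop in the notebook roughly follows this shape (a sketch assuming strategy, model, optimizer, train_dl, train_step and num_epochs are defined elsewhere; the GCS paths are placeholders):

    import tensorflow as tf

    tf.config.set_soft_device_placement(True)  # needed for TensorBoard logging on TPU

    checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)

    @tf.function
    def train_epoch(dist_dataset, ckpt_path):
        # Iterate the DistributedDataset directly instead of wrapping it in iter()
        for batch in dist_dataset:
            strategy.run(train_step, args=(batch,))
        # Per the note above, Checkpoint.write() (the lower-level API) was the variant
        # that saved successfully, and only inside a tf.function-decorated function.
        checkpoint.write(ckpt_path)

    for epoch in range(num_epochs):
        train_epoch(train_dl, f"gs://your-bucket/checkpoints/ckpt-{epoch}")
        # save_weights() straight to GCS failed, so write locally and copy afterwards.
        model.save_weights("latest.h5")
        tf.io.gfile.copy("latest.h5", "gs://your-bucket/latest.h5", overwrite=True)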
@monatis GREAT WORK!! I'm running this code.
I found another trick:
The max length of the speech features is related to the duration field in the transcript, so I rewrote the code that computes the max lengths; it's much faster.
def _max_len(self, duration, transcript):
    with tf.device("/CPU:0"):
        duration = int(float(duration) * 100 // 1) + 1
        duration = tf.cast(duration, tf.int32)
        label = self.text_featurizer.extract(transcript.decode("utf-8"))
        label_length = tf.cast(tf.shape(label)[0], tf.int32)
        prediction = self.text_featurizer.prepand_blank(label)
        prediction_length = tf.cast(tf.shape(prediction)[0], tf.int32)
        return duration, label_length, prediction_length

def compute_max_lengths(self, max_lengths_path: str = None):
    assert max_lengths_path is not None, "max_lengths_path cannot be None"
    max_lengths_path = os.path.join(preprocess_paths(max_lengths_path), f"{self.stage}.max_lengths.txt")
    if tf.io.gfile.exists(max_lengths_path):
        print(f"Loading max lengths from {max_lengths_path} ...")
        with tf.io.gfile.GFile(max_lengths_path, 'r') as f:
            self.max_input_length, self.max_label_length, self.max_prediction_length = [int(l) for l in f.read().split()]
        return
    lines = self.read_entries()
    for line in tqdm.tqdm(lines, desc=f"Computing max lengths for entries in {self.stage} dataset"):
        input_length, label_length, prediction_length = self._max_len(str(line[1]), str(line[2]).encode("utf-8"))
        self.max_input_length = input_length if input_length > self.max_input_length else self.max_input_length
        self.max_label_length = label_length if label_length > self.max_label_length else self.max_label_length
        self.max_prediction_length = prediction_length if prediction_length > self.max_prediction_length else self.max_prediction_length
    self.max_input_length = int(self.max_input_length.numpy())
    self.max_label_length = int(self.max_label_length.numpy())
    self.max_prediction_length = int(self.max_prediction_length.numpy())
    with tf.io.gfile.GFile(max_lengths_path, 'w') as f:
        f.write(f"{self.max_input_length} {self.max_label_length} {self.max_prediction_length}")
    print(f"Max lengths written to {max_lengths_path}")
@trillionmonster Nice tip. Mine was a naive implementation and quite slow, so I was planning to parallelize it. This will provide a speedup.
@monatis @trillionmonster I added support for training with the built-in Keras fit. Please check out PR #118. Let's see if we can run it on TPU :smile:
@usimarit Good work, I'll definitely give it a try, but it probably won't help with TPU, because I localized the root cause of the TPU training problem in ASRDataset. Simply creating an iterator over an instance of it and calling next(iterator) throws an exception, even without calling any training step function at all. My guess is that it is related to the tf.numpy_function() call used for preprocessing. According to its docs, the second "known limitation" is as follows:
The operation must run in the same address space as the Python program that calls tf.numpy_function(). If you are using distributed TensorFlow, you must run a tf.distribute.Server in the same process as the program that calls tf.numpy_function and you must pin the created operation to a device in that server (e.g. using with tf.device():).
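For reference, a generic repro of that pattern (not the actual ASRDataset code) would look like this: a tf.data pipeline whose map step goes through tf.numpy_function, distributed with TPUStrategy, then iterated.

    import numpy as np
    import tensorflow as tf

    def py_preprocess(x):
        # Stand-in for NumPy-side feature extraction / augmentation
        return np.asarray(x, dtype=np.float32) * 2.0

    def map_fn(x):
        y = tf.numpy_function(py_preprocess, [x], tf.float32)
        y.set_shape([None])
        return y

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    ds = tf.data.Dataset.from_tensor_slices(np.random.rand(32, 80).astype(np.float32))
    ds = ds.map(map_fn).batch(8)
    dist_ds = strategy.experimental_distribute_dataset(ds)

    it = iter(dist_ds)
    batch = next(it)  # iteration tends to break around here on TPU: the numpy_function
                      # op lives in the local Python process, while the input pipeline
                      # runs on the TPU worker host, hitting the limitation quoted above.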
So, I'm considering creating the TFRecord files from preprocessed input, with feature extraction and augmentation already applied, instead of applying them on the fly during training. In that case I will need to store num_epochs times more data, but my observation is that this is the usual setup for TPU training with complex preprocessing/augmentations. We may then not be able to fully integrate TPU training into the core repo, but I can write helper scripts and detailed notebooks / tutorials. What do you think?
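For example, serializing already-extracted features could look something like this sketch (the feature keys, output path and the preprocessed_examples iterable are made up for illustration):

    import tensorflow as tf

    def serialize_example(features, labels):
        # features: float32 [T, F] tensor, labels: int32 [U] tensor
        feature_dict = {
            "features": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(features).numpy()])),
            "labels": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(labels).numpy()])),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
        return example.SerializeToString()

    with tf.io.TFRecordWriter("gs://your-bucket/train_preprocessed_0.tfrecord") as writer:
        for features, labels in preprocessed_examples:  # assumed iterable of (features, labels) arrays
            writer.write(serialize_example(tf.constant(features), tf.constant(labels)))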
@monatis Yeah, we can do it that way. And to overcome that limitation, maybe we can read the audio files into arrays and the transcripts into arrays of classes, store them in tfrecords, and change the SpecAugment to use pure TF (the TFSpeechFeaturizer is already pure TF, so we don't need to change it). Then we can do augmentation on the fly.
@usimarit Sounds like a good plan. And it will accelerate the data pipeline for GPUs as well. There's an implementation of SpecAugment here. Plus, we can decode audio with tf.audio.decode_wav, thus removing the librosa dependency as well.
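A quick sketch of that pure-TF decoding path (the file path below is just a placeholder):

    import tensorflow as tf

    def load_wav(path):
        contents = tf.io.read_file(path)
        # decode_wav returns a float32 waveform in [-1, 1] plus the sample rate
        audio, sample_rate = tf.audio.decode_wav(contents, desired_channels=1)
        return tf.squeeze(audio, axis=-1), sample_rate

    waveform, sr = load_wav("sample.wav")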
Hey guys, check out the newest PR #130 to see if it can run on TPU :smile:
Hi @usimarit, I need to cherry-pick some of the commits into my fork, and then I can give it a try.
Hi @monatis, @trillionmonster, I added support for TPU training in PR #146, tested with the Keras built-in fit. You guys can try it :smile:
Hello, I found two problems when using TPU:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto raise ValueError("None values not supported.")
ValueError: None values not supported.
It seems like dynamic graphs are not supported on TPU.
Let's figure out how to train with TPU.