Closed trillionmonster closed 3 years ago
@trillionmonster @usimarit I'm working on the TPU support on this branch: (monatis/tpu). Much of the work has already been completed and I hope to make a pull request with a Colab within a few days.
On the second issue, you don't need to create tfrecords beforehand with the newest code. When you run a train_*.py script, it checks whether the tfrecords already exist, and if they do it returns True immediately, just as with the local filesystem. If it tries to recreate them even though they were already created beforehand, you may need to check read access to that GCS bucket --please note that read permission is separate and independent from write access on GCS.
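If it helps, a quick way to sanity-check read access from the same runtime is to list the bucket with tf.io.gfile (the bucket path below is just a placeholder):

    import tensorflow as tf

    # Placeholder path -- replace with your own GCS location.
    TFRECORDS_DIR = "gs://your-bucket/tfrecords/"

    # listdir needs read permission on the bucket and will raise an error without it;
    # exists may simply return False if the object is not readable/visible.
    print(tf.io.gfile.exists(TFRECORDS_DIR))
    print(tf.io.gfile.listdir(TFRECORDS_DIR))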
I cloned and installed it, but it doesn't work:
2021-01-07 06:24:42.559152: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-01-07 06:24:45.764710: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 06:24:45.765828: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-07 06:24:45.774893: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-07 06:24:45.774941: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (3472e8d374ea): /proc/driver/nvidia/version does not exist
2021-01-07 06:24:45.777737: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Cannot import RNNT loss in warprnnt. Falls back to RNNT in TensorFlow
Note: The RNNT in Tensorflow is not supported for CPU yet
Loading max lengths from /root/train.max_lengths.txt ...
Loading max lengths from /root/eval.max_lengths.txt ...
2021-01-07 06:24:47.491447: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 06:24:47.499031: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.25.169.170:8470}
2021-01-07 06:24:47.499082: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:40883}
2021-01-07 06:24:47.516967: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.25.169.170:8470}
2021-01-07 06:24:47.517028: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:40883}
2021-01-07 06:24:47.517496: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://localhost:40883
All TPUs: [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')]
………………………………
TFRecords're already existed: train
TFRecords're already existed: eval
[Train] | | 1/? [00:59<00:00, 59.98s/batch]Traceback (most recent call last):
File "train_tpu_subword_conformer.py", line 146, in
@trillionmonster Forward and backward passes are actually working, but it throws an error in the _check_* methods of the BaseRunner class. It is possible that conditionals or other statements in these methods are not supported on TPU. If that is the case, we may also need to refactor this class with pure TF ops. I'll deal with it later this week, probably at the weekend.
@monatis I'm trying to fix it, too 😄
@monatis You can try commenting out the _check* lines and copying their contents to the end of each epoch, so that log_interval, save_interval and eval_interval are ignored and we save and validate the model after each epoch.
If it runs smoothly after the first epoch, then the problem surely lies in the _check* functions, and we will have to keep a copy of self.steps in a plain Python variable, for example self._steps, and use that value in the condition checks instead of reading directly from the tf.Variable self.steps. The tf.Variable self.steps is needed for storing checkpoints, so we cannot remove it.
And I neither have GCS nor want to use your GCS, so I can only offer ideas to solve the problems :smile:
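A rough sketch of that idea (this is not the actual BaseRunner code; the config attribute name and the print placeholder are just illustrative):

    import tensorflow as tf

    class BaseRunner:
        def __init__(self, config):
            self.config = config
            self.steps = tf.Variable(0, dtype=tf.int64, trainable=False)  # kept for checkpointing
            self._steps = 0  # plain Python mirror used only for interval checks

        def _update_steps(self):
            self.steps.assign_add(1)  # checkpointed value
            self._steps += 1          # Python-side copy

        def _check_log_interval(self):
            # Pure Python condition: nothing here reads a tf.Variable,
            # so the check stays outside the TPU-compiled graph.
            if self._steps % self.config.log_interval_steps == 0:
                print(f"step {self._steps}: write logs/metrics here")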
@monatis Maybe the dataset is too large or the GPU on Colab is too slow; it takes 220 hours on GPU. I think training on TPU is necessary 😄
[Train]: 0% 0/103240 [00:00<?, ?batch/s]
2021-01-08 05:13:19.107514: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-08 05:13:19.107919: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2200000000 Hz
2021-01-08 05:13:59.745275: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-01-08 05:14:10.661856: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 30 of 300
2021-01-08 05:14:21.133863: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 54 of 300
2021-01-08 05:14:30.778680: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 83 of 300
2021-01-08 05:14:40.190762: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 110 of 300
2021-01-08 05:14:50.421657: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 141 of 300
2021-01-08 05:15:00.217172: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 164 of 300
2021-01-08 05:15:10.108075: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 191 of 300
2021-01-08 05:15:21.182262: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 223 of 300
2021-01-08 05:15:31.099524: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 251 of 300
2021-01-08 05:15:40.172470: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 280 of 300
2021-01-08 05:15:45.776024: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:230] Shuffle buffer filled.
2021-01-08 05:15:53.641993: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[Train] [Epoch 1/20]: 0% 45/103240 [06:50<220:52:12, 7.71s/batch, transducer_loss=848.9718]^C
@usimarit @monatis I changed the code in base_runner.py:
def _end_epoch(self):
    self.save_checkpoint()
    self.save_model_weights()
    self._write_to_tensorboard(self.train_metrics, self.steps, stage="train")
    for metric in self.train_metrics.keys():
        self.train_metrics[metric].reset_states()
    self._eval_epoch()

def _train_epoch(self):
    """Train model one epoch."""
    train_iterator = iter(self.train_data_loader)
    train_steps = 0
    while True:
        try:
            self._train_function(train_iterator)  # Run train step
        except StopIteration:
            break
        except tf.errors.OutOfRangeError:
            break
        except Exception as e:
            raise e
        # # Update steps
        self.steps.assign_add(1)
        self.train_progbar.update(1)
        train_steps += 1
        # # Run save checkpoint
        # self._check_save_interval()
        # # Print epoch info
        self.train_progbar.set_description_str(
            f"[Train] [Epoch {self.epochs}/{self.config.num_epochs}]")
        # # Print train info to progress bar
        # self._print_train_metrics(self.train_progbar)
        # # Run logging
        # self._check_log_interval()
        # # Run evaluation
        # self._check_eval_interval()
    self._end_epoch()
    self.train_steps_per_epoch = train_steps
    self.train_progbar.total = self.total_train_steps
    self.train_progbar.refresh()
and the error changed:
Total params: 12,406,941
Trainable params: 12,402,333
Non-trainable params: 4,608
________________________________________________________________________________________________________________________
TFRecords're already existed: train
TFRecords're already existed: eval
[Train] [Epoch 1/20] | | 6854/? [02:20<00:00, 92.38batch/s]Traceback (most recent call last):
File "train_tpu_conformer.py", line 134, in <module>
conformer_trainer.fit(train_dataset, eval_dataset, train_bs=args.bs, eval_bs=args.bs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 317, in fit
self.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 192, in run
self._train_epoch()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 213, in _train_epoch
raise e
File "/usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py", line 207, in _train_epoch
self._train_function(train_iterator) # Run train step
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 862, in _call
results = self._stateful_fn(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.CancelledError: 9 root error(s) found.
(0) Cancelled: Iterator was cancelled
[[node IteratorGetNext_6 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py:247) ]]
(1) Cancelled: Function was cancelled before it was started
(2) Cancelled: Function was cancelled before it was started
(3) Cancelled: Function was cancelled before it was started
(4) Cancelled: Function was cancelled before it was started
(5) Cancelled: Function was cancelled before it was started
(6) Cancelled: Function was cancelled before it was started
(7) Cancelled: Function was cancelled before it was started
(8) Cancelled: Function was cancelled before it was started
0 successful operations.
0 derived errors ignored.
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors. [Op:__inference__train_function_133151]
Errors may have originated from an input operation.
Input Source operations connected to node IteratorGetNext_6:
iterator_14 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_asr/runners/base_runners.py:207)
Function call stack:
_train_function -> _train_function -> _train_function -> _train_function -> _train_function -> _train_function -> _train_function -> _train_function -> _train_function
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/tpu_strategy.py", line 738, in async_wait
context.async_wait()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2330, in async_wait
context().sync_executors()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 645, in sync_executors
pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.CancelledError: 9 root error(s) found.
(0) Cancelled: {{function_node __inference__train_function_133151}} Iterator was cancelled
[[{{node IteratorGetNext_6}}]]
(1) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(2) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(3) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(4) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(5) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(6) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(7) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
(8) Cancelled: {{function_node __inference__train_function_133151}} Function was cancelled before it was started
0 successful operations.
0 derived errors ignored.
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
@trillionmonster Can you also try catching tf.errors.CancelledError and breaking out of the while loop there?
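For example, the exception handling in the _train_epoch loop posted above could be extended like this (a sketch against that snippet, not the exact repo code):

    while True:
        try:
            self._train_function(train_iterator)  # Run train step
        except (StopIteration, tf.errors.OutOfRangeError):
            break
        except tf.errors.CancelledError:
            # The TPU input iterator was cancelled; end the epoch cleanly
            # instead of re-raising and killing the whole run.
            break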
@usimarit I'll have a look at it at the weekend and implement a TPUBaseRunner if necessary. TPU training yields a significant speedup until it stops with the error @trillionmonster posted in the last comment, so it can contribute a lot to the research once fully debugged.
@monatis Yeah, you can do either of the following:
1. Create a TPUBaseTrainer that inherits directly from BaseTrainer, or
2. Change BaseTrainer to support both GPU and TPU.
The second way is recommended, since with the first way we would also have to create extra classes such as TPUTransducerTrainer and TPUTransducerTrainerGA inheriting from TPUBaseTrainer to use TPU for Transducer models.
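A minimal sketch of the second option, where a single BaseTrainer just picks the right tf.distribute strategy (the constructor arguments here are hypothetical, not the current API):

    import tensorflow as tf

    def make_strategy(tpu_address=None):
        # With a TPU address (or "" on Colab) use TPUStrategy, otherwise fall back to GPUs/CPU.
        if tpu_address is not None:
            resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
            tf.config.experimental_connect_to_cluster(resolver)
            tf.tpu.experimental.initialize_tpu_system(resolver)
            return tf.distribute.TPUStrategy(resolver)
        return tf.distribute.MirroredStrategy()

    class BaseTrainer:
        def __init__(self, config, tpu_address=None):
            self.config = config
            self.strategy = make_strategy(tpu_address)
            # model, optimizer and datasets are then built under self.strategy.scope()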
I gave it a try on the tpudev branch, but it still doesn't work.
Still this error:
"""(0) Cancelled: Iterator was cancelled"""
@monatis @trillionmonster I've been testing on TPU; it seems like the TPU on Colab cannot use tf.train.Checkpoint and has some other problems.
tf.train.Checkpoint seems to work, but the dataset iterator gets randomly cancelled; no idea why.
I tried different batch sizes and numbers of TFRecord shards, but unfortunately iteration is cancelled randomly after thousands of steps even in the basic setup on Colab, and I couldn't figure out why yet. Similar bugs have been reported recently. I think our best chance will be to try it out on Cloud TPU --I'm in the process of obtaining TFRC credits for this.
@monatis @trillionmonster I'm planning to deprecate the custom training loop and support tf.keras.Model.fit with some custom callbacks :smile: Maybe it supports TPU better without needing a bunch of customization, so we can focus more on SOTA models. Have you guys tried tf.keras.Model.fit on TPU yet?
@usimarit Never thought of doing so :D It may be easier to convert the RNNT loss to a plain old Keras .fit()-compatible API than to struggle with all the caveats of a custom training loop on TPUs. Definitely worth giving it a try 🚀
@monatis @usimarit I'm trying to use Huggingface's code to train a transformer-based model with a mask layer. It needs a lot of changes.
@usimarit @trillionmonster I finally managed to train it for 20 epochs with the custom training loop on this Colab. But I still need to implement TensorBoard logging and evaluation at the end of each epoch.
Implementation considerations are as follows:
- iter(train_data_loader) didn't work on TPU, so I iterated over the instance of DistributedDataset directly (see train_dl in the notebook).
- DistributedDataset didn't work eagerly, either, so I decorated the train_epoch function with tf.function.
- tf.train.Checkpoint.save() caused problems both inside and outside a tf.function-decorated function. I was able to save a checkpoint only with tf.train.Checkpoint.write() inside a tf.function-decorated function, which is a lower-level API. I will re-examine this.
- tf.keras.Model.save_weights() apparently fails to write latest.h5 to GCS, so I wrote to the local filesystem and then copied it to GCS.
- stdout when training on TPU. Note: for TensorBoard logging on TPU, we need to use tf.config.set_soft_device_placement(True).
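Putting those workarounds together, the training loop in the notebook roughly follows this shape (a sketch assuming strategy, model, optimizer, train_dl, train_step and num_epochs are defined elsewhere; the GCS paths are placeholders):

    import tensorflow as tf

    tf.config.set_soft_device_placement(True)  # needed for TensorBoard logging on TPU

    checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)

    @tf.function
    def train_epoch(dist_dataset, ckpt_path):
        # Iterate the DistributedDataset directly instead of wrapping it in iter()
        for batch in dist_dataset:
            strategy.run(train_step, args=(batch,))
        # Per the note above, Checkpoint.write() (the lower-level API) was the variant
        # that saved successfully, and only inside a tf.function-decorated function.
        checkpoint.write(ckpt_path)

    for epoch in range(num_epochs):
        train_epoch(train_dl, f"gs://your-bucket/checkpoints/ckpt-{epoch}")
        # save_weights() straight to GCS failed, so write locally and copy afterwards.
        model.save_weights("latest.h5")
        tf.io.gfile.copy("latest.h5", "gs://your-bucket/latest.h5", overwrite=True)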
@monatis GREAT WORK!! I'm running this code.
I found another trick:
The max length of the speech features is related to the duration field in the transcript, so I rewrote the code that computes the max lengths; it's much faster.
def _max_len(self, duration, transcript):
    with tf.device("/CPU:0"):
        duration = int(float(duration) * 100 // 1) + 1
        duration = tf.cast(duration, tf.int32)
        label = self.text_featurizer.extract(transcript.decode("utf-8"))
        label_length = tf.cast(tf.shape(label)[0], tf.int32)
        prediction = self.text_featurizer.prepand_blank(label)
        prediction_length = tf.cast(tf.shape(prediction)[0], tf.int32)
        return duration, label_length, prediction_length

def compute_max_lengths(self, max_lengths_path: str = None):
    assert max_lengths_path is not None, "max_lengths_path cannot be None"
    max_lengths_path = os.path.join(preprocess_paths(max_lengths_path), f"{self.stage}.max_lengths.txt")
    if tf.io.gfile.exists(max_lengths_path):
        print(f"Loading max lengths from {max_lengths_path} ...")
        with tf.io.gfile.GFile(max_lengths_path, 'r') as f:
            self.max_input_length, self.max_label_length, self.max_prediction_length = [int(l) for l in f.read().split()]
        return
    lines = self.read_entries()
    for line in tqdm.tqdm(lines, desc=f"Computing max lengths for entries in {self.stage} dataset"):
        input_length, label_length, prediction_length = self._max_len(str(line[1]), str(line[2]).encode("utf-8"))
        self.max_input_length = input_length if input_length > self.max_input_length else self.max_input_length
        self.max_label_length = label_length if label_length > self.max_label_length else self.max_label_length
        self.max_prediction_length = prediction_length if prediction_length > self.max_prediction_length else self.max_prediction_length
    self.max_input_length = int(self.max_input_length.numpy())
    self.max_label_length = int(self.max_label_length.numpy())
    self.max_prediction_length = int(self.max_prediction_length.numpy())
    with tf.io.gfile.GFile(max_lengths_path, 'w') as f:
        f.write(f"{self.max_input_length} {self.max_label_length} {self.max_prediction_length}")
    print(f"Max lengths written to {max_lengths_path}")
@trillionmonster Nice tip. Mine was a naive implementation and quite slow, so I was planning to parallelize it. This will provide a speedup.
@monatis @trillionmonster I added support for training with the built-in Keras fit. Please check out PR #118. Let's see if we can run it on TPU :smile:
@usimarit Good work, I'll definitely give it a try, but it probably won't help with TPU, because I localized the root cause of the TPU training problem in ASRDataset. Simply creating an iterator over an instance of it and calling next(iterator) throws an exception, even without calling any training step function at all. My guess is that it is related to the tf.numpy_function() call used for preprocessing. According to its docs, the second "known limitation" is as follows:
The operation must run in the same address space as the Python program that calls tf.numpy_function(). If you are using distributed TensorFlow, you must run a tf.distribute.Server in the same process as the program that calls tf.numpy_function and you must pin the created operation to a device in that server (e.g. using with tf.device():).
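For reference, a generic repro of that pattern (not the actual ASRDataset code) would look like this: a tf.data pipeline whose map step goes through tf.numpy_function, distributed with TPUStrategy, then iterated.

    import numpy as np
    import tensorflow as tf

    def py_preprocess(x):
        # Stand-in for NumPy-side feature extraction / augmentation
        return np.asarray(x, dtype=np.float32) * 2.0

    def map_fn(x):
        y = tf.numpy_function(py_preprocess, [x], tf.float32)
        y.set_shape([None])
        return y

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    ds = tf.data.Dataset.from_tensor_slices(np.random.rand(32, 80).astype(np.float32))
    ds = ds.map(map_fn).batch(8)
    dist_ds = strategy.experimental_distribute_dataset(ds)

    it = iter(dist_ds)
    batch = next(it)  # iteration tends to break around here on TPU: the numpy_function
                      # op lives in the local Python process, while the input pipeline
                      # runs on the TPU worker host, hitting the limitation quoted above.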
So, I'm considering creating the TFRecord files from preprocessed input, with feature extraction and augmentation already applied, instead of applying them on the fly during training. In that case I will need to store num_epochs times more data, but my observation is that this is the usual setup for TPU training with complex preprocessing/augmentations. We may then not be able to fully integrate TPU training into the core repo, but I can write helper scripts and detailed notebooks / tutorials. What do you think?
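For example, serializing already-extracted features could look something like this sketch (the feature keys, output path and the preprocessed_examples iterable are made up for illustration):

    import tensorflow as tf

    def serialize_example(features, labels):
        # features: float32 [T, F] tensor, labels: int32 [U] tensor
        feature_dict = {
            "features": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(features).numpy()])),
            "labels": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(labels).numpy()])),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
        return example.SerializeToString()

    with tf.io.TFRecordWriter("gs://your-bucket/train_preprocessed_0.tfrecord") as writer:
        for features, labels in preprocessed_examples:  # assumed iterable of (features, labels) arrays
            writer.write(serialize_example(tf.constant(features), tf.constant(labels)))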
@monatis Yeah, we can do it that way. And to overcome that limitation, maybe we can read the audio files into arrays and the transcripts into arrays of classes, store them in tfrecords, and change the SpecAugment to use pure TF (the TFSpeechFeaturizer is already pure TF, so we don't need to change it). Then we can do augmentation on the fly.
@usimarit Sounds like a good plan. And it will accelerate the data pipeline for GPUs as well. There's an implementation of SpecAugment here. Plus, we can decode audio with tf.audio.decode_wav, thus removing the librosa dependency as well.
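A quick sketch of that pure-TF decoding path (the file path below is just a placeholder):

    import tensorflow as tf

    def load_wav(path):
        contents = tf.io.read_file(path)
        # decode_wav returns a float32 waveform in [-1, 1] plus the sample rate
        audio, sample_rate = tf.audio.decode_wav(contents, desired_channels=1)
        return tf.squeeze(audio, axis=-1), sample_rate

    waveform, sr = load_wav("sample.wav")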
Hey guys, check out the newest PR #130 to see if it can run on TPU :smile:
Hi @usimarit, I need to cherry-pick some of the commits into my fork, and then I can give it a try.
Hi @monatis, @trillionmonster, I added support for TPU training in PR #146, tested with the Keras built-in fit. You guys can try it :smile:
Hello, I found two problems when using TPU:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto raise ValueError("None values not supported.")
ValueError: None values not supported.
It seems like dynamic graphs are not supported on TPU.
Let's figure out how to train with TPU.