Closed StrangeTcy closed 3 years ago
Ok, currently my guess is that `main.py` is using `from tensorflow.python.tpu import tpu_config, tpu_estimator` to create an estimator, which in turn imports `tensorflow_estimator`, which apparently hasn't been updated in years and has no idea how to deal with `CrossShardOptimizer`s and `CrossReplicaSum`.
I had this same issue (after trying to update the tensorflow version) and, after some amount of work, realized that the "Source" cell set the tensorflow version to 1.x. I changed this to 2.x and it seems to work like a charm.
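One quick way to confirm which major version the notebook actually picked up is to inspect the version string after import. This is only a sketch: `major_version` is a hypothetical helper, and the TensorFlow import is guarded so the snippet also runs where TensorFlow is not installed.

```python
def major_version(version_string):
    """Return the leading integer of a dotted version string, e.g. '2.4.1' -> 2."""
    return int(version_string.split(".")[0])


try:
    import tensorflow as tf  # only available if TensorFlow is installed
    tf_version = tf.__version__
except ImportError:
    tf_version = None

if tf_version is None:
    print("TensorFlow is not installed in this runtime")
elif major_version(tf_version) >= 2:
    print("Running TensorFlow {} (2.x)".format(tf_version))
else:
    print("Running TensorFlow {} (1.x)".format(tf_version))
```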
@samriddhishree
Interesting, since the last time I tried not downgrading tensorflow to 1.15, I ran into the wrong `mesh_shape`.
Pull request I opened w/ my changes, which worked on colab: https://github.com/EleutherAI/gpt-neo/pull/130 Try it out and see if it works for you :)
@samriddhishree
Ok, if I use just your `create_tfrecords` patch, it still fails.
But if I use it and your `optimizers` patch, a different error occurs:
TypeError: CrossShardOptimizer only works with tf.training.Optimizer and not Optimizer_v2. If you are using TPUStrategy, OptimizerV2 will sum gradients across replicas.If you are using TPUEstimator, you may instead sum your gradients with: grads = [tf.compat.v1.tpu.cross_replica_sum(g) for g in grads]. If you want to average your gradients, rescale your loss with: loss /= global_batch_size
Given the circumstances, that might be considered progress, I think :-)
Did you try updating `%tensorflow` to 2.x in `GPTNeo_example_notebook.ipynb`? I didn't use `CrossShardOptimizer` in my code. I started with the base Colab notebook and only made those 2 additional changes in the pull request.
Why did you remove `// args.processes` from the line `files = split_list(files, len(files) // args.processes)` (context)? Without this you aren't getting any benefit out of having multiple CPUs, you're just running on each of them independently. It should also cause data duplication, unless I am missing something.
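To make the point about `// args.processes` concrete, here is a minimal sketch. It assumes `split_list(l, n)` cuts a list into chunks of size `n`, which is what the quoted call suggests; the function body, file names, and process count below are assumptions for illustration, not the repository's code.

```python
def split_list(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Hypothetical input: 8 files shared between 4 worker processes.
files = ["shard_{}.txt".format(i) for i in range(8)]
processes = 4

# Chunk size is len(files) // processes; max(1, ...) avoids a zero step
# when there are fewer files than processes.
chunks = split_list(files, max(1, len(files) // processes))

for worker_id, chunk in enumerate(chunks):
    print(worker_id, chunk)

# Each file appears in exactly one chunk, so no worker repeats another's work.
flattened = [name for chunk in chunks for name in chunk]
```

With the division in place each worker receives its own disjoint slice of the file list, which is the property the question above is concerned about.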
@samriddhishree Yep, that as well. That is, I
* switched to `%tensorflow 2.x`
* removed the `args.processes`
* added the `CrossShardOptimizer` wrapper
Perhaps I should try switching to tensorflow 2.x only.
ETA: Ok, if I only switch to tensorflow 2.x, the `CrossShardOptimizer` error appears. If I switch to tensorflow 2.x and use the `CrossShardOptimizer` wrapper patch, the above error appears, mentioning TPUStrategies and TPUEstimators.
So, as I understand it, we can't just wrap a version 2 Optimizer, which mesh-tensorflow returns, in a `CrossShardOptimizer` (as discussed here).
Perhaps we need to follow their suggestions and sum our gradients?
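The arithmetic behind the error message's suggestion (sum gradients across replicas, and divide the loss by the global batch size to get an average) can be checked without a TPU. The sketch below is plain Python with made-up data; the Python `sum` over replicas stands in for `tf.compat.v1.tpu.cross_replica_sum`, and nothing here calls TensorFlow.

```python
def grad_sum_squared_error(w, xs, ys):
    """Gradient w.r.t. w of sum((w*x - y)^2) over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


# Hypothetical data: 8 examples split evenly over 4 replicas.
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
global_batch_size = len(xs)
num_replicas = 4
per_replica = global_batch_size // num_replicas

# Each replica differentiates its local loss, already divided by the
# GLOBAL batch size (the "loss /= global_batch_size" step).
replica_grads = []
for r in range(num_replicas):
    lo, hi = r * per_replica, (r + 1) * per_replica
    local_grad = grad_sum_squared_error(w, xs[lo:hi], ys[lo:hi])
    replica_grads.append(local_grad / global_batch_size)

# Stand-in for the cross-replica sum: add the per-replica gradients.
summed = sum(replica_grads)

# Reference: gradient of the mean loss computed on the whole batch at once.
reference = grad_sum_squared_error(w, xs, ys) / global_batch_size
print(abs(summed - reference) < 1e-9)
```

Summing the rescaled per-replica gradients reproduces the gradient of the mean loss over the full batch, which is why the message pairs the two steps.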
I think if you use `MTF` under the hood, you wouldn't really need to run a `CrossShardOptimizer`, as `MTF` already does that given a `layout` and `mesh`.
For example, the `SimdMeshImpl` in `MTF` on `TPU` would lower the `reduce_sum` to a mesh-specific allreduce, and that translates to a proper TPU `CrossReplicaSum`.
`CrossShardOptimizer` performs `CrossReplicaSum` under the hood, and for this case, `MTF` should get you covered as long as you pass in a proper `mesh` and `layout`.
So for what you've done above, you probably want to remove `CrossShardOptimizer` and simply run with `tensorflow 2.x`?
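As a rough illustration of what "passing in a proper `mesh` and `layout`" means, the sketch below resolves which mesh dimension a tensor dimension would be split over. It is a simplified stand-in written in plain Python, not Mesh TensorFlow's implementation; `parse_pairs` and `mesh_axis_for` are hypothetical helpers, the strings are made-up values, and the assumption that a rule naming an absent mesh dimension is simply skipped reflects my reading rather than a documented guarantee.

```python
def parse_pairs(spec):
    """Parse 'a:b,c:d' into a list of (name, value) string pairs."""
    return [tuple(item.split(":")) for item in spec.split(",") if item]


def mesh_axis_for(tensor_dim, mesh_shape, layout):
    """Return the mesh dimension a tensor dimension is split over, or None."""
    mesh_dims = [name for name, _ in parse_pairs(mesh_shape)]
    for dim_name, mesh_dim in parse_pairs(layout):
        # A rule only applies if it names this tensor dimension AND a
        # mesh dimension that actually exists in the mesh.
        if dim_name == tensor_dim and mesh_dim in mesh_dims:
            return mesh_dim
    return None  # no applicable rule: the dimension is not split


layout = "heads:x,embd:y"  # hypothetical layout rules

print(mesh_axis_for("heads", "x:4,y:2", layout))  # split over mesh dim x
print(mesh_axis_for("batch", "x:4,y:2", layout))  # no rule for batch
print(mesh_axis_for("heads", "all:8", layout))    # mesh has no dim named x
```

Under these assumptions, a layout that only names `x` and `y` leaves every dimension unsplit on a mesh whose single dimension is called `all`.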
@samriddhishree Yep, that as well. That is, I
* switched to `%tensorflow 2.x`
* removed the `args.processes`
* added the `CrossShardOptimizer` wrapper
Perhaps I should try switching to tensorflow 2.x only.
ETA: Ok, if I only switch to tensorflow 2.x, the `CrossShardOptimizer` error appears. If I switch to tensorflow and use the `CrossShardOptimizer` wrapper patch, the above error appears, mentioning TPUStrategies and TPUEstimators.
The CrossShardOptimizer should be 100% unnecessary if you're using TF 2.x for the reasons @hthu describes. I am extremely willing to believe that our Colab file is out of date, but I am highly skeptical of the CrossShardOptimizer wrapper.
All patches should be based on TF 2.x. TF 1.x is going away and all code that only runs in TF 1.x will break in the near future.
@hthu, I've just re-checked and
This results in the `CrossShardOptimizer` error.
Just to clarify, I didn't write this code, I'm just trying to use it & make it work.
So, I think that `MTF` is being used under the hood, but somehow -- due to it being an old version of `mtf`, or `mtf` itself lacking the necessary definitions, or something from the `mtf` source code being redefined here -- it doesn't seem to produce an optimizer ready to be run on a TPU.
What `mesh_shape` are you using?
You might want at least `2x2` to start with.
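For readers unfamiliar with the `name:size` notation for `mesh_shape`, the sketch below parses a few shape strings and reports how many dimensions and how many processors each one describes. The helper names and example strings are hypothetical; this is not the Mesh TensorFlow parser.

```python
from functools import reduce
from operator import mul


def parse_mesh_shape(spec):
    """Parse 'x:4,y:2' into an ordered list of (name, size) pairs."""
    dims = []
    for item in spec.split(","):
        name, size = item.split(":")
        dims.append((name, int(size)))
    return dims


def mesh_rank(spec):
    """Number of mesh dimensions described by the shape string."""
    return len(parse_mesh_shape(spec))


def mesh_size(spec):
    """Total number of processors: the product of all dimension sizes."""
    return reduce(mul, (size for _, size in parse_mesh_shape(spec)), 1)


for spec in ("all:8", "x:2,y:2", "x:4,y:2"):
    print(spec, "rank:", mesh_rank(spec), "size:", mesh_size(spec))
```

A one-dimensional mesh offers only a single axis to split tensors over, whereas a two-dimensional shape such as `x:4,y:2` gives layout rules two axes to work with.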
@hthu
I'm using `all:8`
ETA: in the other thread you say that the idea is to have no need to wrap anything. Which I get, and agree with. So now I'm trying to figure out where the error originates.
Do you have the latest stacktrace?
You probably shouldn't see any `TpuEstimator` error message or you're still in 1.x realm?
@hthu
2021-02-15 11:04:26.763273: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-15 11:04:26.763406: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 0
Downloading: 100% 1.04M/1.04M [00:00<00:00, 5.93MB/s]
Downloading: 100% 456k/456k [00:00<00:00, 3.22MB/s]
Downloading: 100% 1.36M/1.36M [00:00<00:00, 7.61MB/s]
Saving config to /content/drive/MyDrive/GPT2_colab_trial
2021-02-15 11:04:30.923993: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-15 11:04:30.949992: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-15 11:04:31.006147: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-02-15 11:04:31.006204: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (17c0abb61b15): /proc/driver/nvidia/version does not exist
2021-02-15 11:04:31.226784: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7ff1a9057598>, {'n_head': 32, 'n_vocab': 50260, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0.1, 'train_batch_size': 256, 'attn_dropout': 0, 'train_steps': 572300, 'eval_steps': 0, 'predict_steps': 1, 'res_dropout': 0, 'eval_batch_size': 64, 'predict_batch_size': 1, 'iterations': 100, 'n_embd': 2048, 'datasets': ['nihexporter'], 'model': 'GPT', 'model_path': '/content/drive/MyDrive/GPT2_colab_trial', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global'], 'mesh_shape': 'all:8', 'layout': 'intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 2048, 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'nihexporter': {'path': '/content/drive/MyDrive/openwebtext_tokenized/*.tfrecords', 'eval_path': '', 'n_vocab': 50256, 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 8, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': True, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 500, 'predict': False, 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': '/content/drive/MyDrive/GPT2_colab_trial', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
job {
name: "worker"
tasks {
key: 0
value: "10.104.141.194:8470"
}
}
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.104.141.194:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.104.141.194:8470', '_evaluation_master': 'grpc://10.104.141.194:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=8, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu.tpu_cluster_resolver.TPUClusterResolver object at 0x7ff1a8ff1240>}
_TPUContext: eval_on_tpu True
Querying Tensorflow master (grpc://10.104.141.194:8470) for TPU system metadata.
2021-02-15 11:04:31.237247: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:373] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
Initializing TPU system (master: grpc://10.104.141.194:8470) to fetch topology for model parallelism. This might take a while.
Found TPU system:
*** Num TPU Cores: 8
*** Num TPU Workers: 1
*** Num TPU Cores Per Worker: 8
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, -5884039716630823473)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, -3837800863408535284)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 3084760653800097538)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, -1421254968311746768)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, -5123767212799155974)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 1549024653082208868)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 5930724887607135313)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, -4494021334020959541)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 1089526645275849123)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 1187669109324654925)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, -3269809841973767338)
From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Calling model_fn.
WARNING:root:Changing batch size with sequential_input() will result in some data being skipped or repeated. Please ensure your batch size stays constant throughout training.
num_cores_per_replica: 1
computation_shape: [1, 1, 1, 1]
num_replicas: 8
device_assignment.topology.device_coordinates: [[[0 0 0 0]
[0 0 0 1]
[1 0 0 0]
[1 0 0 1]
[0 1 0 0]
[0 1 0 1]
[1 1 0 0]
[1 1 0 1]]]
device_assignment.core_assignment: [[[0 0 0 0]]
[[0 0 0 1]]
[[1 0 0 0]]
[[1 0 0 1]]
[[0 1 0 0]]
[[0 1 0 1]]
[[1 1 0 0]]
[[1 1 0 1]]]
2021-02-15 11:04:43.111549: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
device_list = ['/job:worker/task:0/device:CPU:0']
SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
SimdMeshImpl init: Shape[all=8] LayoutRules{('heads', 'x'), ('embd', 'y'), ('intermediate_expanded', 'x'), ('vocab', 'x'), ('memory_length', 'y')}
Device Assignment: <tensorflow.python.tpu.device_assignment.DeviceAssignment object at 0x7ff17ebca240>
serialize_num_microbatches: tokens_per_microbatch_per_replica=2048 batch_dim=Dimension(name='batch', size=256) sequence_length={'inputs': 2048, 'labels': 2048} batch_per_replica=256 num_microbatches=256
N TRAINABLE VARS:
1,315,581,952
ALL DIM NAMES:
intermediate_expanded
heads
embed_sequence
embd
vocab
### lots and lots
## and lots of model details
From /content/GPTNeo/utils.py:284: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
tf_update_ops: [<tf.Operation 'assign_948/group_deps_154' type=NoOp>, <tf.Operation 'assign_949/group_deps_308' type=NoOp>, <tf.Variable 'AssignAddVariableOp' shape=() dtype=int64>]
Create CheckpointSaverHook.
Bypassing TPUEstimator hook
training_loop marked as finished
Reraising captured error
Traceback (most recent call last):
File "main.py", line 256, in <module>
main(args)
File "main.py", line 230, in main
estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3130, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
self.config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3386, in _model_fn
_validate_tpu_training_graph(ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3817, in _validate_tpu_training_graph
'CrossShardOptimizer must be used for model training on TPUs.')
ValueError: CrossShardOptimizer must be used for model training on TPUs.
That's what happens if I try using tensorflow 2.x in a colab notebook.
The `TPUEstimator hook` gets bypassed for some reason I didn't really look into.
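The traceback ends in `_validate_tpu_training_graph`, and the log reports `num_replicas: 8`. One plausible reading, which is an assumption and not a quote of the `tensorflow_estimator` source, is that the estimator scans the training graph for a cross-replica sum op and rejects a multi-replica graph that has none. The sketch below mimics that kind of check on a plain list of op type names; the function and the op lists are hypothetical.

```python
CROSS_REPLICA_SUM = "CrossReplicaSum"


def validate_training_graph(op_types, num_replicas):
    """Raise if a multi-replica graph contains no cross-replica sum op."""
    has_sum = any(op_type == CROSS_REPLICA_SUM for op_type in op_types)
    if not has_sum and num_replicas > 1:
        raise ValueError(
            "CrossShardOptimizer must be used for model training on TPUs.")


# Hypothetical op lists standing in for a real TensorFlow graph.
plain_graph = ["MatMul", "AssignAddVariableOp", "NoOp"]
wrapped_graph = plain_graph + [CROSS_REPLICA_SUM]

validate_training_graph(wrapped_graph, num_replicas=8)  # accepted
validate_training_graph(plain_graph, num_replicas=1)    # accepted: one replica

try:
    validate_training_graph(plain_graph, num_replicas=8)
except ValueError as err:
    print("rejected:", err)
```

If the real check works along these lines, it would explain why a graph whose reductions are expressed some other way can still trip the error when the estimator sees eight replicas.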
Hi! Thanks for the feedback. We pushed some updates, including a totally revamped notebook. Give it a try and let me know if you're still getting this error.
Running the example on a Colab TPU results in the following error: