EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License

CrossShardOptimizer must be used for model training on TPUs #120

Closed StrangeTcy closed 3 years ago

StrangeTcy commented 3 years ago

Running the example on a Colab TPU results in the following error:

File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 230, in main
    estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3130, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3386, in _model_fn
    _validate_tpu_training_graph(ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3817, in _validate_tpu_training_graph
    'CrossShardOptimizer must be used for model training on TPUs.')
ValueError: CrossShardOptimizer must be used for model training on TPUs.
StrangeTcy commented 3 years ago

Ok, currently my guess is that main.py uses `from tensorflow.python.tpu import tpu_config, tpu_estimator` to create an estimator, which in turn imports tensorflow_estimator, which apparently hasn't been updated in years and has no idea how to deal with CrossShardOptimizer and CrossReplicaSum.
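
For reference, the code path in question looks roughly like this (a sketch only; the address, bucket path, and batch size below are placeholders, not the values in main.py):

```python
# Rough sketch of the TPUEstimator construction path referred to above.
# All concrete values here are placeholders, not the ones used in main.py.
from tensorflow.python.tpu import tpu_config, tpu_estimator

run_config = tpu_config.RunConfig(
    master="grpc://10.0.0.2:8470",            # placeholder TPU address
    model_dir="gs://some-bucket/model",       # placeholder checkpoint dir
    tpu_config=tpu_config.TPUConfig(
        iterations_per_loop=100,
        num_shards=8))

estimator = tpu_estimator.TPUEstimator(
    model_fn=model_fn,                        # the mtf-based model_fn (defined elsewhere in the repo)
    config=run_config,
    use_tpu=True,
    train_batch_size=256)
```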

samriddhishree commented 3 years ago

I had this same issue (after trying to update the tensorflow version) and, after some amount of work, realized that the "Source" cell set the tensorflow version to 1.x. I changed this to 2.x and it seems to work like a charm.
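
For reference, the change amounts to the standard Colab TensorFlow-version magic (a minimal sketch; the actual notebook cell may differ slightly):

```
# In the Colab notebook's setup ("Source") cell:
%tensorflow_version 2.x

import tensorflow as tf
print(tf.__version__)  # should now report a 2.x release
```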

StrangeTcy commented 3 years ago

@samriddhishree Interesting, since the last time I tried not downgrading tensorflow to 1.15, I ran into a wrong mesh_shape error.

samriddhishree commented 3 years ago

Here's the pull request I opened with my changes, which worked on Colab: https://github.com/EleutherAI/gpt-neo/pull/130. Try it out and see if it works for you :)

StrangeTcy commented 3 years ago

@samriddhishree Ok, if I use just your create_tfrecords patch, it still fails. But if I use it together with your optimizers patch, a different error occurs:

TypeError: CrossShardOptimizer only works with tf.training.Optimizer and not Optimizer_v2. If you are using TPUStrategy, OptimizerV2 will sum gradients across replicas.If you are using TPUEstimator, you may instead sum your gradients with: grads = [tf.compat.v1.tpu.cross_replica_sum(g) for g in grads]. If you want to average your gradients, rescale your loss with: loss /= global_batch_size

Given the circumstances, that might be considered progress, I think :-)

samriddhishree commented 3 years ago

Did you try updating %tensorflow to 2.x in GPTNeo_example_notebook.ipynb? I didn't use CrossShardOptimizer in my code -- I started with the base Colab notebook and only made those two additional changes in the pull request.

StellaAthena commented 3 years ago

Why did you remove `// args.processes` from the line `files = split_list(files, len(files) // args.processes)` (context)? Without this you aren't getting any benefit out of having multiple CPUs; you're just running on each of them independently. It should also cause data duplication, unless I am missing something.
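
For reference, the chunking behaves roughly like this (an illustration only; the `split_list` below is a stand-in for the repo's helper, written to show the effect of the divisor):

```python
# Stand-in for the helper in the repo, to show what the divisor does.
def split_list(lst, n):
    # Split lst into consecutive chunks of length n.
    return [lst[i:i + n] for i in range(0, len(lst), n)]

files = ["a.txt", "b.txt", "c.txt", "d.txt"]

# With the divisor: len(files) // 2 == 2, so two chunks, one per process.
print(split_list(files, len(files) // 2))  # [['a.txt', 'b.txt'], ['c.txt', 'd.txt']]

# Without it: a single chunk containing every file, so the per-process
# split that the divisor provided is lost.
print(split_list(files, len(files)))       # [['a.txt', 'b.txt', 'c.txt', 'd.txt']]
```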

StrangeTcy commented 3 years ago

@samriddhishree Yep, that as well. That is, I

* switched to `%tensorflow 2.x`

* removed the `args.processes`

* added the `CrossShardOptimizer` wrapper

Perhaps I should try switching to tensorflow 2.x only.

ETA: Ok, if I only switch to tensorflow 2.x, the CrossShardOptimizer error appears. If I switch to tensorflow and use the CrossShardOptimizer wrapper patch, the above error appears, mentioning TPUStrategies and TPUEstimators

StrangeTcy commented 3 years ago

So, as I understand it, we can't just wrap a version 2 Optimizer, which mesh-tensorflow returns, in a CrossShardOptimizer (as discussed here). Perhaps we need to follow their suggestions and sum our gradients?
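
Something along those lines would presumably look like this (a rough sketch based on the error message above, assuming a plain graph-mode model_fn rather than the actual gpt-neo code; `loss`, `opt`, and `global_batch_size` stand in for whatever the model builds):

```python
import tensorflow.compat.v1 as tf

def make_train_op(loss, opt, global_batch_size):
    tvars = tf.trainable_variables()
    # Rescale the loss first if the goal is an average rather than a sum.
    grads = tf.gradients(loss / global_batch_size, tvars)
    # Sum each gradient across TPU replicas ourselves, instead of wrapping
    # the optimizer in CrossShardOptimizer.
    grads = [tf.tpu.cross_replica_sum(g) if g is not None else None
             for g in grads]
    return opt.apply_gradients(list(zip(grads, tvars)))
```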

hthu commented 3 years ago

I think if you use MTF under the hood, you wouldn't really need to run a CrossShardOptimizer, as MTF already does that given a layout and mesh.

For example, the `SimdMeshImpl` in MTF on TPU would lower `reduce_sum` to a mesh-specific allreduce, and that translates to a proper TPU CrossReplicaSum.

CrossShardOptimizer performs CrossReplicaSum under the hood, and for this case MTF has you covered as long as you pass in a proper mesh and layout.

So for what you've done above, you probably want to remove CrossShardOptimizer and simply run with tensorflow 2.x?
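
The usual mtf training path looks roughly like this (a sketch following the mesh-tensorflow examples rather than gpt-neo's actual model_fn; `loss`, `graph`, `mesh`, and `device_assignment` come from the surrounding model code):

```python
import mesh_tensorflow as mtf

# Gradients and optimizer updates are expressed on mtf tensors; there is no
# CrossShardOptimizer anywhere in this path.
var_grads = mtf.gradients(
    [loss], [v.outputs[0] for v in graph.trainable_variables])
optimizer = mtf.optimize.AdamWeightDecayOptimizer(learning_rate=2e-4)
update_ops = optimizer.apply_grads(var_grads, graph.trainable_variables)

# Lowering through SimdMeshImpl is where the mesh/layout turn mtf reductions
# into TPU all-reduces (CrossReplicaSum), which is why the wrapper is redundant.
mesh_shape = mtf.convert_to_shape("all:8")
layout_rules = mtf.convert_to_layout_rules(
    "intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y")
mesh_impl = mtf.simd_mesh_impl.SimdMeshImpl(
    mesh_shape, layout_rules, None, device_assignment)
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
tf_update_ops = [lowering.lowered_operation(op) for op in update_ops]
```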

StellaAthena commented 3 years ago

> @samriddhishree Yep, that as well. That is, I
>
> * switched to `%tensorflow 2.x`
>
> * removed the `args.processes`
>
> * added the `CrossShardOptimizer` wrapper
>
> Perhaps I should try switching to tensorflow 2.x only.
>
> ETA: Ok, if I only switch to tensorflow 2.x, the CrossShardOptimizer error appears. If I switch to tensorflow and use the CrossShardOptimizer wrapper patch, the above error appears, mentioning TPUStrategies and TPUEstimators

The CrossShardOptimizer should be 100% unnecessary if you're using TF 2.x for the reasons @hthu describes. I am extremely willing to believe that our Colab file is out of date, but I am highly skeptical of the CrossShardOptimizer wrapper.

All patches should be based on TF 2.x. TF 1.x is going away, and all code that only runs in TF 1.x will break in the near future.

StrangeTcy commented 3 years ago

@hthu, I've just re-checked and tried what you suggest: running with tensorflow 2.x and without the CrossShardOptimizer wrapper.

This results in the CrossShardOptimizer error.

Just to clarify, I didn't write this code, I'm just trying to use it & make it work.

So, I think that MTF is being used under the hood, but somehow -- due to it being an old version of mtf, or mtf itself lacking the necessary definitions, or something from the mtf source code being redefined here -- it doesn't seem to produce an optimizer ready to be run on a TPU.

hthu commented 3 years ago

What mesh_shape are you using? You might want at least 2x2 to start with.
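
For example, an 8-core TPU could use a 2-D mesh declared roughly like this (a hypothetical config fragment, not the project's actual config):

```python
# Hypothetical fragment of the model params, showing a 2-D mesh
# as an alternative to a 1-D mesh such as "all:8".
params = {
    # ...
    "mesh_shape": "x:4,y:2",   # 8 cores arranged as a 4x2 logical mesh
    "layout": "intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y",
    # ...
}
```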

StrangeTcy commented 3 years ago

@hthu I'm using all:8

ETA: in the other thread you say that the idea is that nothing should need wrapping, which I get and agree with. So now I'm trying to figure out where the error originates.

hthu commented 3 years ago

Do you have the latest stack trace? You probably shouldn't see any TPUEstimator error message -- or are you still in the 1.x realm?

StrangeTcy commented 3 years ago

@hthu

2021-02-15 11:04:26.763273: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-15 11:04:26.763406: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 0
Downloading: 100% 1.04M/1.04M [00:00<00:00, 5.93MB/s]
Downloading: 100% 456k/456k [00:00<00:00, 3.22MB/s]
Downloading: 100% 1.36M/1.36M [00:00<00:00, 7.61MB/s]
Saving config to /content/drive/MyDrive/GPT2_colab_trial
2021-02-15 11:04:30.923993: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-15 11:04:30.949992: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-15 11:04:31.006147: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-02-15 11:04:31.006204: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (17c0abb61b15): /proc/driver/nvidia/version does not exist
2021-02-15 11:04:31.226784: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7ff1a9057598>, {'n_head': 32, 'n_vocab': 50260, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0.1, 'train_batch_size': 256, 'attn_dropout': 0, 'train_steps': 572300, 'eval_steps': 0, 'predict_steps': 1, 'res_dropout': 0, 'eval_batch_size': 64, 'predict_batch_size': 1, 'iterations': 100, 'n_embd': 2048, 'datasets': ['nihexporter'], 'model': 'GPT', 'model_path': '/content/drive/MyDrive/GPT2_colab_trial', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global'], 'mesh_shape': 'all:8', 'layout': 'intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 2048, 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'nihexporter': {'path': '/content/drive/MyDrive/openwebtext_tokenized/*.tfrecords', 'eval_path': '', 'n_vocab': 50256, 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 8, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': True, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 500, 'predict': False, 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': '/content/drive/MyDrive/GPT2_colab_trial', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.104.141.194:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.104.141.194:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.104.141.194:8470', '_evaluation_master': 'grpc://10.104.141.194:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=8, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu.tpu_cluster_resolver.TPUClusterResolver object at 0x7ff1a8ff1240>}
_TPUContext: eval_on_tpu True
Querying Tensorflow master (grpc://10.104.141.194:8470) for TPU system metadata.
2021-02-15 11:04:31.237247: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:373] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
Initializing TPU system (master: grpc://10.104.141.194:8470) to fetch topology for model parallelism. This might take a while.
Found TPU system:
*** Num TPU Cores: 8
*** Num TPU Workers: 1
*** Num TPU Cores Per Worker: 8
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, -5884039716630823473)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, -3837800863408535284)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 3084760653800097538)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, -1421254968311746768)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, -5123767212799155974)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 1549024653082208868)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 5930724887607135313)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, -4494021334020959541)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 1089526645275849123)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 1187669109324654925)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, -3269809841973767338)
From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Calling model_fn.
WARNING:root:Changing batch size with sequential_input() will result in some data being skipped or repeated. Please ensure your batch size stays constant throughout training.
num_cores_per_replica: 1
computation_shape: [1, 1, 1, 1]
num_replicas: 8
device_assignment.topology.device_coordinates: [[[0 0 0 0]
  [0 0 0 1]
  [1 0 0 0]
  [1 0 0 1]
  [0 1 0 0]
  [0 1 0 1]
  [1 1 0 0]
  [1 1 0 1]]]
device_assignment.core_assignment: [[[0 0 0 0]]

 [[0 0 0 1]]

 [[1 0 0 0]]

 [[1 0 0 1]]

 [[0 1 0 0]]

 [[0 1 0 1]]

 [[1 1 0 0]]

 [[1 1 0 1]]]
2021-02-15 11:04:43.111549: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
device_list = ['/job:worker/task:0/device:CPU:0']
SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
SimdMeshImpl init: Shape[all=8] LayoutRules{('heads', 'x'), ('embd', 'y'), ('intermediate_expanded', 'x'), ('vocab', 'x'), ('memory_length', 'y')}
Device Assignment: <tensorflow.python.tpu.device_assignment.DeviceAssignment object at 0x7ff17ebca240>
serialize_num_microbatches: tokens_per_microbatch_per_replica=2048 batch_dim=Dimension(name='batch', size=256) sequence_length={'inputs': 2048, 'labels': 2048} batch_per_replica=256 num_microbatches=256

N TRAINABLE VARS:
1,315,581,952

ALL DIM NAMES:
intermediate_expanded
heads
embed_sequence
embd
vocab

### lots and lots 
## and lots of model details

From /content/GPTNeo/utils.py:284: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
tf_update_ops: [<tf.Operation 'assign_948/group_deps_154' type=NoOp>, <tf.Operation 'assign_949/group_deps_308' type=NoOp>, <tf.Variable 'AssignAddVariableOp' shape=() dtype=int64>]
Create CheckpointSaverHook.
Bypassing TPUEstimator hook
training_loop marked as finished
Reraising captured error
Traceback (most recent call last):
  File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 230, in main
    estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3130, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3386, in _model_fn
    _validate_tpu_training_graph(ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3817, in _validate_tpu_training_graph
    'CrossShardOptimizer must be used for model training on TPUs.')
ValueError: CrossShardOptimizer must be used for model training on TPUs.

That's what happens if I try using tensorflow 2.x in a colab notebook. The TPUEstimator hook gets bypassed for some reason I didn't really look into.

StellaAthena commented 3 years ago

Hi! Thanks for the feedback. We pushed some updates, including a totally revamped notebook; give it a try and let me know if you're still getting this error.