EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License

training on TPU v2.8-512 #184

Closed: riccardo247 closed this issue 3 years ago

riccardo247 commented 3 years ago

Hi, I am using the checkpoint from https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/ and want to train it further on my own dataset. Can you share a working configuration for 512 TPU cores? Are you using main.py to train? My config is below. I tried different combinations of mesh x and y, but I get errors. In particular, prediction works on 512 but training does not. I also tried training on a v2.8-8 with mesh x=4, y=2 and batch size 8, and that works. I cannot see what is going wrong. Thanks

1) Prediction with the command
python3 main.py --predict --prompt prompt4.txt --tpu tpu_512 --model GPT_XL
reports 1.3B parameters, and the output looks good as well.

2) When I train with the command
python3 main.py --model GPT_XL --steps_per_checkpoint 25000 --tpu tpu_512
I see this in the output:

SimdMeshImpl init: Shape[x=256, y=2] LayoutRules{('batch', 'x'), ('embd', 'y'), ('memory_length', 'y')}
Device Assignment: <tensorflow.python.tpu.device_assignment.DeviceAssignment object at 0x7fef48506f60>
serialize_num_microbatches: tokens_per_microbatch_per_replica=4096 batch_dim=Dimension(name='batch', size=512) sequence_length={'inputs': 2048, 'labels': 2048} batch_per_replica=2 num_microbatches=1

N TRAINABLE VARS: 1,315,575,808 <-correct
....
variables: 3.95e+09
variables/trainable: 1.32e+09
variables/untrainable: 2.63e+09
....
Original stack trace for 'OutfeedDequeueTuple_440':
  File "main.py", line 257, in <module>
    main(args)
  File "main.py", line 231, in main
    estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
    saving_listeners=saving_listeners)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3277, in _model_fn
    host_ops = host_call.create_tpu_hostcall()
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2261, in create_tpu_hostcall
    device_ordinal=ordinal_id)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/ops/gen_tpu_ops.py", line 3455, in outfeed_dequeue_tuple
    device_ordinal=device_ordinal, name=name)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

training_loop marked as finished
Reraising captured error
Traceback (most recent call last):
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Unknown: From /job:worker/replica:0/task:33: Bad hardware status: Queue is closed. [[{{node while/InfeedQueue/enqueue/2}}]]
Unknown: From /job:worker/replica:0/task:0: Bad hardware status: Queue is closed. [[{{node while/InfeedQueue/enqueue/440}}]

I am using GPT_XL.json:

{
    "n_head" : 16,
    "n_vocab" : 50257,
    "embed_dropout" : 0,
    "lr" : 0.0002,
    "lr_decay" : "cosine",
    "warmup_steps" : 3000,
    "beta1" : 0.9,
    "beta2" : 0.95,
    "epsilon" : 1e-08,
    "opt_name" : "adam",
    "weight_decay" : 0,
    "train_batch_size" : 512,
    "attn_dropout" : 0,
    "train_steps" : 600000,
    "lr_decay_end" : 300000,
    "eval_steps" : 10,
    "predict_steps" : 0,
    "res_dropout" : 0,
    "eval_batch_size" : 256,
    "predict_batch_size" : 256,
    "iterations" : 500,
    "n_embd" : 2048,
    "datasets" : [["...", null, null, null]],
    "model_path" : "my data...",
    "n_ctx" : 2048,
    "n_layer" : 24,
    "scale_by_depth" : true,
    "scale_by_in" : false,
    "attention_types" : [[["global", "local"], 12]],
    "mesh_shape" : "x:256,y:2",
    "layout" : "batch:x,memory_length:y,embd:y",
    "activation_function" : "gelu",
    "recompute_grad" : true,
    "gradient_clipping" : 1.0,
    "tokens_per_mb_per_replica" : 4096,
    "precision" : "bfloat16",
    "padding_id" : 50257,
    "eos_id" : 50256
}
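For readers following along, here is a rough sketch of how the numbers in the serialize_num_microbatches log line relate to the config above. It is not the mesh-tensorflow implementation: the function name microbatch_plan and the rounding rule are assumptions made for illustration, and the sketch assumes the batch dimension is laid out on mesh axis x, as in the layout string "batch:x,...".

```python
def microbatch_plan(mesh_x, mesh_y, train_batch_size, n_ctx, tokens_per_mb_per_replica):
    """Approximate the per-replica batch split (hypothetical helper, not library code)."""
    if train_batch_size % mesh_x != 0:
        raise ValueError("train_batch_size must be divisible by mesh x")
    # Total devices in the mesh: 256 * 2 = 512 for the failing config.
    cores = mesh_x * mesh_y
    # The batch is split along x, so each data-parallel replica sees batch / x examples.
    batch_per_replica = train_batch_size // mesh_x
    tokens_per_replica = batch_per_replica * n_ctx
    # Assumed rule: enough microbatches to stay at or below the token budget.
    num_microbatches = max(1, tokens_per_replica // tokens_per_mb_per_replica)
    return {
        "cores": cores,
        "batch_per_replica": batch_per_replica,
        "tokens_per_replica": tokens_per_replica,
        "num_microbatches": num_microbatches,
    }


# Values taken from the config in this issue.
plan_256x2 = microbatch_plan(256, 2, 512, 2048, 4096)
plan_128x4 = microbatch_plan(128, 4, 512, 2048, 4096)
print(plan_256x2)
print(plan_128x4)
```

For x=256, y=2 this gives batch_per_replica=2 and num_microbatches=1, which agrees with the log. Both mesh shapes cover 512 cores, so the difference between them is in how the model is split along y, not in the core count.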

riccardo247 commented 3 years ago

Using mesh x=128 and y=4 works, so the model does not fit with y=2. I guess 1,315,575,808 * 4 bytes = 5.26 GB is too much for an 8 GB TPU? Do you get the same result, or am I doing something wrong? Thanks
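A back-of-the-envelope version of that estimate, as a sketch under loud assumptions: every variable is stored as 4-byte float32, the "untrainable" variables in the log are two optimizer slots per weight, every variable is split evenly along mesh axis y, and each TPU v2 core has 8 GiB of memory. None of these is confirmed in this thread, and activations are ignored entirely, so treat the result as an order-of-magnitude check only.

```python
BYTES_PER_FP32 = 4                   # assumption: variables kept in float32
HBM_PER_CORE_BYTES = 8 * 1024 ** 3   # assumption: 8 GiB of memory per TPU v2 core

trainable = 1_315_575_808            # "N TRAINABLE VARS" from the log
# The log reports variables: 3.95e+09, about 3x the trainable count
# (assumed here to be the weights plus two optimizer slots per weight).
total_variables = 3 * trainable


def per_core_variable_bytes(n_variables, mesh_y, bytes_per_value=BYTES_PER_FP32):
    """Bytes of variables per core if everything were split evenly along y (idealized)."""
    return n_variables * bytes_per_value / mesh_y


for mesh_y in (2, 4):
    used = per_core_variable_bytes(total_variables, mesh_y)
    fraction = used / HBM_PER_CORE_BYTES
    print(f"y={mesh_y}: {used / 1024 ** 3:.2f} GiB of variables per core "
          f"({fraction:.0%} of assumed capacity)")

fraction_y2 = per_core_variable_bytes(total_variables, 2) / HBM_PER_CORE_BYTES
fraction_y4 = per_core_variable_bytes(total_variables, 4) / HBM_PER_CORE_BYTES
```

Under these assumptions, y=2 leaves very little headroom for activations and temporary buffers, while y=4 halves the per-core share, which would be consistent with x=128, y=4 working where x=256, y=2 did not.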

StellaAthena commented 3 years ago

I was going to recommend changing the mesh shape, but I'm glad you figured that out by yourself :)