riccardo247 closed this issue 3 years ago
Using mesh x=128 and y=4 works, so the model does not fit with y=2. I guess 1,315,575,808 * 4 bytes = 5.26 GB is too much for an 8 GB TPU? Do you get the same result, or am I doing something wrong? Thanks
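The estimate above can be checked with a quick back-of-the-envelope calculation. This is only a sketch under stated assumptions: 4 bytes per value (float32 storage), the variable counts printed in the training log (1,315,575,808 trainable and about 3.95e9 variables in total, the remainder presumably being optimizer state), and an even split of every variable along the `y` mesh dimension, which will not hold for every tensor. Activations and temporary buffers are ignored, so real usage would be higher. The function names are hypothetical.

```python
# Counts taken from the training log in this thread.
TRAINABLE_PARAMS = 1_315_575_808   # "N TRAINABLE VARS"
TOTAL_VARIABLES = 3.95e9           # "variables" (trainable + untrainable)
BYTES_PER_VALUE = 4                # assumption: values are stored as float32


def gigabytes(n_values, bytes_per_value=BYTES_PER_VALUE):
    """Size of n_values numbers in GB (1 GB = 1e9 bytes)."""
    return n_values * bytes_per_value / 1e9


def per_core_gigabytes(n_values, model_parallel):
    """Assumption: variables are split evenly over the y mesh axis."""
    return gigabytes(n_values) / model_parallel


print(f"trainable weights, unsharded: {gigabytes(TRAINABLE_PARAMS):.2f} GB")
for y in (2, 4):
    print(f"all variables per core, y={y}: "
          f"{per_core_gigabytes(TOTAL_VARIABLES, y):.2f} GB")
```

Under these assumptions the variables alone come to about 7.9 GB per core with y=2 and about 3.95 GB with y=4, which would be consistent with y=4 working and y=2 failing on 8 GB of memory per core.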
I was going to recommend changing the mesh shape, but I'm glad you figured that out by yourself :)
Hi, I am using the checkpoint from https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/ and want to train it further on my own dataset. Can you give a working configuration for a 512 TPU? Are you using main.py to train? My config is below. I tried different combinations of mesh x and y, but I get errors. In particular, prediction works but training on 512 does not. I tried training on v2.8-8 with mesh x=4, y=2 and batch size=8, and that works. I cannot see what is going wrong. Thanks
1) Prediction with the command:
python3 main.py --predict --prompt prompt4.txt --tpu tpu_512 --model GPT_XL
reports 1.3B parameters, and the output is good as well.
2) Training with the command:
python3 main.py --model GPT_XL --steps_per_checkpoint 25000 --tpu tpu_512
produces the following output:
SimdMeshImpl init: Shape[x=256, y=2] LayoutRules{('batch', 'x'), ('embd', 'y'), ('memory_length', 'y')}
Device Assignment: <tensorflow.python.tpu.device_assignment.DeviceAssignment object at 0x7fef48506f60>
serialize_num_microbatches:
tokens_per_microbatch_per_replica=4096
batch_dim=Dimension(name='batch', size=512)
sequence_length={'inputs': 2048, 'labels': 2048}
batch_per_replica=2
num_microbatches=1
N TRAINABLE VARS: 1,315,575,808 <-correct
....
variables: 3.95e+09
variables/trainable: 1.32e+09
variables/untrainable: 2.63e+09
....
Original stack trace for 'OutfeedDequeueTuple_440':
File "main.py", line 257, in <module>
main(args)
File "main.py", line 231, in main
estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
saving_listeners=saving_listeners)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
self.config)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
config)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3277, in _model_fn
host_ops = host_call.create_tpu_hostcall()
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2261, in create_tpu_hostcall
device_ordinal=ordinal_id)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/ops/gen_tpu_ops.py", line 3455, in outfeed_dequeue_tuple
device_ordinal=device_ordinal, name=name)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
op_def=op_def)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
self._traceback = tf_stack.extract_stack()
training_loop marked as finished
Reraising captured error
Traceback (most recent call last):
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
return fn(*args)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
target_list, run_metadata)
File "/home/ricgiac/gpt-neo/py37/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Unknown: From /job:worker/replica:0/task:33:
Bad hardware status: Queue is closed.
[[{{node while/InfeedQueue/enqueue/2}}]]
Unknown: From /job:worker/replica:0/task:0:
Bad hardware status: Queue is closed.
[[{{node while/InfeedQueue/enqueue/440}}]]
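The `serialize_num_microbatches` values in the training log above can be reproduced by hand. The sketch below is not the Mesh TensorFlow implementation; it only redoes the arithmetic, assuming the batch is split evenly over the `x` mesh axis (per the `batch:x` layout rule) and that the number of microbatches is the per-replica token count integer-divided by `tokens_per_mb_per_replica`, with a minimum of 1. The function name `microbatch_plan` is hypothetical.

```python
def microbatch_plan(batch_size, seq_len, data_parallel, tokens_per_mb_per_replica):
    """Return (batch_per_replica, tokens_per_replica, num_microbatches).

    data_parallel is the size of the mesh axis the batch is split over
    (x in this thread). The formula is an assumption, not library code.
    """
    if batch_size % data_parallel != 0:
        raise ValueError("batch size must be divisible by the data-parallel axis")
    batch_per_replica = batch_size // data_parallel
    tokens_per_replica = batch_per_replica * seq_len
    num_microbatches = max(1, tokens_per_replica // tokens_per_mb_per_replica)
    return batch_per_replica, tokens_per_replica, num_microbatches


# Values from the log: batch 512, sequence length 2048, x=256, 4096 tokens per microbatch.
print(microbatch_plan(512, 2048, 256, 4096))  # (2, 4096, 1)

# The same batch on the x=128, y=4 mesh, under the same assumptions.
print(microbatch_plan(512, 2048, 128, 4096))  # (4, 8192, 2)
```

The first call matches the logged `batch_per_replica=2` and `num_microbatches=1`. The second result is only what this simplified formula predicts for the x=128 mesh; it was not taken from a log.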
I am using GPT_XL.json:
{
  "n_head": 16,
  "n_vocab": 50257,
  "embed_dropout": 0,
  "lr": 0.0002,
  "lr_decay": "cosine",
  "warmup_steps": 3000,
  "beta1": 0.9,
  "beta2": 0.95,
  "epsilon": 1e-08,
  "opt_name": "adam",
  "weight_decay": 0,
  "train_batch_size": 512,
  "attn_dropout": 0,
  "train_steps": 600000,
  "lr_decay_end": 300000,
  "eval_steps": 10,
  "predict_steps": 0,
  "res_dropout": 0,
  "eval_batch_size": 256,
  "predict_batch_size": 256,
  "iterations": 500,
  "n_embd": 2048,
  "datasets": [["...", null, null, null]],
  "model_path": "my data...",
  "n_ctx": 2048,
  "n_layer": 24,
  "scale_by_depth": true,
  "scale_by_in": false,
  "attention_types": [[["global", "local"], 12]],
  "mesh_shape": "x:256,y:2",
  "layout": "batch:x,memory_length:y,embd:y",
  "activation_function": "gelu",
  "recompute_grad": true,
  "gradient_clipping": 1.0,
  "tokens_per_mb_per_replica": 4096,
  "precision": "bfloat16",
  "padding_id": 50257,
  "eos_id": 50256
}
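Since the question turns on which `mesh_shape` works, a quick consistency check of the config can rule out simple mismatches before launching a run. The sketch below is a hypothetical helper, not part of the gpt-neo repository. It parses the `mesh_shape` and `layout` strings in the format shown above, checks that the mesh size equals the number of cores (assumed to be 512 here), and checks that each dimension named in the layout is divisible by the mesh axis it is mapped to. Mapping `memory_length` to `n_ctx` is an assumption. Passing these checks says nothing about whether the model fits in memory.

```python
def parse_pairs(spec):
    """Parse a string such as 'x:256,y:2' into {'x': '256', 'y': '2'}."""
    return dict(item.split(":") for item in spec.split(","))


def check_mesh(config, num_cores):
    """Return a list of human-readable problems (empty if none were found)."""
    mesh = {axis: int(size) for axis, size in parse_pairs(config["mesh_shape"]).items()}
    layout = parse_pairs(config["layout"])
    # Assumption: which config keys give the size of each layout dimension.
    dim_sizes = {
        "batch": config["train_batch_size"],
        "embd": config["n_embd"],
        "memory_length": config["n_ctx"],
    }
    problems = []
    mesh_size = 1
    for size in mesh.values():
        mesh_size *= size
    if mesh_size != num_cores:
        problems.append(f"mesh has {mesh_size} devices, expected {num_cores}")
    for dim, axis in layout.items():
        size = dim_sizes.get(dim)
        if size is not None and size % mesh[axis] != 0:
            problems.append(f"{dim}={size} is not divisible by {axis}={mesh[axis]}")
    return problems


# Only the keys relevant to the check, copied from the config above.
config = {
    "train_batch_size": 512,
    "n_embd": 2048,
    "n_ctx": 2048,
    "mesh_shape": "x:256,y:2",
    "layout": "batch:x,memory_length:y,embd:y",
}
print(check_mesh(config, num_cores=512))  # []
```

Both `x:256,y:2` and `x:128,y:4` pass this check, so the failure reported here is not a simple shape mismatch of the kind this helper detects.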