ConnorJL / GPT2

An implementation of training for GPT2 that supports TPUs
MIT License

Predicting with PrettyBigModel `InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)` #4

Closed · pkmital closed this issue 5 years ago

pkmital commented 5 years ago

Hi, I was interested in testing your PrettyBig model. I've downloaded the model and edited the PrettyBig.json to point to the downloaded encoder and model paths. When running:

python3 main.py --model PrettyBig.eval.json --predict_text "Hello there! My name is"

I get the following error:

{'n_head': 16, 'encoder_path': '/Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/encoder', 'n_vocab': 50257, 'embed_dropout': 0.0, 'lr': 0.00025, 'warmup_steps': 2000, 'weight_decay': 0.01, 'beta1': 0.9, 'beta2': 0.98, 'epsilon': 1e-09, 'opt_name': 'adam', 'train_batch_size': 256, 'attn_dropout': 0.0, 'train_steps': 10000, 'eval_steps': 10, 'max_steps': 604800, 'data_path': 'gs://connors-datasets/openwebtext/', 'scale': 0.14433756729740646, 'res_dropout': 0.1, 'predict_batch_size': 1, 'eval_batch_size': 256, 'iterations': 100, 'n_embd': 1024, 'input': 'openwebtext_longbiased', 'model': 'GPT2', 'model_path': '/Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/PrettyBig', 'n_ctx': 1024, 'predict_path': 'logs/predictions_SortaBig.txt', 'n_layer': 25, 'use_tpu': False, 'precision': 'float32'}
Using config: {'_model_dir': '/Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/PrettyBig', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x13fbf8ef0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Generating predictions...
From /Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Calling model_fn.
From /Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/models/gpt2/sample.py:57: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
From /Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/models/gpt2/sample.py:59: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
Done calling model_fn.
Graph was finalized.
2019-06-08 15:55:47.498527: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
From /Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Restoring parameters from /Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/PrettyBig/model.ckpt
Running local_init_op.
Done running local_init_op.
Traceback (most recent call last):
  File "/Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)
         [[{{node sample_sequence/while/model/GatherV2_1}}]]
python3 --version
Python 3.6.8 :: Anaconda, Inc.
pip3 list | grep tensorflow
mesh-tensorflow                    0.0.5
tensorflow                         1.13.1
tensorflow-datasets                1.0.1
tensorflow-estimator               1.13.0
tensorflow-metadata                0.13.0
tensorflow-probability             0.6.0

Any ideas appreciated. Thanks!

tbfly commented 5 years ago

Maybe you can try tensorflow==1.13.0, as https://medium.com/@NPCollapse/replicating-gpt2-1-5b-86454a7f26af hints:

Tensorflow (I was using version 1.13) is…not perfect.
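
If you want to try that pin, something like the following should get you a 1.13.x release (the exact constraint spelling here is my guess, not from the article):

pip3 install 'tensorflow<1.14'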

ConnorJL commented 5 years ago

This is a known bug. I haven't yet had the time to track down the exact cause. Three things you can try: set the precision to float32, use a GPU instead of a CPU, or change the "train_batch_size" and "predict_batch_size" parameters to 1. Some of these seem to fix it sometimes. I will fix this bug when I have the time to actually track down its source.
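
For reference, a minimal sketch of those overrides in the model's .json config (key names are taken from the config dump above; the values are the workaround settings, and the rest of the file stays unchanged):

{
  "precision": "float32",
  "train_batch_size": 1,
  "predict_batch_size": 1
}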

The bug also shouldn't happen if you predict with a single word.

kyb3r commented 5 years ago

I got the same error, here is the full output and traceback I got: https://hasteb.in/wilupika.py

Maybe it will be helpful :)

minimaxir commented 5 years ago

I encountered the same issue working on gpt-2-simple: https://github.com/minimaxir/gpt-2-simple/issues/38

The solution was to subtract the length of the prefix tokens from the maximum generation length, which prevents the out-of-bounds index.
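
A rough sketch of that idea (illustrative names and token ids, not the actual gpt-2-simple code):

def clamp_sample_length(requested_length, prefix_tokens, n_ctx=1024):
    # Position indices run 0..n_ctx-1, so generating past
    # n_ctx - len(prefix_tokens) makes the position-embedding gather
    # ask for index n_ctx, which is exactly the
    # "indices[0,0] = 1024 is not in [0, 1024)" error above.
    return min(requested_length, n_ctx - len(prefix_tokens))

# e.g. a 5-token prompt leaves room for at most 1019 generated tokens
length = clamp_sample_length(1024, prefix_tokens=[1, 2, 3, 4, 5])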

ConnorJL commented 5 years ago

Thanks minimaxir! I've implemented that fix now and I think everything should be working. If this problem crops up again for anyone, feel free to reopen this issue.