EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License

Can't infer on the provided Colab #148

Closed · JanPokorny closed 3 years ago

JanPokorny commented 3 years ago

In the provided Colab (using only the provided cells), after downloading the pre-trained GPT3_XL, I tried to run inference, which produced the following output from the very last cell:

out.txt

The interesting part seems to be:

```
Starting infeed thread controller.
Starting outfeed thread controller.
Initialized dataset iterators in 0 seconds
Before copy master to slices.
Done with copy master to slices.
Enqueue next (1) batch(es) of data to infeed.
Dequeue next (1) batch(es) of data from outfeed.
Outfeed finished for iteration (0, 0)
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:worker/replica:0/task:0:
DisableableBlockingRefcount is disabled.
     [[node OutfeedDequeueTuple_7 (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:2261) ]]

Original stack trace for 'OutfeedDequeueTuple_7':
  File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 184, in main
    handle_pred_output_fn(predictions, logger, enc, params, out_name=f"predictions_{args.sacred_id}_{current_step}")
  File "/content/GPTNeo/inputs.py", line 165, in handle_pred_output
    for i, p in enumerate(predictions):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3167, in predict
    yield_single_examples=yield_single_examples):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 613, in predict
    self.config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3525, in _model_fn
    host_call_ret = host_calls.create_tpu_hostcall()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2261, in create_tpu_hostcall
    device_ordinal=ordinal_id)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_tpu_ops.py", line 3455, in outfeed_dequeue_tuple
    device_ordinal=device_ordinal, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()
```

```
Graph was finalized.
Restoring parameters from gs://peppa-test-1/GPT3_XL/model.ckpt-362000
Closing session due to error From /job:worker/replica:0/task:0:
9 root error(s) found.
  (0) Resource exhausted: Failed to allocate request for 1.0KiB (1024B) on device ordinal 3
     [[{{node ConstantFolding/split-folded-3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[ConstantFolding/split-folded-4_G4895]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```

...followed by many more similar OOM errors.
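
As an aside, the `report_tensor_allocations_upon_oom` mentioned in the hint is a field of TF1's `RunOptions` proto. Below is a minimal sketch of how you would enable it, assuming you were driving a `tf.compat.v1` session yourself; the Colab goes through `TPUEstimator`, which doesn't expose this directly, so the toy graph here is illustrative only:

```python
# Illustrative only: the Colab drives everything through TPUEstimator,
# so there is no direct way to pass RunOptions there. This shows what
# the log's hint means for a plain tf.compat.v1 session.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# A toy graph standing in for the real workload.
x = tf.random.normal([1024, 1024])
y = tf.matmul(x, x)

# When an OOM occurs, the error message will include a list of the
# tensors that were allocated at the time.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    result = sess.run(y, options=run_options)
```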

I'd be glad for any help with running inference in Google Colab. Training actually seems to work and saves a new checkpoint, but I have not been able to run inference even on the provided pre-trained model.

StellaAthena commented 3 years ago

Thanks for trying the code out! There was a problem with the way the configs were set up, which I believe is now fixed (at least, it works for me). Google can be finicky about the amount of compute it gives you, so try fiddling with the settings in "Modify config for colab", or try the 1.3B model instead of the 2.7B model if you're still getting OOMs.
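
For anyone following along, here is a rough sketch of the kind of edit the "Modify config for colab" cell makes. The field names follow the repo's JSON config format, but the exact path and values here are assumptions; tune them to your session:

```python
import json

# Hypothetical path: point this at whichever config the notebook copied in.
config_path = "configs/GPT3_XL.json"

with open(config_path) as f:
    config = json.load(f)

# Smaller batch sizes lower per-core TPU memory pressure; the values that
# actually fit depend on what Google has allocated to your session.
config["train_batch_size"] = 8
config["predict_batch_size"] = 1

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```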

LeCongThuong commented 3 years ago

Hi @StellaAthena, thanks for sharing your great work! I trained GPT3_XL on a Colab TPU with my custom dataset. Training went fine, but at inference time I hit the same error as @JanPokorny. I would have thought training always costs more resources than inference, so why does inference cause an OOM when training does not? Any advice!

JanPokorny commented 3 years ago

@StellaAthena Inference in the new notebook works for me, thanks!

@LeCongThuong Try starting with a fresh notebook; the updated version worked for me as-is -- I just had to do the Google auth and enter my bucket URL, and then I was able to download and infer from GPT3_XL.
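
(For reference, the Google auth step amounts to roughly the following; `google.colab.auth` is the standard Colab helper, and the bucket name is a placeholder:)

```python
from google.colab import auth

# Grants the Colab runtime access to your GCS bucket, e.g. gs://your-bucket.
auth.authenticate_user()
```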

LeCongThuong commented 3 years ago

Thanks @JanPokorny, I tried restarting the Colab runtime, but that did not work. Then I upgraded to Colab Pro and it worked. So as @StellaAthena said, the problem lies in the resources Google gives us, not in the code repository.

JanPokorny commented 3 years ago

@LeCongThuong I was confused because I was able to train but not to infer, so I suspected the OOM had a different underlying cause. (Also, the RAM bar didn't show usage in the Colab UI, which it apparently doesn't for TPUs.) With the updated notebook I'm able to fine-tune GPT3_XL and run inference on the fine-tuned model even on the free tier.