EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License

Can't generate samples from pre-trained GPT3_XL using main.py without errors #163

Closed texturejc closed 3 years ago

texturejc commented 3 years ago

Describe the bug

When I run the provided colab notebook so as to sample from a pre-trained model, I get this error:

FileNotFoundError: [Errno 2] No such file or directory: 'configs/GPT3_XL.json'

To Reproduce

When I go through the steps to generate samples from a pre-trained model without fine-tuning, I do fine until I try to generate the predictions. Specifically, when I run

!python3 main.py --model $pretrained_model --steps_per_checkpoint 500 --tpu colab --predict --prompt example_prompt.txt

I get an error:

python3: can't open file 'main.py': [Errno 2] No such file or directory

Specifying the full filepath for main.py solves this, but I then need to manually install mesh_tensorflow, tokenizers, and ftfy with pip, which I presumably shouldn't have to do? Once this has been done, I run the cell and get an error:

FileNotFoundError: [Errno 2] No such file or directory: 'configs/GPT3_XL.json'

This is no surprise, as this directory does not contain this .json file. But I don't know how to proceed from here, or what I'm doing wrong to get this result.

Expected behavior

I had expected to be able to generate predictions from the pre-trained GPT3_XL model on the basis of the text prompts supplied.

Screenshots

Screenshot 2021-03-23 at 22 42 40

StellaAthena commented 3 years ago

It looks like the working directory is incorrectly set. Try restarting your runtime and following the steps again, in order. If you get the same error, try

import os
os.chdir('/content/gpt-neo')

If that doesn’t fix the problem, run !ls and !pwd and post the results here.
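
A quick way to confirm the working-directory diagnosis is to check whether the files the script needs are reachable from the current directory. The sketch below is only illustrative: the helper name `missing_repo_files` and the list of expected entries are assumptions based on the error messages above, not part of the notebook.

```python
import os
from pathlib import Path

# Assumed for illustration: entries that should exist in the repo root.
EXPECTED = ("main.py", "configs")


def missing_repo_files(directory=".", expected=EXPECTED):
    """Return the expected entries that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    print("Working directory:", os.getcwd())
    missing = missing_repo_files()
    if missing:
        print("Missing from this directory:", ", ".join(missing))
    else:
        print("main.py and configs are both present here.")
```

If anything is reported missing, the notebook is probably running from a different directory than the cloned repo, and `os.chdir` into the repo folder is the next thing to try.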

texturejc commented 3 years ago

Thanks for getting back to me. I've restarted the runtime and done everything in order and got the same error.

run !ls and !pwd and post the results here.

The results of this are as follows:

!ls

CODEOWNERS                     inputs.py      requirements.txt
configs                        LICENSE        run_experiment.py
configs.py                     logs           sample.py
data                           main.py        scripts
docker-compose.yml             model_fns.py   tasks.py
Dockerfile                     models         test_models.py
encoders.py                    optimizers.py  utils.py
export.py                      __pycache__
GPTNeo_example_notebook.ipynb  README.md

!pwd

/content/gpt-neo

Any further help appreciated!

khomchyk commented 3 years ago

I can confirm the issue. I get the following error when using the pretrained model via Google Colab.

Done calling model_fn.
TPU job name worker
Graph was finalized.
Restoring parameters from /content/GPTNeo/the-eye.eu/public/AI/gptneo-release/GPT3_XL/model.ckpt-362000
prediction_loop marked as finished
Reraising captured error
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on /content/GPTNeo/the-eye.eu/public/AI/gptneo-release/GPT3_XL/model.ckpt-362000: Unimplemented: File system scheme '[local]' not implemented (file: '/content/GPTNeo/the-eye.eu/public/AI/gptneo-release/GPT3_XL/model.ckpt-362000')
     [[{{node save/RestoreV2_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 1298, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)

GoMapur commented 3 years ago

My guess: the JSON file is named gpt3_XL_256_Pile.json, not gpt3_XL. Open the configs folder and look at the names of the JSON files.
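
To check this guess without browsing the folder by hand, one could list the JSON files in `configs` and compare them to the model name case-insensitively. This is a rough sketch: `find_config_candidates` is a hypothetical helper, and it assumes (from the error message) that the script looks for `configs/<model name>.json`.

```python
from pathlib import Path


def find_config_candidates(model_name, configs_dir="configs"):
    """List config files whose names start with `model_name`, ignoring case.

    Hypothetical helper: assumes configs are stored as <configs_dir>/<name>.json.
    """
    wanted = model_name.lower()
    return sorted(
        path.name
        for path in Path(configs_dir).glob("*.json")
        if path.stem.lower().startswith(wanted)
    )
```

If the folder contains `gpt3_XL_256_Pile.json`, calling `find_config_candidates("GPT3_XL")` should return it, which would point to a naming mismatch rather than a missing file.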

StellaAthena commented 3 years ago

My guess: the JSON file is named gpt3_XL_256_Pile.json, not gpt3_XL. Open the configs folder and look at the names of the JSON files.

That was my thinking, but it shouldn’t be a problem if they followed the notebook. It also doesn’t explain why main.py can’t be found.

texturejc commented 3 years ago

That was my thinking, but it shouldn’t be a problem if they followed the notebook. It also doesn’t explain why main.py can’t be found.

One possible reason for this is that I assumed I'm meant to clone the repo into the colab so as to access the scripts like main.py. This isn't specified in the instructions, so I wonder if some other protocol is intended instead?

StellaAthena commented 3 years ago

That was my thinking, but it shouldn’t be a problem if they followed the notebook. It also doesn’t explain why main.py can’t be found.

One possible reason for this is that I assumed I'm meant to clone the repo into the colab so as to access the scripts like main.py. This isn't specified in the instructions, so I wonder if some other protocol is intended instead?

You do not need to modify the notebook (other than the couple places that you're instructed to). The setup button runs the command

%tensorflow_version 2.x
!git clone https://github.com/EleutherAI/GPTNeo
%cd GPTNeo
!pip3 install -q -r requirements.txt
pretrained_model = None
dataset = None

It is very possible that adding a second clone command is what's causing problems.
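
If the setup cell might be run more than once, or a manual clone has already happened, a guard along these lines would avoid cloning a second time. This is a sketch rather than what the notebook does: `ensure_repo` and its `run` parameter are hypothetical, and the repository URL and `GPTNeo` directory name are taken from the setup cell above.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/EleutherAI/GPTNeo"  # from the setup cell


def ensure_repo(repo_dir="GPTNeo", url=REPO_URL, run=subprocess.run):
    """Clone `url` into `repo_dir` unless that directory already exists.

    Returns True if a clone was started, False if it was skipped.
    `run` is injectable so the function can be exercised without network access.
    """
    if Path(repo_dir).is_dir():
        return False
    run(["git", "clone", url, str(repo_dir)], check=True)
    return True
```

After it returns, `os.chdir(repo_dir)` would play the role of `%cd GPTNeo`.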

texturejc commented 3 years ago

OK, very good. I didn't see that the setup cell had code in it so I didn't run it. Running that first solves the problem: main.py runs fine, and if gpt3_XL_256_Pile.json is renamed to gpt3_XL then the script executes.

EDIT: The problem below was solved quickest by creating a new bucket. No response needed.

I appreciate that this isn't a place I should ask for support for using Google Cloud, but I am getting a permissions error when I run main.py:

        "message": "my_email@gmail.com does not have storage.objects.get access to the Google Cloud Storage object.",
        "domain": "global",
        "reason": "forbidden"

I've tried changing the bucket permissions to Storage Admin and Storage Object Viewer for allUsers, but the error persists. Does anyone maybe have a quick fix?

royherma commented 3 years ago

@StellaAthena I also ran into this problem. The confusion arose because the "setup" block is hidden and easy to miss when navigating from the GitHub README page to the Colab.

I think that highlighting the Colab (which is awesome!!) on the GitHub README, as well as somehow highlighting the "setup" part in the Colab, will help mitigate issues like this in the future.

StellaAthena commented 3 years ago

@royherma Thank you for the feedback. What does it look like in your mind to “highlight” the Colab notebook? In what way is the current version insufficient?

royherma commented 3 years ago

It can be anything from a big BOLD title to emojis, or anything else that draws the reader's attention to completing the setup section first.

Additionally, if it is possible to force the setup section to be "expanded", that would also be useful.

Let me know if that makes sense or not.

royherma commented 3 years ago

Also @StellaAthena, somewhat related to this issue and its solution, here is another example of possibly confusing instructions.

Screen Shot 2021-05-01 at 14 33 29

In the photo above, you can see that the instructions say "you can skip this section".

However, the line below, under "Sampling only", states that you should select this option and then move to the "Pretrained model section".

Not selecting and running that section was one of the reasons I ran into failures when running the generation on the pretrained model.

I hope I'm not being annoying about it; I just think these few small tweaks would make the Colab clearer to devs like myself :)