facebookresearch / metaseq

Repo for external large-scale work
MIT License
6.45k stars, 722 forks

Cannot find end-of-document symbol (</s>) in tokenizer #128

Closed: xhluca closed this issue 2 years ago

xhluca commented 2 years ago

Following #19, I was able to download the correct files and tried to run OPT-2.7b. I tried the following command to boot the API:

python -m metaseq_cli.interactive_hosted

However, startup failed with the following error:

2-05-31 23:29:16 | INFO | metaseq_cli.interactive | Local checkpoint copy already exists, skipping copy
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 298, in <module>
    cli_main()
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 294, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/toolkit/opt/metaseq/metaseq/distributed/utils.py", line 263, in call_main
    return main(cfg, **kwargs)
  File "/home/toolkit/opt/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/toolkit/opt/metaseq/metaseq/hub_utils.py", line 471, in load_model
    task = tasks.setup_task(self.cfg.task)
  File "/home/toolkit/opt/metaseq/metaseq/tasks/__init__.py", line 46, in setup_task
    return task.setup_task(cfg, **kwargs)
  File "/home/toolkit/opt/metaseq/metaseq/tasks/language_modeling.py", line 188, in setup_task
    return cls(args)
  File "/home/toolkit/opt/metaseq/metaseq/tasks/language_modeling.py", line 156, in __init__
    args.end_of_document_symbol
AssertionError: Cannot find end-of-document symbol (</s>) in tokenizer

xhluca commented 2 years ago

@hunterlang have you run into this issue?

xhluca commented 2 years ago

I just ran the md5sum on the files and got this:

$ md5sum gpt2*
75a37753dd7a28a2c5df80c28bf06e4e  gpt2-merges.txt
d9f1a1235d1390093ade7a7e05d11190  gpt2-vocab.json

However, these are not the values I expected: https://github.com/facebookresearch/metaseq/issues/81#issuecomment-1122617595
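
For anyone repeating this check from Python rather than the shell, here is a minimal sketch. The helper names `md5_of_file` and `find_mismatches` are made up for illustration and are not part of metaseq, and the expected digests are something you would fill in yourself from a source you trust.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected):
    """Map each file name whose digest differs from `expected` to its actual digest.

    `expected` is a dict of file name -> hex digest supplied by the caller
    (hypothetical input; nothing here knows the correct values).
    """
    mismatches = {}
    for name, want in expected.items():
        got = md5_of_file(Path(directory) / name)
        if got != want.lower():
            mismatches[name] = got
    return mismatches
```

An empty result from `find_mismatches(".", {"gpt2-vocab.json": "<expected digest>"})` would mean the local file matches the digest you supplied.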

I obtained the gpt2-vocab.json file from here: https://github.com/huggingface/swift-coreml-transformers/blob/master/Resources/gpt2-vocab.json

So I'm guessing I should be getting it from somewhere else. Is there any reason why the gpt2-vocab.json file is not included in this repo? It would make the process easier.

Skyy93 commented 2 years ago

Do not use the gpt2-vocab.json from there.

You can find the correct vocab and merges here: https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/assets
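
A quick way to tell whether a given vocab file is likely to trip the assertion above is to look for the `</s>` entry directly. This is only a sketch: it assumes gpt2-vocab.json is a JSON object mapping token strings to ids, and `vocab_has_symbol` is a made-up helper, not the check metaseq itself runs (that one operates on the loaded tokenizer).

```python
import json


def vocab_has_symbol(vocab_path, symbol="</s>"):
    """Return True if `symbol` is a key in the JSON vocab at `vocab_path`.

    Assumes the file is a JSON object of token -> id, as in GPT-2 style vocab files.
    """
    with open(vocab_path, encoding="utf-8") as handle:
        vocab = json.load(handle)
    return symbol in vocab
```

If `vocab_has_symbol("gpt2-vocab.json")` returns False, the file is probably not the one from projects/OPT/assets.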

xhluca commented 2 years ago

Thank you, @Skyy93.

@stephenroller @hunterlang Would it be possible to edit #19's issue body with the correct link? This would save hours of debugging for those who go looking for gpt2-vocab.json after reading #19.

EDIT: So I read the instructions again and it was indeed mentioned:

Note that the gpt2-merges.txt and gpt2-vocab.json files in projects/OPT/assets/ will need to be moved to the corresponding directories defined in the constants.py file.

I've created a PR that adds instructions on downloading them.
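
The note quoted above amounts to a copy step, which could be scripted roughly as follows. This is a sketch under assumptions: `install_tokenizer_files` is a hypothetical helper, and the target directory is whatever your constants.py points to, so it is passed in as an argument rather than guessed here.

```python
import shutil
from pathlib import Path

# File names as given in the quoted instructions.
TOKENIZER_FILES = ("gpt2-merges.txt", "gpt2-vocab.json")


def install_tokenizer_files(assets_dir, target_dir):
    """Copy the tokenizer files from `assets_dir` into `target_dir`.

    `target_dir` is an assumption supplied by the caller: it should be the
    directory that your constants.py expects these files to live in.
    """
    assets_dir, target_dir = Path(assets_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in TOKENIZER_FILES:
        source = assets_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"missing tokenizer file: {source}")
        # copy2 also preserves file metadata such as modification times.
        copied.append(Path(shutil.copy2(source, target_dir / name)))
    return copied
```

For example, `install_tokenizer_files("projects/OPT/assets", "<directory from constants.py>")` run from the repo root would copy both files in one go.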

stephenroller commented 2 years ago

Thank you both @Skyy93 for the community help and @xhlulu for improving documentation as a response.