VHellendoorn / Code-LMs

Guide to using pre-trained large language models of source code

Missing vocab file #35

Closed · michaelpradel closed this issue 2 years ago

michaelpradel commented 2 years ago

I'm trying to generate code from a prompt, as described in the README:

./deepy.py generate.py configs/text_generation.yml configs/local_setup.yml  configs/small.yml

However, the command fails because the GPT-2 vocab file is missing:

Traceback (most recent call last):
  File "generate.py", line 104, in <module>
    main()
  File "generate.py", line 33, in main
    model, neox_args = setup_for_inference_or_eval(use_cache=True)
  File "/home/m/polycoder/gpt-neox/megatron/utils.py", line 424, in setup_for_inference_or_eval
    neox_args.build_tokenizer()
  File "/home/m/polycoder/gpt-neox/megatron/neox_arguments/arguments.py", line 121, in build_tokenizer
    self.tokenizer = build_tokenizer(self)
  File "/home/m/polycoder/gpt-neox/megatron/tokenizer/tokenizer.py", line 40, in build_tokenizer
    tokenizer = _GPT2BPETokenizer(args.vocab_file, args.merge_file)
  File "/home/m/polycoder/gpt-neox/megatron/tokenizer/tokenizer.py", line 154, in __init__
    self.tokenizer = GPT2Tokenizer(
  File "/home/m/polycoder/gpt-neox/megatron/tokenizer/gpt2_tokenization.py", line 188, in __init__
    self.encoder = json.load(open(vocab_file))
FileNotFoundError: [Errno 2] No such file or directory: 'data/gpt2-vocab.json'

I'm not using Docker but have installed your fork of gpt-neox myself.

Where can I find the missing vocab file? -- Thanks in advance!

VHellendoorn commented 2 years ago

Oops, sorry I missed this issue. You need two files, both of which can be found under Data/ in this repository: code-merges.txt and code-vocab.json.
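
For anyone hitting the same traceback, here is a minimal sketch of two possible ways to wire those files in. The target path data/gpt2-vocab.json is taken from the error above; the merge-file path and the vocab-file/merge-file config keys are assumptions based on the standard gpt-neox configs, so double-check them against your configs/local_setup.yml.

# Option A: copy the PolyCoder tokenizer files to the paths the default
# config already expects (merge-file target path assumed).
mkdir -p data
cp Data/code-vocab.json data/gpt2-vocab.json
cp Data/code-merges.txt data/gpt2-merges.txt

# Option B: skip the copy and point the tokenizer at the files directly by
# editing configs/local_setup.yml (key names assumed from stock gpt-neox):
#   "vocab-file": "Data/code-vocab.json",
#   "merge-file": "Data/code-merges.txt",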