allenai / open-instruct


Something strange with Instruct model tokenization #132

Closed: y12uc231 closed this issue 3 months ago

y12uc231 commented 3 months ago

🐛 Describe the bug

Here is the code I am running. The goal is to get the logprob for each token generated by the chat model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct")

# Example chat; the actual prompt content does not matter for the bug.
chat = [{"role": "user", "content": "What is the capital of France?"}]

prompt = tokenizer.apply_chat_template(chat, tokenize=False,
                                       add_generation_prompt=True)
inputs = tokenizer.encode(prompt, add_special_tokens=True,
                          return_tensors="pt").to(device)
output = olmo.generate(input_ids=inputs.to(olmo.device),
                       max_new_tokens=10,
                       do_sample=True,
                       top_k=50,
                       top_p=0.95,
                       return_dict_in_generate=True,
                       output_scores=True)
# Fails here with the RuntimeError shown below.
transition_scores = olmo.compute_transition_scores(
            output.sequences, output.scores, normalize_logits=True)

Here is the error when I run the code above.

Traceback (most recent call last):
  File "/n/holylabs/LABS/doshi-velez_lab/Users/skrishna/w2s/self_loop_llm/src/olma.py", line 307, in <module>
    api_loop_call(args, start_prompts, prefix_prompts[args.data_name][args.prefix],  self_correct_prompt, get_test_data(args.data_name, dataset), few_shot_prompt)
  File "/n/holylabs/LABS/doshi-velez_lab/Users/skrishna/w2s/self_loop_llm/src/olma.py", line 181, in api_loop_call
    response = get_llm_prediction_with_logits(prompt, temperature = args.temperature, large_model=args.llm)
  File "/n/holylabs/LABS/doshi-velez_lab/Users/skrishna/w2s/self_loop_llm/src/olma.py", line 88, in get_llm_prediction_with_logits
    transition_scores = olmo.compute_transition_scores(
  File "/n/home02/skrishna/.conda/envs/pt2.1.0_cuda12.1/lib/python3.10/site-packages/transformers/generation/utils.py", line 1235, in compute_transition_scores
    scores = scores.reshape(-1, self.config.vocab_size, scores.shape[-1])
RuntimeError: shape '[-1, 50280, 10]' is invalid for input of size 503040

Here is the weird part: the size of output.scores[0] should be [1, vocab_size], where for OLMo vocab_size = 50280, but the size of output.scores[0] is [1, 50304]. How come the output is not aligned with the vocab_size? Also, the values in output.scores are mostly -inf.
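For reference, this is roughly how I see the mismatch (olmo, tokenizer, and output as in the snippet above; the 50280 figure is what compute_transition_scores reads from the config in the traceback):

# Quick shape check illustrating the mismatch described above.
print(olmo.config.vocab_size)    # 50280
print(output.scores[0].shape)    # torch.Size([1, 50304])
print(len(output.scores))        # 10, one score tensor per generated token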

Versions

Python 3.10.13

hamishivi commented 3 months ago

Hi, I believe that OLMo's vocab size != its embedding size (see https://huggingface.co/allenai/OLMo-7B/blob/main/config.json). In general, this can happen when model makers want to leave extra space for new tokens, or to pad the embedding matrix out to a size that is slightly more efficient for training.
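If you just need the transition scores to line up, one possible workaround (an untested sketch, assuming the extra logit columns beyond config.vocab_size are unused padding at the end of the vocabulary dimension) is to truncate each score tensor before calling compute_transition_scores:

# Untested sketch: drop the padded logit columns so the reshape inside
# compute_transition_scores matches config.vocab_size (50280 here).
trimmed_scores = tuple(s[:, :olmo.config.vocab_size] for s in output.scores)
transition_scores = olmo.compute_transition_scores(
    output.sequences, trimmed_scores, normalize_logits=True)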

As for why the scores are mostly -inf, I'm not sure. It'd probably be a good idea to open an issue in the transformers repository or the core OLMo repository, since this repo is about instruction-tuning models rather than the specifics of generating from OLMo.