THU-BPM / Robust_Watermark

Code and data for paper "A Semantic Invariant Robust Watermark for Large Language Models" accepted by ICLR 2024.
https://arxiv.org/abs/2310.06356

Tensor size mismatch #9

Open kirudang opened 3 months ago

kirudang commented 3 months ago

Hello,

I generated the mapping for the facebook/opt-6.7b model, which has a vocab size of 50265. But when I go on to generate the watermarked output, it shows this error.

  File "/network/rit/lab/Lai_ReSecureAI/phung/evaluation/watermark.py", line 232, in __call__
    scores = self._bias_logits(scores=scores, batched_bias=batched_bias, greenlist_bias=self.watermark_base.delta)
  File "/network/rit/lab/Lai_ReSecureAI/phung/evaluation/watermark.py", line 222, in _bias_logits
    scores = scores + batched_bias_tensor*greenlist_bias
RuntimeError: The size of tensor a (50272) must match the size of tensor b (50265) at non-singleton dimension 1

Please note that the code works well with Llama 2, Mistral 7B, and Falcon 7B. However, it has issues with OPT 6.7B.

exlaw commented 3 months ago

Thank you for your attention to our work. The issue lies with facebook/opt-6.7b itself: the tokenizer's vocabulary size and the output dimension of model.lm_head differ.

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

vocab_size = len(tokenizer)
print(vocab_size) # 50265

lm_head = model.lm_head
print(lm_head.weight.shape[0]) # 50272

If we consistently use the actual dimension of model.lm_head (50272) rather than the tokenizer's vocab size, this issue should not occur.
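
For reference, a rough sketch of that idea on the watermark side (pad_bias_to_scores is a hypothetical helper; scores and batched_bias_tensor are the tensors from the traceback above) is to zero-pad the bias up to the lm_head dimension before adding it:

import torch.nn.functional as F

def pad_bias_to_scores(batched_bias_tensor, scores):
    # Zero-pad the green-list bias (built over the tokenizer vocab, 50265) so it
    # matches the lm_head dimension of the logits (50272); the extra positions are
    # unused embedding slots, so a zero bias leaves them untouched.
    extra = scores.shape[-1] - batched_bias_tensor.shape[-1]
    if extra > 0:
        batched_bias_tensor = F.pad(batched_bias_tensor, (0, extra), value=0.0)
    return batched_bias_tensor

# inside _bias_logits:
# batched_bias_tensor = pad_bias_to_scores(batched_bias_tensor, scores)
# scores = scores + batched_bias_tensor * greenlist_bias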

kirudang commented 3 months ago

Hello there,

Oh, I got it now. To make OPT work, I resized the model's vocabulary to match the tokenizer.

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Resize model embeddings to match tokenizer's vocabulary size
model.resize_token_embeddings(len(tokenizer))

# Verify the sizes match now
vocab_size = len(tokenizer)
lm_head = model.lm_head
print(vocab_size)               # 50265
print(lm_head.weight.shape[0])  # now also 50265, matching the tokenizer
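
As a quick sanity check (a minimal sketch; the prompt string is just an example), the logits dimension should now line up with the tokenizer, so the watermark bias broadcasts without the size mismatch:

import torch

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape[-1], len(tokenizer))  # both 50265 after the resize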

Thank you for your prompt reply.