Closed: SamDuffield closed this issue 1 year ago.
I'll investigate this week. Either there's something special with the llama model or with its tokenizer.
Llama models work fine. It could be because of the GPTNeoXTokenizer, which the RedPajama and MPT models share.
Hi, yes this is indeed an error related to GPTNeoXTokenizer. Findings with pythia-70m-deduped:

- Vocab size is 50304 in the original config: https://huggingface.co/EleutherAI/pythia-70m-deduped/blob/main/config.json
- Vocab size is 50254 when we run `tokenizer.vocab_size` on the AutoTokenizer
- Vocab size is 50277 when we run `len(tokenizer.get_vocab())` on the AutoTokenizer
- Logit size is 50304 when we call `self.model(...)`
For GPT-2, by contrast, the vocab size and logit size are the same: 50257.
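The three disagreeing numbers above can be reproduced with a small diagnostic. This is a sketch, not code from the repo: the function name `report_vocab_sizes` is made up here, and it assumes the `transformers` package is installed and the model files can be fetched from the Hugging Face Hub.

```python
# Hypothetical diagnostic; assumes `transformers` is installed and the
# model can be downloaded from the Hugging Face Hub.
def report_vocab_sizes(model_id: str = "EleutherAI/pythia-70m-deduped") -> dict:
    from transformers import AutoConfig, AutoTokenizer  # imported lazily

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    config = AutoConfig.from_pretrained(model_id)
    return {
        # Matches the logit size the model actually emits (50304 for pythia-70m-deduped)
        "config_vocab_size": config.vocab_size,
        # Base vocabulary only, excludes added tokens (50254)
        "tokenizer_vocab_size": tokenizer.vocab_size,
        # Base vocabulary plus added tokens (50277)
        "len_get_vocab": len(tokenizer.get_vocab()),
    }
```

For GPTNeoX models the three values differ because the model's embedding matrix is padded beyond the tokenizer's vocabulary, whereas for GPT-2 they all coincide at 50257.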
A simple fix would be to change the mask construction in `create_proposal` in `regex.py` from:

```python
mask = torch.full(
    (len(self.model.tokenizer.vocabulary),), -math.inf, device=self.device
)
```

to:

```python
mask = torch.full(
    (len(logits[0]),), -math.inf, device=self.device
)
```
I ran this with both GPT-2 and Pythia; both worked.
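The essence of the fix can be sketched without torch: size the mask from the logits row itself rather than from the tokenizer vocabulary, so the elementwise `logits + mask` always lines up. The helper name `build_mask` and the tiny toy sizes are illustrative, not from the repo.

```python
import math

def build_mask(logits_row, allowed_token_ids):
    """Sketch of the fixed mask construction: the mask length comes from
    the logits row (the fix), not from len(tokenizer.vocabulary)."""
    mask = [-math.inf] * len(logits_row)
    for tid in allowed_token_ids:
        mask[tid] = 0.0  # unmask tokens the regex proposal allows
    return mask

# Toy stand-in for one row of model logits (the real size would be 50304).
logits_row = [0.1, 0.2, 0.3, 0.4, 0.5]
mask = build_mask(logits_row, allowed_token_ids=[1, 3])
masked = [l + m for l, m in zip(logits_row, mask)]
# Only the allowed positions keep finite scores.
assert [i for i, v in enumerate(masked) if v != -math.inf] == [1, 3]
```

Because the mask is derived from the logits, the same code works whether the model's logit size matches the tokenizer's vocabulary (GPT-2) or exceeds it (GPTNeoX).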
Thank you! It would be awesome if you could open a PR. But imo we need to understand where the discrepancy comes from before switching to `len(logits[0])`.
From what I understood from the code, the mask is created with the same length as the vocabulary, which doesn't work for GPTNeoX, where the vocabulary size differs at every level. Also, at the end of the function we `return logits + mask`, and according to the error we need tensors of equal length, so the mask should match the length of the logits, not the vocabulary (if I understand it correctly).
Also, if anyone can test this change against Llama 1/2, that would be great.
```python
for pstate in self.pstates:
    mask = torch.full(
        (len(self.model.tokenizer.vocabulary),), -math.inf, device=self.device
    )
```
I'll open a PR for this small change if everyone's okay with it.
Created a PR for the fix: https://github.com/normal-computing/outlines/pull/236
I get a tensor size mismatch for the regex-guided generation here, although this might be model-specific.