dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Regex masking dimension compatibility #213

Closed SamDuffield closed 1 year ago

SamDuffield commented 1 year ago

I get a tensor size mismatch for regex-guided generation here, although this might be model-specific:

# imports assumed for the outlines version current at the time of this issue
import outlines.models as models
import outlines.text as text

model = models.transformers("togethercomputer/RedPajama-INCITE-Instruct-3B-v1")
text.generate.regex(model, r"([0-9])")('"1+1=2" Give this text a score out of 10:')

#   return logits + mask
#          ~~~~~~~^~~~~~
# RuntimeError: The size of tensor a (50432) must match the size of tensor b (50277) at non-singleton dimension 1
rlouf commented 1 year ago

I'll investigate this week. Either there's something special with the Llama model or with its tokenizer.

arunpatro commented 1 year ago

Llama models work fine. It could be because of the GPTNeoXTokenizer, which the RedPajama and MPT models share.
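
A quick way to check which tokenizer class each checkpoint resolves to is sketched below (this only downloads tokenizers, not model weights; the model names are the ones mentioned in this thread):

    # Check the tokenizer class for each checkpoint (tokenizer download only).
    from transformers import AutoTokenizer

    for name in [
        "togethercomputer/RedPajama-INCITE-Instruct-3B-v1",
        "EleutherAI/pythia-70m-deduped",
        "gpt2",
    ]:
        tok = AutoTokenizer.from_pretrained(name)
        print(name, "->", type(tok).__name__)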

xaviruvpadhiyar98 commented 1 year ago

Hi, yes, this is indeed an error related to the GPTNeoXTokenizer. Findings with pythia-70m-deduped:

- Vocab size is 50304 (mentioned in the original config: https://huggingface.co/EleutherAI/pythia-70m-deduped/blob/main/config.json)
- Vocab size is 50254 (when we run tokenizer.vocab_size from AutoTokenizer)
- Vocab size is 50277 (when we run len(tokenizer.get_vocab()) from AutoTokenizer)
- Logit size is 50304 (when we call self.model(...))

For GPT2, the vocab size and logit size are the same: 50257.
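
A minimal sketch to reproduce these numbers, assuming only transformers and torch are installed (model names as above):

    # Reproduce the size discrepancy described above.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    for name in ["EleutherAI/pythia-70m-deduped", "gpt2"]:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)

        inputs = tokenizer("1+1=", return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits

        print(name)
        print("  config vocab_size:   ", model.config.vocab_size)
        print("  tokenizer.vocab_size:", tokenizer.vocab_size)
        print("  len(get_vocab()):    ", len(tokenizer.get_vocab()))
        print("  logits last dim:     ", logits.shape[-1])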

A simple fix would be to change the mask creation in create_proposal in regex.py from

            mask = torch.full(
                (len(self.model.tokenizer.vocabulary),), -math.inf, device=self.device
            )

to

            mask = torch.full(
                (len(logits[0]),), -math.inf, device=self.device
            )

I ran this for both gpt2 and pythia, and both worked.
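
A rough verification sketch, reusing the generation call from the top of this issue (the import paths are assumptions and may differ between outlines versions):

    # Rough verification sketch; import paths assumed from the reproduction above.
    import outlines.models as models
    import outlines.text as text

    for name in ["gpt2", "EleutherAI/pythia-70m-deduped"]:
        model = models.transformers(name)
        digit = text.generate.regex(model, r"([0-9])")("1+1=")
        print(name, "->", digit)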

rlouf commented 1 year ago

Thank you! It would be awesome if you could open a PR, but we need to understand where the discrepancy comes from before using len(logits[0]), imo.

xaviruvpadhiyar98 commented 1 year ago

From what I understood from the code, the mask is created with the same length as the vocabulary, which doesn't really work for GPTNeoX, whose vocabulary size differs at every level. We can also see that at the end of the function we return logits + mask, and according to the error the tensors must have equal length, so the mask should match the length of the logits rather than the vocabulary (if I understand it correctly); a toy illustration of the mismatch is sketched after the snippet below. It would also be great if someone could test this change against Llama 1/2.

        for pstate in self.pstates:
            mask = torch.full(
                (len(self.model.tokenizer.vocabulary),), -math.inf, device=self.device
            )
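
A toy illustration of the mismatch, using the shapes reported above for pythia-70m-deduped (a standalone torch sketch, not outlines code):

    # Toy illustration of the mismatch; shapes taken from the pythia findings above.
    import math
    import torch

    logits = torch.zeros(1, 50304)           # model output, padded to the config vocab size
    mask = torch.full((50277,), -math.inf)   # mask sized from len(tokenizer.get_vocab())

    try:
        logits + mask                         # broadcasting fails: 50304 vs 50277
    except RuntimeError as e:
        print(e)

    mask = torch.full((logits.shape[-1],), -math.inf)  # sizing the mask from the logits works
    print((logits + mask).shape)              # torch.Size([1, 50304])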

I'll open a PR for this small change if everyone's okay with it.

xaviruvpadhiyar98 commented 1 year ago

Created a PR for the fix https://github.com/normal-computing/outlines/pull/236