dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Error in outlines.generate.choice: create_states_mapping throws ValueError: not enough values to unpack (expected 3, got 2) #585

Closed dnhkng closed 9 months ago

dnhkng commented 9 months ago

Describe the issue as clearly as possible:

When I try the examples on the GitHub front page, some of them fail in a fresh conda environment.

Steps/code to reproduce the bug:

import outlines

model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda")

prompt = "1+1="
answer = outlines.generate.format(model, int)(prompt)

prompt = "sqrt(2)="

generator = outlines.generate.format(model, float)
answer = generator(prompt)

# answer is '2', a string, not a float!
# even worse:

model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda")

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)

Expected result:

Either "Positive" or "Negative"

Error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 10
      1 model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda")
      4 prompt = """You are a sentiment-labelling assistant.
      5 Is the following review positive or negative?
      6 
      7 Review: This restaurant is just awesome!
      8 """
---> 10 generator = outlines.generate.choice(model, ["Positive", "Negative"])
     11 answer = generator(prompt)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/generate/api.py:412, in choice(model, choices, max_tokens, sampler)
    405 def choice(
    406     model,
    407     choices: List[str],
    408     max_tokens: Optional[int] = None,
    409     sampler: Sampler = multinomial,
    410 ):
    411     regex_str = r"(" + r"|".join(choices) + r")"
--> 412     return regex(model, regex_str, max_tokens, sampler)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/generate/api.py:373, in regex(model, regex_str, max_tokens, sampler)
    367 def regex(
    368     model,
    369     regex_str: str,
    370     max_tokens: Optional[int] = None,
    371     sampler: Sampler = multinomial,
    372 ):
--> 373     fsm = RegexFSM(regex_str, model.tokenizer)
    375     device = model.device
    376     generator = SequenceGenerator(fsm, model, sampler, device, max_tokens=max_tokens)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/fsm/fsm.py:136, in RegexFSM.__init__(self, regex_string, tokenizer)
    131     final_states = regex_fsm.finals | {
    132         -1
    133     }  # Include the EOS token in final states
    134     return states_to_token_maps, empty_token_ids, final_states
--> 136 (
    137     self.states_to_token_maps,
    138     self.empty_token_ids,
    139     self.final_states,
    140 ) = create_states_mapping(
    141     regex_string, tuple(sorted(tokenizer.vocabulary.items()))
    142 )
    143 self.num_tokens_generated = 0
    144 self.vocabulary = tokenizer.vocabulary.values()

ValueError: not enough values to unpack (expected 3, got 2)
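
The `choice` frame in the traceback above shows that, in this version, the list of choices is joined into one alternation regex before being passed to `regex`. A minimal sketch of the same construction follows, using only Python's standard `re` module. The helper name `choices_to_regex` is hypothetical, and the `re.escape` call is an added precaution that the code in the traceback does not perform.

```python
import re
from typing import List


def choices_to_regex(choices: List[str]) -> str:
    """Join choices into one alternation group, e.g. (Positive|Negative)."""
    # re.escape is an addition in this sketch: it keeps characters such as
    # "+" or "." inside a choice from being read as regex operators.
    return "(" + "|".join(re.escape(c) for c in choices) + ")"


pattern = choices_to_regex(["Positive", "Negative"])

# Only the exact choices match the full string; anything else is rejected.
assert re.fullmatch(pattern, "Positive") is not None
assert re.fullmatch(pattern, "Neutral") is None
```

The regex itself is trivial, which suggests the failure lies in the step that turns it into a state mapping rather than in the pattern.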

Outlines/Python version information:

Version information

``` (command output here) ```

Context for the issue:

This is a very weird bug! Very occasionally, when I run the "outlines.generate.format" code, the "outlines.generate.choice" call runs too. But 99% of the time, I get this error.

I did some digging, and added some debug code:

        x = create_states_mapping(
            regex_string, tuple(sorted(tokenizer.vocabulary.items()))
        )
        print("Output tuple:")
        for item in x:
            print(f'{item=}')
        self.states_to_token_maps, self.empty_token_ids, self.final_states = x

When I run the working code, I see:

Output tuple:
item={0: {59: 3, 52: 3, 29945: 3, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 29953: 3, 48: 1, 46: 1, 29974: 1, 29899: 1, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 51: 2, 57: 3, 29941: 3, 29900: 2}, 1: {59: 3, 52: 3, 29945: 3, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 29953: 3, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 51: 2, 57: 3, 29941: 3, 29900: 2}, 2: {29872: 5, 72: 5, 29889: 4, 2: 2, 104: 5, 49: 4, 29923: 5}, 3: {59: 3, 29889: 4, 52: 3, 29945: 3, 49: 4, 72: 5, 104: 5, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 51: 3, 29953: 3, 29923: 5, 29872: 5, 29900: 3, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 2: 3, 57: 3, 29941: 3}, 4: {29955: 8, 55: 8, 29946: 8, 29906: 8, 57: 8, 29941: 8, 59: 8, 52: 8, 29945: 8, 54: 8, 29947: 8, 58: 8, 56: 8, 60: 8, 29896: 8, 29929: 8, 51: 8, 29953: 8, 53: 8, 29900: 8}, 5: {29899: 6, 48: 6, 29974: 6, 46: 6}, 6: {54: 7, 29947: 7, 56: 7, 29896: 7, 58: 7, 29929: 7, 60: 7, 29953: 7, 51: 7, 29900: 7, 29955: 7, 53: 7, 29946: 7, 55: 7, 29906: 7, 57: 7, 29941: 7, 59: 7, 52: 7, 29945: 7}, 7: {54: 7, 29947: 7, 56: 7, 29896: 7, 58: 7, 29929: 7, 60: 7, 29953: 7, 51: 7, 29900: 7, 29955: 7, 53: 7, 29946: 7, 55: 7, 29906: 7, 2: 7, 57: 7, 29941: 7, 59: 7, 52: 7, 29945: 7}, 8: {29955: 8, 55: 8, 29946: 8, 29906: 8, 2: 8, 57: 8, 29941: 8, 72: 5, 59: 8, 29923: 5, 52: 8, 29945: 8, 29872: 5, 54: 8, 29947: 8, 58: 8, 56: 8, 60: 8, 29896: 8, 29929: 8, 51: 8, 29953: 8, 104: 5, 53: 8, 29900: 8}}
item=set()
item=frozenset({2, 3, 7, 8, -1})
'1'

But the buggy code produces:

Output tuple:
item={0: {29925: 2, 81: 1, 29940: 1, 9135: 4, 83: 2, 8139: 10, 9837: 3}, 1: {387: 11, 2442: 5, 29872: 10, 104: 10}, 2: {8156: 5, 114: 3, 29877: 3, 359: 4}, 3: {29879: 4, 1039: 5, 118: 4}, 4: {29875: 5, 4812: 7, 3321: 9, 108: 5, 277: 6}, 5: {2034: 7, 119: 6, 29873: 6}, 6: {440: 8, 573: 9, 29875: 7, 108: 7}, 7: {29894: 8, 345: 9, 121: 8}, 8: {29872: 9, 104: 9}, 10: {106: 11, 3249: 5, 29887: 11, 28818: 6}, 11: {1230: 9, 271: 6, 2219: 7, 1926: 8, 29874: 5, 100: 5}}
item=set()
.
.
.
File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/fsm/fsm.py:149, in RegexFSM.__init__(self, regex_string, tokenizer)
    147 for item in x:
    148     print(f'{item=}')
--> 149 self.states_to_token_maps, self.empty_token_ids, self.final_states = x
    150 self.num_tokens_generated = 0
    151 self.vocabulary = tokenizer.vocabulary.values()

ValueError: not enough values to unpack (expected 3, got 2)

So the function "create_states_mapping" is not returning the frozenset, and the tuple has only 2 of the 3 items to unpack!

lapp0 commented 9 months ago

Could you please include your version info in the version section? There was a recent change which may have fixed this.

python -c "from outlines import _version; print(_version.version)"
python -c "import sys; print('Python', sys.version)"
pip freeze

There's a good chance that upgrading to the latest (unreleased) 0.0.25 would fix this:

pip install outlines git+https://github.com/outlines-dev/outlines

dnhkng commented 9 months ago

I was on 0.0.24. I can confirm that 0.0.25 fixes the issue with "outlines.generate.choice"...

But now "outlines.generate.format" throws the same kind of error!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 3
      1 prompt = "sqrt(2)="
----> 3 generator = outlines.generate.format(model, float)
      4 answer = generator(prompt)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/generate/api.py:396, in format(model, python_type, max_tokens, sampler)
    392 def format(
    393     model, python_type, max_tokens: Optional[int] = None, sampler: Sampler = multinomial
    394 ):
    395     regex_str = python_types_to_regex(python_type)
--> 396     return regex(model, regex_str, max_tokens, sampler)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/generate/api.py:370, in regex(model, regex_str, max_tokens, sampler)
    364 def regex(
    365     model,
    366     regex_str: str,
    367     max_tokens: Optional[int] = None,
    368     sampler: Sampler = multinomial,
    369 ):
--> 370     fsm = RegexFSM(regex_str, model.tokenizer)
    372     device = model.device
    373     generator = SequenceGenerator(fsm, model, sampler, device, max_tokens=max_tokens)

File ~/miniforge3/envs/frankenmerge/lib/python3.10/site-packages/outlines/fsm/fsm.py:120, in RegexFSM.__init__(self, regex_string, tokenizer)
    114         raise ValueError(
    115             "The vocabulary does not allow us to build a sequence that matches the input regex"
    116         )
    118     return states_to_token_maps, empty_token_ids
--> 120 self.states_to_token_maps, self.empty_token_ids = create_states_mapping(
    121     regex_string, tuple(sorted(tokenizer.vocabulary.items()))
    122 )
    123 self.vocabulary = tokenizer.vocabulary.values()
    124 self.eos_token_id = tokenizer.eos_token_id

ValueError: too many values to unpack (expected 2)

This is very weird, as I see that "create_states_mapping" should return only two objects: states_to_token_maps and empty_token_ids. But when I print what is returned, I see its 3 objects: ({0: {59: 3, 52: 3, 29945: 3, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 29953: 3, 48: 1, 46: 1, 29974: 1, 29899: 1, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 51: 2, 57: 3, 29941: 3, 29900: 2}, 1: {59: 3, 52: 3, 29945: 3, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 29953: 3, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 51: 2, 57: 3, 29941: 3, 29900: 2}, 2: {29872: 5, 72: 5, 29889: 4, 2: 2, 104: 5, 49: 4, 29923: 5}, 3: {59: 3, 29889: 4, 52: 3, 29945: 3, 49: 4, 72: 5, 104: 5, 54: 3, 29947: 3, 56: 3, 60: 3, 29896: 3, 58: 3, 29929: 3, 51: 3, 29953: 3, 29923: 5, 29872: 5, 29900: 3, 29955: 3, 55: 3, 53: 3, 29946: 3, 29906: 3, 2: 3, 57: 3, 29941: 3}, 4: {29955: 8, 55: 8, 29946: 8, 29906: 8, 57: 8, 29941: 8, 59: 8, 52: 8, 29945: 8, 54: 8, 29947: 8, 58: 8, 56: 8, 60: 8, 29896: 8, 29929: 8, 51: 8, 29953: 8, 53: 8, 29900: 8}, 5: {29899: 6, 48: 6, 29974: 6, 46: 6}, 6: {54: 7, 29947: 7, 56: 7, 29896: 7, 58: 7, 29929: 7, 60: 7, 29953: 7, 51: 7, 29900: 7, 29955: 7, 53: 7, 29946: 7, 55: 7, 29906: 7, 57: 7, 29941: 7, 59: 7, 52: 7, 29945: 7}, 7: {54: 7, 29947: 7, 56: 7, 29896: 7, 58: 7, 29929: 7, 60: 7, 29953: 7, 51: 7, 29900: 7, 29955: 7, 53: 7, 29946: 7, 55: 7, 29906: 7, 2: 7, 57: 7, 29941: 7, 59: 7, 52: 7, 29945: 7}, 8: {29955: 8, 55: 8, 29946: 8, 29906: 8, 2: 8, 57: 8, 29941: 8, 72: 5, 59: 8, 29923: 5, 52: 8, 29945: 8, 29872: 5, 54: 8, 29947: 8, 58: 8, 56: 8, 60: 8, 29896: 8, 29929: 8, 51: 8, 29953: 8, 104: 5, 53: 8, 29900: 8}}, set(), frozenset({2, 3, 7, 8, -1}))

Running the outlines.generate.choice method returns 2 objects correctly, the dictionary, and the set.

dnhkng commented 9 months ago

Maybe found a quick fix: commenting out the @cache seems to fix this!

i.e.

class RegexFSM(FSM):
    """FSM to generate text that is in the language of a regular expression."""

    def __init__(self, regex_string: str, tokenizer: "Tokenizer"):
        # @cache()
        def create_states_mapping(

I'm not sure how this affects performance, but here is a quick benchmark:

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])

for i in range(100):
    answer = generator(prompt)

With cache commented out: 4.38 s ± 429 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With cache: 4.06 s ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
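
In the benchmark above, the generator is built once, before the loop. The tracebacks earlier in this thread show `create_states_mapping` being called from `RegexFSM.__init__`, that is, while the generator is constructed, so the loop probably measures generation rather than the cached step. A rough sketch of timing the two phases separately follows. Every function here is a hypothetical stand-in rather than Outlines code, and `functools.lru_cache` is only an in-memory substitute for the library's persistent cache.

```python
import timeit
from functools import lru_cache

build_calls = {"n": 0}  # counts how often the expensive step really runs


@lru_cache(maxsize=None)  # in-memory stand-in for a persistent cache
def build_index(pattern: str) -> tuple:
    """Hypothetical stand-in for the expensive index construction."""
    build_calls["n"] += 1
    return tuple(sorted(set(pattern)))


def make_generator(pattern: str):
    """Hypothetical stand-in for building a generator from a pattern."""
    index = build_index(pattern)  # the cache only matters on this line

    def generate(prompt: str) -> str:
        return prompt + index[0]

    return generate


pattern = "(Positive|Negative)"

# Phase 1: construction, where a cache hit or miss makes a difference.
construct_s = timeit.timeit(lambda: make_generator(pattern), number=100)

# Phase 2: repeated calls on an already-built generator.
generator = make_generator(pattern)
calls_s = timeit.timeit(lambda: generator("Review: "), number=100)

print(f"construct: {construct_s:.6f} s, calls: {calls_s:.6f} s")
print(f"index built {build_calls['n']} time(s)")
```

Measured this way, removing the cache would be expected to show up in the construction figure, not in the per-call figure.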

lapp0 commented 9 months ago

https://github.com/outlines-dev/outlines/pull/566 should have fixed this. It invalidates the cache if the version is upgraded.
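
A minimal sketch of what version-based invalidation could look like, assuming a stamp file stored next to the cached entries. This is not the code from the linked pull request; `clear_cache_if_version_changed` and the `version.txt` stamp are hypothetical names used only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache_if_version_changed(cache_dir: Path, current_version: str) -> bool:
    """Wipe cache_dir when the stored version stamp differs; return True if wiped."""
    stamp = cache_dir / "version.txt"
    stored = stamp.read_text() if stamp.exists() else None
    if stored == current_version:
        return False  # same version string: cached entries are kept
    shutil.rmtree(cache_dir, ignore_errors=True)
    cache_dir.mkdir(parents=True, exist_ok=True)
    stamp.write_text(current_version)
    return True


cache_dir = Path(tempfile.mkdtemp()) / "demo-cache"

first = clear_cache_if_version_changed(cache_dir, "0.0.24")  # no stamp yet: wiped
(cache_dir / "entry.pkl").write_bytes(b"cached result")

same = clear_cache_if_version_changed(cache_dir, "0.0.24")  # unchanged: kept
kept = (cache_dir / "entry.pkl").exists()

upgraded = clear_cache_if_version_changed(cache_dir, "0.0.25")  # changed: wiped
gone = not (cache_dir / "entry.pkl").exists()
```

A check like this is only as reliable as the version string it compares: if two different builds report the same version, nothing is cleared.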

Can you confirm which version you are running via `from outlines import _version; print(_version.version)`? (This command includes the git revision, which is helpful for me.)

dnhkng commented 9 months ago

I'm on: 0.0.25.dev15+g0cd9608
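
The version string above carries more than a release number. Assuming it follows the setuptools_scm development-version layout (`<next release>.dev<commits since last tag>+g<abbreviated git hash>`), which is an assumption about how this package's version is generated, it can be split apart as sketched below. `parse_dev_version` and `DevVersion` are hypothetical helpers.

```python
import re
from typing import NamedTuple, Optional


class DevVersion(NamedTuple):
    release: str
    commits_since_tag: int
    git_revision: str


# Assumed layout: <release>.dev<distance>+g<hex revision>
_DEV_RE = re.compile(
    r"^(?P<release>\d+(?:\.\d+)*)\.dev(?P<distance>\d+)\+g(?P<rev>[0-9a-f]+)"
)


def parse_dev_version(version: str) -> Optional[DevVersion]:
    """Return the parts of a development version, or None for a plain release."""
    match = _DEV_RE.match(version)
    if match is None:
        return None
    return DevVersion(match["release"], int(match["distance"]), match["rev"])


info = parse_dev_version("0.0.25.dev15+g0cd9608")
plain = parse_dev_version("0.0.24")  # no dev segment, so this is None
```

Under that assumption, the string reads as a build 15 commits past the last tag, at git revision `0cd9608`.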

lapp0 commented 9 months ago

I found the source of the issue. The Outlines cache is cleared if there's a version upgrade; however, installing from git via pip doesn't seem to set the version in the same way that `pip install .` from the repo directory does.

root@C.8986380:~$ pip install outlines git+https://github.com/outlines-dev/outlines -q
root@C.8986380:~$ python3 -c "from outlines._version import __version__ as outlines_version; print(outlines_version)"
0.0.24

We need to ensure the version in `from outlines._version import __version__` is distinct even if installed from pip. Thanks for helping us discover this!

@dnhkng as a temporary fix I recommend running `rm -rf ~/.cache/outlines`
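
The failure mode described in this thread can be reproduced in miniature. The sketch below assumes a simple on-disk memoizer keyed on the function name and arguments. `disk_cache`, `make_old`, and `make_new` are hypothetical, and the two `create_states_mapping` bodies are stand-ins that only mimic the two return shapes seen in the tracebacks (three values versus two).

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def disk_cache(cache_dir: Path, version: str = ""):
    """Hypothetical on-disk memoizer; `version` is mixed into the cache key."""
    def decorator(fn):
        def wrapper(*args):
            key = repr((fn.__name__, version, args)).encode()
            path = cache_dir / (hashlib.sha256(key).hexdigest() + ".pkl")
            if path.exists():
                # The entry may have been written by a different release.
                return pickle.loads(path.read_bytes())
            result = fn(*args)
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


def make_old(cache_dir: Path, version: str = ""):
    @disk_cache(cache_dir, version)
    def create_states_mapping(regex_string):
        return {"states": regex_string}, set(), frozenset({-1})  # 3 values
    return create_states_mapping


def make_new(cache_dir: Path, version: str = ""):
    @disk_cache(cache_dir, version)
    def create_states_mapping(regex_string):
        return {"states": regex_string}, set()  # 2 values
    return create_states_mapping


# Both "releases" report the same version: the new code reads the old entry.
shared = Path(tempfile.mkdtemp())
make_old(shared)("(Positive|Negative)")
try:
    states, empty = make_new(shared)("(Positive|Negative)")
    stale_error = None
except ValueError as exc:
    stale_error = str(exc)

# Distinct version strings give distinct keys, so nothing stale is read.
keyed = Path(tempfile.mkdtemp())
make_old(keyed, "0.0.24")("(Positive|Negative)")
states, empty = make_new(keyed, "0.0.25")("(Positive|Negative)")
```

With a shared key the unpacking fails much as it does in the tracebacks above. With the version in the key, each release only ever reads entries it wrote itself.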

lapp0 commented 9 months ago

Best route forward IMO:

lapp0 commented 9 months ago

Works in my environment. @dnhkng could you please confirm your reproduction code no longer fails in your conda environment if you run

rm -rf ~/.cache/outlines
pip install outlines==0.0.24
python3 your_script_in_original_post.py
pip install outlines==0.0.25
python3 your_script_in_original_post.py

dnhkng commented 9 months ago

Looks ok now!

I'm still getting strings instead of floats, but I've raised a separate issue for that.