dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Generation doesn't follow JSON schema #1176

Open abhishekkrthakur opened 2 months ago

abhishekkrthakur commented 2 months ago

Describe the issue as clearly as possible:

In [1]: import outlines
In [2]: schema = '{"type": "object", "value": {"properties": {"data": {"type": "array", "maxItems": 10, "minItems": 1, "items": {"type": "array", "properties": {"text": {"type": "string"}
   ...: , "target": {"type": "string"}}, "required": ["text", "target"]}}}, "required": ["data"]}}'

In [3]: model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.31s/it]

In [4]: generator = outlines.generate.json(model, schema)
Compiling FSM index for all state transitions: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 243/243 [00:01<00:00, 138.00it/s]

In [5]: msg = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are an AI bot that generates data for t
   ...: ext classification tasks.\nYou do not repeat the question asked by user. You do not generate code.\nOnly thing you generate is text data in the specified format.\nThe user provide
   ...: s a problem statement and you generate the data.\nFor text classification task, the user provides different classes.\nIf the user has not provided the classes, generate the classe
   ...: s as well but limit the number of classes to 10.\n\nThe dataset for text classification is in JSON format.\nEach line should be a JSON object with the following keys: text and tar
   ...: get.\nMake sure each text sample has atleast 10 words.\nThe target must always be a string.\nDon't write what you are doing. Just generate the data.\nEach line of the output consi
   ...: sts of a dictionary with two keys: text and target and nothing else.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI want to train a model for short text sentiment classifi
   ...: cation<|eot_id|>"

In [6]: generator(msg)
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
You are not running the flash-attention implementation, expect numerical differences.
Out[6]:
{'text': 'Awesome service! I am delighted with my purchase.',
 'target': 'positive'}
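
As a possible aid to triage, the schema string from the session can be inspected with the standard library alone. The sketch below only parses the string and lists where each keyword sits. It shows that `properties` and `required` are nested under a `value` key, which is not a standard JSON Schema keyword, and that `items` is declared with `"type": "array"` while carrying `properties`. Whether this nesting explains the output above is an assumption, not something confirmed in this report, and how Outlines treats the unknown `value` key is not verified here. The `intended` dictionary is a hypothetical restructuring for comparison, not the author's schema.

```python
import copy
import json

# Schema string copied verbatim from the session above.
schema = (
    '{"type": "object", "value": {"properties": {"data": {"type": "array", '
    '"maxItems": 10, "minItems": 1, "items": {"type": "array", "properties": '
    '{"text": {"type": "string"}, "target": {"type": "string"}}, '
    '"required": ["text", "target"]}}}, "required": ["data"]}}'
)
parsed = json.loads(schema)

# Keywords at the top level of the schema versus those nested under "value".
top_level_keys = sorted(parsed)
nested_keys = sorted(parsed["value"])
print("top level:", top_level_keys)
print("under 'value':", nested_keys)

# Declared type of each element of "data" in the original schema.
item_type = parsed["value"]["properties"]["data"]["items"]["type"]
print("items type:", item_type)

# Hypothetical restructuring (an assumption about the intent): hoist the
# nested keywords to the top level and declare each item as an object.
intended = {"type": "object", **copy.deepcopy(parsed["value"])}
intended["properties"]["data"]["items"]["type"] = "object"
print(json.dumps(intended, indent=2))
```

If the hoisted form matches what was meant, passing `json.dumps(intended)` to `outlines.generate.json` in place of the original string would be one way to check whether the generated output then contains the `data` array.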

Steps/code to reproduce the bug:

above

Expected result:

above

Error message:

No response

Outlines/Python version information:

Version information

``` (command output here) ```

Context for the issue:

No response