dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Recursive JSON Schemas #330

Open sahewat opened 1 year ago

sahewat commented 1 year ago

Recursive Pydantic definitions seem unsupported for lists, unions, and optionals. My understanding is these are the basic use cases.

A reproducible example is provided below:

import json

from typing import List, Optional, Union
from pydantic import BaseModel

from outlines.text.json_schema import build_regex_from_schema

class TaskOptional(BaseModel):
    subtask: Optional['TaskOptional']

class TaskWrapperOptional(BaseModel):
    task: TaskOptional

class TaskUnion(BaseModel):
    subtask: Union['TaskUnion', None]

class TaskWrapperUnion(BaseModel):
    task: TaskUnion

class TaskList(BaseModel):
    subtask: List['TaskList']

class TaskWrapperList(BaseModel):
    task: TaskList

TaskWrapperOptional.model_rebuild()
TaskWrapperUnion.model_rebuild()
TaskWrapperList.model_rebuild()

regex = build_regex_from_schema(json.dumps(TaskWrapperOptional.model_json_schema()))
# NotImplementedError: 
# @ outlines/text/json_schema.py:308 

regex = build_regex_from_schema(json.dumps(TaskWrapperUnion.model_json_schema()))
# NotImplementedError: 
# @ outlines/text/json_schema.py:308 

regex = build_regex_from_schema(json.dumps(TaskWrapperList.model_json_schema()))
# RecursionError: maximum recursion depth exceeded while calling a Python object
# @ outlines/text/json_schema.py:174 

I'd be interested in adding this functionality but I'm unsure as to what an "unrolled" recursive definition would look like in terms of the generated regex.
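To make the question concrete, here is one possible reading of "unrolled": expand the recursive reference a fixed number of times and end the innermost level with null. This is only a minimal sketch for the TaskUnion shape above, using the standard library. The names unroll_task_union and nested are hypothetical helpers, not part of outlines, and the whitespace rule is an assumption.

```python
import json
import re

# Assumption: optional whitespace is allowed between tokens.
WS = r"[ \n\t]*"


def unroll_task_union(max_depth: int) -> str:
    """Regex for {"subtask": <TaskUnion> | null}, with at most max_depth nested levels."""
    # Innermost level: the recursion has to terminate with null.
    pattern = r"\{" + WS + r'"subtask"' + WS + ":" + WS + "null" + WS + r"\}"
    for _ in range(max_depth):
        # Each pass wraps the previous pattern in one more level of nesting.
        pattern = (
            r"\{" + WS + r'"subtask"' + WS + ":" + WS
            + "(?:null|" + pattern + ")" + WS + r"\}"
        )
    return pattern


def nested(depth: int) -> str:
    """Serialize a TaskUnion-shaped value with `depth` levels of nesting below the root."""
    value = None
    for _ in range(depth + 1):
        value = {"subtask": value}
    return json.dumps(value)


rx = re.compile(unroll_task_union(2))
assert rx.fullmatch(nested(2)) is not None  # within the unrolled depth
assert rx.fullmatch(nested(3)) is None      # one level too deep is rejected
```

The pattern grows by one copy per level here because the model refers to itself once. A schema that embeds itself several times per level would need several copies of the inner pattern at each step, so the regex could grow much faster, and any fixed depth still rejects valid deeper documents.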

rlouf commented 1 year ago

Thank you for opening an issue! It looks like something we should be able to support. Do you mind pasting the JSON schema for these 3 models here?

sahewat commented 1 year ago

In all cases, the schema is available through TaskWrapperxxx.model_json_schema(). I'll post an example of each here as well.

TaskWrapperOptional

{
    "$defs": {
        "Task": {
            "properties": {
                "name": {
                    "title": "Name",
                    "type": "string"
                },
                "subtasks": {
                    "default": [],
                    "items": {
                        "$ref": "#/$defs/Task"
                    },
                    "title": "Subtasks",
                    "type": "array"
                }
            },
            "required": [
                "name"
            ],
            "title": "Task",
            "type": "object"
        },
        "TaskOptional": {
            "properties": {
                "subtask": {
                    "anyOf": [
                        {
                            "$ref": "#/$defs/Task"
                        },
                        {
                            "type": "null"
                        }
                    ]
                }
            },
            "required": [
                "subtask"
            ],
            "title": "TaskOptional",
            "type": "object"
        }
    },
    "properties": {
        "task": {
            "$ref": "#/$defs/TaskOptional"
        }
    },
    "required": [
        "task"
    ],
    "title": "TaskWrapperOptional",
    "type": "object"
}

TaskWrapperUnion

{
    "$defs": {
        "TaskUnion": {
            "properties": {
                "subtask": {
                    "anyOf": [
                        {
                            "$ref": "#/$defs/TaskUnion"
                        },
                        {
                            "type": "null"
                        }
                    ]
                }
            },
            "required": [
                "subtask"
            ],
            "title": "TaskUnion",
            "type": "object"
        }
    },
    "properties": {
        "task": {
            "$ref": "#/$defs/TaskUnion"
        }
    },
    "required": [
        "task"
    ],
    "title": "TaskWrapperUnion",
    "type": "object"
}

TaskWrapperList

{
    "$defs": {
        "TaskList": {
            "properties": {
                "subtask": {
                    "items": {
                        "$ref": "#/$defs/TaskList"
                    },
                    "title": "Subtask",
                    "type": "array"
                }
            },
            "required": [
                "subtask"
            ],
            "title": "TaskList",
            "type": "object"
        }
    },
    "properties": {
        "task": {
            "$ref": "#/$defs/TaskList"
        }
    },
    "required": [
        "task"
    ],
    "title": "TaskWrapperList",
    "type": "object"
}

brandonwillard commented 6 months ago

Anyone who is interested in this feature should know that CFG-structured generation is required to truly support it.
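As a rough illustration of why a grammar helps: a context-free rule may refer to itself, so nesting depth is unbounded, whereas a regular expression has to fix the depth in advance. The sketch below is a hand-written recursive-descent recognizer for the TaskList shape from this issue. It is not how outlines implements CFG-guided generation, and the function names are hypothetical.

```python
# Grammar being recognized (EBNF-style, whitespace skipped between tokens):
#   task  := '{' '"subtask"' ':' array '}'
#   array := '[' [ task { ',' task } ] ']'


def _skip_ws(text: str, pos: int) -> int:
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def _peek(text: str, pos: int) -> str:
    pos = _skip_ws(text, pos)
    return text[pos] if pos < len(text) else ""


def _expect(text: str, pos: int, token: str) -> int:
    pos = _skip_ws(text, pos)
    if not text.startswith(token, pos):
        raise ValueError(f"expected {token!r} at index {pos}")
    return pos + len(token)


def parse_task(text: str, pos: int = 0) -> int:
    """Return the index just past one task object starting at pos, or raise ValueError."""
    pos = _expect(text, pos, "{")
    pos = _expect(text, pos, '"subtask"')
    pos = _expect(text, pos, ":")
    pos = _expect(text, pos, "[")
    if _peek(text, pos) != "]":
        pos = parse_task(text, pos)  # the recursion mirrors the self-referencing $ref
        while _peek(text, pos) == ",":
            pos = parse_task(text, _expect(text, pos, ","))
    pos = _expect(text, pos, "]")
    return _expect(text, pos, "}")


def is_task_list(text: str) -> bool:
    """True if the whole string is exactly one task object."""
    try:
        end = parse_task(text)
    except ValueError:
        return False
    return _skip_ws(text, end) == len(text)


assert is_task_list('{"subtask": [{"subtask": []}, {"subtask": []}]}')
assert not is_task_list('{"subtask": [{"subtask": []}')
```

The recognizer accepts any nesting depth (up to Python's own recursion limit) with a fixed amount of code, which is the property a finite regex cannot provide.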

hugocool commented 1 month ago

I saw there is a pull request that implements a beta version of CFG-guided generation, which is amazing. But the PR failed; is it everything that's necessary to make this functionality available? It failed due to a regression in a performance benchmark, I believe in a measurement that didn't really exist before, so is there anything I can do to test or help get this recursive-field functionality over the line? I really need this recursive functionality and am considering switching to a different structured generation library (https://github.com/noamgat/lm-format-enforcer); however, that one seems much less mature, and I do intend to use this for production use cases.

Anyway, if there is anything I can do to help, please let me know.

lapp0 commented 1 month ago

@hugocool To have stable CFG-based JSON generation we need

There may be other paths forward, but this is the approach immediately obvious to me.

This isn't an area of focus of mine at the moment, but if you're interested in tackling either issue, please let me know what questions you have!

hugocool commented 1 month ago

Okay, I am willing to pick this up, so I have some questions. What is the current state of the Lark grammar generation? Which elements of the PR (https://github.com/lapp0/outlines/pull/85) can I build upon, or of any other attempt for that matter? Are you aware of lm-format-enforcer and their approach to solving this problem? Which of their learnings should we incorporate?

I think I would start by generating a Lark grammar for my specific use case, which is a specific recursive JSON model. If that works, we can see how to generalize it so it works for any arbitrary JSON schema. I am assuming I should build off of these examples:

Are there any more resources I should be aware of?

Lastly, I am assuming that the second issue you mentioned would come into play once we want to generalize the solution to JSON more broadly, right?