Hello! Thanks so much for trying out the library and leaving feedback!
I'm planning on handling this use case via a dedicated enum type (called `Selection` in the code, as it mirrors HTML input forms).
The node is here: https://github.com/eyurtsev/kor/blob/main/kor/nodes.py#L161, but support for it might not be plumbed through the entire code at the moment (not sure). I'll update this issue as soon as I release support.
The first pass will only update the information that gets output in the prompt to help guide the LLM as to which values are seen as valid.
There's going to be a separate (and larger) effort to hook up standard validation frameworks, so it is easy to specify what constitutes a valid extraction.
Let me know if you have any thoughts / concerns.
Also curious: do you extract content from HTML, PDFs, or raw text? And what is the length of your typical content?
> The node is here: https://github.com/eyurtsev/kor/blob/main/kor/nodes.py#L161, but support for it might not be plumbed through the entire code at the moment (not sure). I'll update this issue as soon as I release support.
Awesome, thanks for sharing! I can confirm that after using `Selection` and `Option`, I'm now (so far) only getting the aspects that I've defined. Here's the code I used to define the schema.
```python
from kor.nodes import Object, Option, Selection

aspect_options = [
    Option(
        id="flavor",
        description="Flavor",
        examples=[
            "<EXAMPLES REDACTED>"
        ],
    ),
    Option(
        id="texture",
        description="Texture",
        examples=[
            "<EXAMPLES REDACTED>"
        ],
    ),
]

schema = Object(
    id="review_aspect",
    description="Extracts aspects from a review.",
    attributes=[
        Selection(
            id="aspect",
            description="Aspects mentioned in the review",
            options=aspect_options,
            many=True,
        )
    ],
)
```
I couldn't find information for different node types in the documentation. Is this something I could add?
> The first pass will only update the information that gets output in the prompt to help guide the LLM as to which values are seen as valid.
> There's going to be a separate (and larger) effort to hook up standard validation frameworks, so it is easy to specify what constitutes a valid extraction.
I'm curious, could you tell me more about that? Is it that the schema only implies that the included options are the valid ones, and that in the future you plan to add more explicit instructions to the prompt itself about what counts as a valid extraction?
> Also curious: do you extract content from HTML, PDFs, or raw text? And what is the length of your typical content?
I'm primarily focused on raw text. The length varies: sometimes it's only a few words, and other times it's a paragraph of roughly 500 words. The text is very similar to reviews on Amazon items.
I may find myself working with HTML in the future, but it would only be the text contained within an HTML span, not the actual HTML itself.
I don't plan on working with any PDFs at this time.
> I couldn't find information for different node types in the documentation. Is this something I could add?
Yes, please do!
> I'm curious, could you tell me more about that? Is it that the schema only implies that the included options are the valid ones, and that in the future you plan to add more explicit instructions to the prompt itself about what counts as a valid extraction?
Currently the schema is designed to support two aspects of prompt generation: (1) input/output examples, and (2) the schema in the instruction.
Input/output examples:
The schema is designed to support a convenient way to specify extraction examples on individual fields. During prompt generation, the schema is traversed (across any level of nesting) to aggregate the examples and produce an appropriate prompt. I don't know how well providing examples on individual fields works in terms of extraction quality (though I'm betting that quality will improve rapidly with newer LLMs).
Schema in the instruction:
The schema is scanned to generate a type definition (e.g., in TypeScript). I will likely add other type definitions if they help with extraction (e.g., for tabular extraction, generating a Postgres-style schema might help the model figure out what's required).
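As a quick way to see what this produces, you can print the generated prompt (a sketch assuming the current `create_extraction_chain` API and a LangChain chat model; exact names may change between versions):

```python
from langchain.chat_models import ChatOpenAI
from kor import create_extraction_chain

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = create_extraction_chain(llm, schema)  # `schema` as defined in the snippet above

# Print the full prompt, including the TypeScript-style type definition
# generated from the schema, to see exactly what the LLM is told is valid.
print(chain.prompt.format_prompt(text="[user input]").to_string())
```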
Both of these aspects of prompt generation only control the inputs into the LLM, but they don't help validate the outputs. As the LLMs become better, I'm betting they'll start understanding the schema better and making fewer mistakes.
Regardless, in the meantime validation needs to happen on the output of the LLM. I haven't implemented anything yet, so at the moment it's up to users of the library to do so. Roughly, what I'm thinking is to have folks define schemas using their favorite Python libraries (e.g., pydantic or marshmallow), with a bit of utility code that maps from something like pydantic to kor's internal representation of objects. This also means that the LLM output can be easily validated using either pydantic or marshmallow.
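To make that concrete, here's a minimal sketch of the kind of output validation I have in mind, using plain pydantic; the model and the raw LLM output below are made up for illustration:

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, ValidationError

class Aspect(str, Enum):
    FLAVOR = "flavor"
    TEXTURE = "texture"

class ReviewAspects(BaseModel):
    aspect: List[Aspect]

# Pretend this came back from the LLM; "smell" was never defined.
raw_llm_output = {"aspect": ["flavor", "smell"]}

try:
    ReviewAspects(**raw_llm_output)
except ValidationError as e:
    print(e)  # rejects "smell" as an invalid enum value
```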
Here's a PR that exposes the selection and option nodes. It looks like the existing code already works correctly, at least for TypeScript descriptors: https://github.com/eyurtsev/kor/pull/85
@smwitkowski Working on a pydantic adapter here: https://github.com/eyurtsev/kor/pull/86
Shared a screenshot showing the conversion from pydantic into the internal object. Validation isn't hooked up yet, but it shouldn't be difficult to add.
I'll need to work out some details (e.g., whether self-referential types are allowed).
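Roughly, the adapter should let you write something like this (a sketch only; the final function name and signature may differ from what's in the PR):

```python
from typing import List

from pydantic import BaseModel

from kor import from_pydantic  # name as of the eventual release; may differ in the PR

class ReviewAspects(BaseModel):
    aspect: List[str]

# Converts the pydantic model into kor's internal Object representation
# and returns a validator that can check LLM output against the model.
schema, validator = from_pydantic(
    ReviewAspects,
    description="Extracts aspects from a review.",
)
```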
> Both of these aspects of prompt generation only control the inputs into the LLM, but they don't help validate the outputs. As the LLMs become better, I'm betting they'll start understanding the schema better and making fewer mistakes.
> Regardless, in the meantime validation needs to happen on the output of the LLM.
Yes to all of this - I also expect that including explicit instructions in the prompt, in addition to the schema, would help ensure that the LLM understands what is "valid" and what is not.
I ran my example today on ~10K documents and I was given many aspects that were not defined. While using better LLMs would likely remove this issue (FWIW I used `gpt-3.5-turbo`, not `gpt-4`), it's quite costly to default to higher-end LLMs today.
> I also expect that including explicit instructions in the prompt, in addition to the schema, would help ensure that the LLM understands what is "valid" and what is not.
That's probably right.
If you make modifications to the prompt that seem to help, I'd be very interested to know what they are.
Kor will likely need a benchmark dataset at some point to help with prompt experimentation.
FWIW, at the moment all of the important instructions are in the system message, which OpenAI claims the model doesn't pay as much attention to.
Another possibility is doing a second pass on the extraction with an LLM to correct deviations from the schema.
@smwitkowski
Preview of validation using pydantic https://github.com/eyurtsev/kor/blob/df6fce8bc49037a61da0945d791349fb4f2b98f0/docs/source/validation.ipynb
PR merged into main. Going to close this issue for now.
Hi there, do we have an `enum` or `valid_values` attribute for the `Text` object we can use?
@pedrocr83 The easiest way to achieve this is using pydantic. https://eyurtsev.github.io/kor/validation.html
It supports Enums as well as arbitrary validation logic using pydantic field validators.
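For example, a minimal sketch using pydantic v1-style validators (the model here is made up for illustration):

```python
from enum import Enum

from pydantic import BaseModel, validator

class Flavor(str, Enum):
    SWEET = "sweet"
    SALTY = "salty"

class Review(BaseModel):
    flavor: Flavor  # the Enum restricts this field to the listed values
    rating: int

    @validator("rating")
    def rating_in_range(cls, v):
        # Arbitrary validation logic beyond what the Enum covers.
        if not 1 <= v <= 5:
            raise ValueError("rating must be between 1 and 5")
        return v
```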
Hey - thanks for creating kor! I'm eager to start using it in my day job, in particular for aspect-based sentiment analysis.
I'm looking at reviews on food items and would like to label each review with an "aspect" if that aspect is mentioned in the review. You could begin to imagine which aspects are most relevant for this use case; flavor and texture are two that come to mind immediately. I want to limit this labeling exercise to only the aspects that I am interested in.
I expect that including these instructions in the prompt would be sufficient, and I can think of two ways this could be incorporated into kor. The most straightforward is to allow the end user to alter the prompt directly, but I'd prefer the second solution below, which seems like a better long-term approach.
`AbstractSchemaNode` could accept a new parameter `valid_values`, which would indicate what the valid values are for a given key defined in the attributes. Then, that `valid_values` would be passed to `generate_instruction_segment` along with the node, and the prompt would be updated to include instructions on how to restrict which values are returned for `aspect`.
https://github.com/eyurtsev/kor/blob/c3066c11adab0b8fceaebc91f91c43984f206e04/kor/prompts.py#L89-L93
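To make the suggestion concrete, usage might look something like this (`valid_values` does not exist in kor today; this is purely illustrative):

```python
from kor.nodes import Object, Text

schema = Object(
    id="review_aspect",
    description="Extracts aspects from a review.",
    attributes=[
        Text(
            id="aspect",
            description="Aspects mentioned in the review",
            valid_values=["flavor", "texture"],  # hypothetical parameter, not a real kor argument
        )
    ],
)
```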
Happy to help contribute to this if it seems helpful!