eyurtsev / kor

LLM(😽)
https://eyurtsev.github.io/kor/
MIT License
1.61k stars 88 forks

kor.exceptions.ParseError(pandas.errors.ParserError('Error tokenizing data. C error: Expected 1 fields in line 4, saw 3\n #283

Closed dantepalacio closed 5 months ago

dantepalacio commented 5 months ago

Hi, I want to use kor with the open-source OpenChat model (https://huggingface.co/openchat/openchat-3.5-0106). I know that this model expects specific suffixes in the prompt. Here is an example:

GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
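To make the format concrete, here is a minimal sketch that builds such a prompt from a list of turns. The helper name `format_openchat_prompt` and the `END_OF_TURN` constant are made up for illustration and are not part of kor, LangChain, or the OpenChat tooling. It only reproduces the example string above.

```python
END_OF_TURN = "<|end_of_turn|>"  # turn separator taken from the example above


def format_openchat_prompt(turns):
    """Join (role, text) turns and leave the final Assistant turn open.

    Hypothetical helper: mirrors the example prompt, nothing more.
    """
    parts = [f"GPT4 Correct {role}: {text}{END_OF_TURN}" for role, text in turns]
    return "".join(parts) + "GPT4 Correct Assistant:"


prompt = format_openchat_prompt(
    [
        ("User", "Hello"),
        ("Assistant", "Hi"),
        ("User", "How are you today?"),
    ]
)
print(prompt)
```

Printing `prompt` should give the same single-line string as the example.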

My current code:

import torch

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "openchat/openchat-3.5-0106"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Named `pipe` so the imported `pipeline` factory is not shadowed.
pipe = pipeline(
    "text-generation",  # task
    model=model_id,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    max_length=1000,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
hf = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature':0})

from langchain.prompts import PromptTemplate

INSTRUCTION_TEMPLATE = PromptTemplate(
    input_variables=["type_description", "format_instructions"],
    template='''GPT4 Correct System:Your goal is to extract structured information from the user's input that
matches the form described below. When extracting information please make
sure it matches the type information exactly. Do not add any attributes that
do not appear in the schema shown below.<|end_of_turn|>\n\n
GPT4 Correct User:
{type_description}\n\n
{format_instructions}<|end_of_turn|>\n\n
GPT4 Correct Assistant:''')

from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number

schema = Object(
    id="person",
    description="Personal information",
    examples=[
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ],
    attributes=[
        Text(
            id="first_name",
            description="The first name of a person.",
        )
    ],
    many=True,
)

chain = create_extraction_chain(hf, schema, instruction_template=INSTRUCTION_TEMPLATE)
chain.run("My name is Bobby. My brother's name Joe.")

When I insert these suffixes into the prompt and run the chain, generation completes, but the output contains a parse error:

{'data': {},
 'raw': "GPT4 Correct System:Your goal is to extract structured information from the user's input that\nmatches the form described below. When extracting information please make\nsure it matches the type information exactly. Do not add any attributes that\ndo not appear in the schema shown below.<|end_of_turn|>\n\n\nGPT4 Correct User:\n```TypeScript\n\nperson: Array<{ // Personal information\n first_name: string // The first name of a person.\n}>\n```\n\n\n\nPlease output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. \n Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.<|end_of_turn|>\n\n\nGPT4 Correct Assistant:\n\nInput: Alice and Bob are friends\nOutput: first_name\nAlice\nBob\n\nInput: My name is Bobby. My brother's name Joe.\nOutput: first_name\nBobby\nJoe",
 'errors': [kor.exceptions.ParseError(pandas.errors.ParserError('Error tokenizing data. C error: Expected 1 fields in line 4, saw 3\n'))],
 'validated_data': {}}

I realize this is caused by the suffixes, but how can I avoid it? What do I need to rewrite?
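One way to see why the suffixes trip the parser is to count `|`-separated fields per line in the start of the `raw` text. This is a sketch under an assumption: that the whole `raw` string, echoed prompt included, is handed to a `|`-delimited CSV parser. The pandas error and the "use a | as the delimiter" instruction in the prompt suggest this, but it is not confirmed here. The standard-library `csv` module stands in for pandas, so this shows the field counts only, not the exact exception.

```python
import csv
import io

# First four lines of the `raw` text from the output above (the echoed prompt).
raw_head = (
    "GPT4 Correct System:Your goal is to extract structured information from the user's input that\n"
    "matches the form described below. When extracting information please make\n"
    "sure it matches the type information exactly. Do not add any attributes that\n"
    "do not appear in the schema shown below.<|end_of_turn|>\n"
)

# Split each line on "|" the way a "|"-delimited CSV reader would.
field_counts = [len(row) for row in csv.reader(io.StringIO(raw_head), delimiter="|")]
print(field_counts)  # [1, 1, 1, 3]
```

The first line sets the expectation of one field, and line 4 yields three because `<|end_of_turn|>` contains two `|` characters. That matches "Expected 1 fields in line 4, saw 3".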

If I don't pass the instruction_template parameter to create_extraction_chain, the chain can take 15-20 minutes to run and returns complete nonsense.

Any help would be appreciated.
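A possible direction, sketched under the assumption that the `raw` field contains the echoed prompt followed by the completion (which the output above suggests): strip the prompt from the generated text before it reaches the CSV parser. The helper `strip_prompt_echo` and the shortened prompt below are hypothetical. Where such a step would hook into the chain is not shown.

```python
def strip_prompt_echo(prompt: str, generated: str) -> str:
    """Return only the completion when `generated` begins with `prompt`.

    Hypothetical helper, not part of kor or LangChain.
    """
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated  # nothing echoed, leave the text untouched


# Shortened, made-up prompt that ends the way the `raw` text above does.
prompt = (
    "GPT4 Correct User: extract names<|end_of_turn|>\n"
    "GPT4 Correct Assistant:\n\n"
    "Input: My name is Bobby. My brother's name Joe.\nOutput:"
)
generated = prompt + " first_name\nBobby\nJoe"

completion = strip_prompt_echo(prompt, generated).strip()
print(completion)
```

The remaining text holds only the header and the two names, with no `|` characters from the special tokens. The transformers text-generation pipeline also accepts a `return_full_text=False` argument that may do the same at the source. Whether it is honored through the LangChain wrapper in this setup is untested here.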