eyurtsev / kor

LLM(😽)
https://eyurtsev.github.io/kor/
MIT License

When I use the sample from the official document and reduce the number of attributes from 3 to 2, an error occurs. #130

Closed ahalamora1981 closed 1 year ago

ahalamora1981 commented 1 year ago

This is the sample I copied from the official documentation; I removed the age attribute to bring the number of attributes down to 2. When I run "chain.predict_and_parse(text=text)["data"]", an error occurs.

I tried some other cases with my own custom schemas, and it seems that whenever the number of attributes is 2 and I add examples on the Object, the error occurs. If I don't add examples on the Object and instead add examples on the attributes, it works fine.

from kor.nodes import Object, Text

schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.",
            [
                {"first_name": "John", "last_name": "Smith"},
                {"first_name": "Jane", "last_name": "Doe"},
            ],
        )
    ],
    many=True,
)

Error:

ValueError                                Traceback (most recent call last)
Cell In[178], line 1
----> 1 output = chain.predict_and_parse(text=text)["data"]

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:171, in LLMChain.predict_and_parse(self, **kwargs)
    169 def predict_and_parse(self, **kwargs: Any) -> Union[str, List[str], Dict[str, str]]:
    170     """Call predict and then parse the results."""
--> 171     result = self.predict(**kwargs)
    172     if self.prompt.output_parser is not None:
    173         return self.prompt.output_parser.parse(result)

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:151, in LLMChain.predict(self, **kwargs)
    137 def predict(self, **kwargs: Any) -> str:
    138     """Format prompt with kwargs and pass to LLM.
    139
    140     Args:
   (...)
    149         completion = llm.predict(adjective="funny")
    150     """
--> 151     return self(kwargs)[self.output_key]

File D:\Dev\venv310\lib\site-packages\langchain\chains\base.py:116, in Chain.__call__(self, inputs, return_only_outputs)
    114 except (KeyboardInterrupt, Exception) as e:
    115     self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 116     raise e
    117 self.callback_manager.on_chain_end(outputs, verbose=self.verbose)
    118 return self.prep_outputs(inputs, outputs, return_only_outputs)

File D:\Dev\venv310\lib\site-packages\langchain\chains\base.py:113, in Chain.__call__(self, inputs, return_only_outputs)
    107 self.callback_manager.on_chain_start(
    108     {"name": self.__class__.__name__},
    109     inputs,
    110     verbose=self.verbose,
    111 )
    112 try:
--> 113     outputs = self._call(inputs)
    114 except (KeyboardInterrupt, Exception) as e:
    115     self.callback_manager.on_chain_error(e, verbose=self.verbose)

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:57, in LLMChain._call(self, inputs)
     56 def _call(self, inputs: Dict[str, Any]) -> Dict[str, str]:
---> 57     return self.apply([inputs])[0]

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:118, in LLMChain.apply(self, input_list)
    116 def apply(self, input_list: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    117     """Utilize the LLM generate method for speed gains."""
--> 118     response = self.generate(input_list)
    119     return self.create_outputs(response)

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:61, in LLMChain.generate(self, input_list)
     59 def generate(self, input_list: List[Dict[str, Any]]) -> LLMResult:
     60     """Generate LLM result from inputs."""
---> 61     prompts, stop = self.prep_prompts(input_list)
     62     return self.llm.generate_prompt(prompts, stop)

File D:\Dev\venv310\lib\site-packages\langchain\chains\llm.py:79, in LLMChain.prep_prompts(self, input_list)
     77 for inputs in input_list:
     78     selected_inputs = {k: inputs[k] for k in self.prompt.input_variables}
---> 79     prompt = self.prompt.format_prompt(**selected_inputs)
     80     _colored_text = get_colored_text(prompt.to_string(), "green")
     81     _text = "Prompt after formatting:\n" + _colored_text

File D:\Dev\venv310\lib\site-packages\kor\prompts.py:82, in ExtractionPromptTemplate.format_prompt(self, text)
     79 """Format the prompt."""
     80 text = format_text(text, input_formatter=self.input_formatter)
     81 return ExtractionPromptValue(
---> 82     string=self.to_string(text), messages=self.to_messages(text)
     83 )

File D:\Dev\venv310\lib\site-packages\kor\prompts.py:97, in ExtractionPromptTemplate.to_string(self, text)
     95 """Format the template to a string."""
     96 instruction_segment = self.format_instruction_segment(self.node)
---> 97 encoded_examples = self.generate_encoded_examples(self.node)
     98 formatted_examples: List[str] = []
    100 for in_example, output in encoded_examples:

File D:\Dev\venv310\lib\site-packages\kor\prompts.py:133, in ExtractionPromptTemplate.generate_encoded_examples(self, node)
    131 """Generate encoded examples."""
    132 examples = generate_examples(node)
--> 133 return encode_examples(
    134     examples, self.encoder, input_formatter=self.input_formatter
    135 )

File D:\Dev\venv310\lib\site-packages\kor\encoders\encode.py:59, in encode_examples(examples, encoder, input_formatter)
     52 def encode_examples(
     53     examples: Sequence[Tuple[str, str]],
     54     encoder: Encoder,
     55     input_formatter: InputFormatter = None,
     56 ) -> List[Tuple[str, str]]:
     57     """Encode the output using the given encoder."""
---> 59     return [
     60         (
     61             format_text(input_example, input_formatter=input_formatter),
     62             encoder.encode(output_example),
     63         )
     64         for input_example, output_example in examples
     65     ]

File D:\Dev\venv310\lib\site-packages\kor\encoders\encode.py:62, in <listcomp>(.0)
     59 return [
     60     (
     61         format_text(input_example, input_formatter=input_formatter),
---> 62         encoder.encode(output_example),
     63     )
     64     for input_example, output_example in examples
     65 ]

File D:\Dev\venv310\lib\site-packages\kor\encoders\csv_data.py:77, in CSVEncoder.encode(self, data)
     74 if not isinstance(data_to_output, list):
     75     # Should always output records for pd.Dataframe
     76     data_to_output = [data_to_output]
---> 77 table_content = pd.DataFrame(data_to_output, columns=field_names).to_csv(
     78     index=False, sep=DELIMITER
     79 )
     81 if self.use_tags:
     82     return wrap_in_tag("csv", table_content)

File D:\Dev\venv310\lib\site-packages\pandas\core\frame.py:762, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    754     mgr = arrays_to_mgr(
    755         arrays,
    756         columns,
   (...)
    759         typ=manager,
    760     )
    761 else:
--> 762     mgr = ndarray_to_mgr(
    763         data,
    764         index,
    765         columns,
    766         dtype=dtype,
    767         copy=copy,
    768         typ=manager,
    769     )
    770 else:
    771     mgr = dict_to_mgr(
    772         {},
    773         index,
   (...)
    776         typ=manager,
    777     )

File D:\Dev\venv310\lib\site-packages\pandas\core\internals\construction.py:349, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    344 # _prep_ndarraylike ensures that values.ndim == 2 at this point
    345 index, columns = _get_axes(
    346     values.shape[0], values.shape[1], index=index, columns=columns
    347 )
--> 349 _check_values_indices_shape_match(values, index, columns)
    351 if typ == "array":
    353     if issubclass(values.dtype.type, str):

File D:\Dev\venv310\lib\site-packages\pandas\core\internals\construction.py:420, in _check_values_indices_shape_match(values, index, columns)
    418 passed = values.shape
    419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (1, 1), indices imply (1, 2)

elyngved commented 1 year ago

I ran into this too. I was also trying 2 attributes and 1 example. When I added a 3rd attribute it worked.

eyurtsev commented 1 year ago

Thanks for reporting the issue!

Could you tell me your pandas version?

I'm out this week, will address on Monday. In the meantime, as a workaround, you could try using JSON encoding.
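To make the suggestion concrete: kor's CSV encoder routes example outputs through a pandas DataFrame (the frame that raises above), while a JSON encoder serializes the structure directly. The stdlib sketch below shows what the two encodings of this schema's example output look like; the pipe delimiter is an assumption mirroring kor's CSV encoder, and plain csv/json stand in for kor internals. (As it turns out later in the thread, the coercion happens before encoding, so switching encoders alone doesn't help.)

```python
import csv
import io
import json

# The example output the schema promises: two records, two fields each.
records = [
    {"first_name": "John", "last_name": "Smith"},
    {"first_name": "Jane", "last_name": "Doe"},
]

# JSON encoding serializes the list of records as-is.
json_encoded = json.dumps(records)

# CSV encoding must commit to fixed columns up front; kor's CSVEncoder
# does this via pandas, sketched here with the stdlib csv module.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["first_name", "last_name"], delimiter="|")
writer.writeheader()
writer.writerows(records)
csv_encoded = buf.getvalue()
```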

eyurtsev commented 1 year ago

OK, traced this down. It's not pandas, but pydantic.

I'll try to figure out a fix. For a sense of what's going on:

[screenshot]

eyurtsev commented 1 year ago

@ahalamora1981 @elyngved -- Changing encoders won't help. It looks like pydantic auto-coercion is kicking in and interpreting the union field as a dict being initialized from an iterable, rather than as a list of dicts.

Essentially doing something like this:

[screenshot]

This magical behavior happens when using 2 fields. I'll try to push a fix in the next few days.
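For readers wondering why exactly 2 fields trigger it: iterating a dict yields its keys, so a dict with exactly 2 keys looks to the dict() constructor like a (key, value) pair, and dict() will happily consume a whole list of such dicts; with 3 keys the same call raises. The plain-Python sketch below (no pydantic involved) illustrates the coercion path described above, under the assumption that this is the dict-from-iterable behavior the union field hits:

```python
# Each record has exactly 2 keys. Iterating a dict yields its keys,
# so each record looks to dict() like a 2-element (key, value) pair.
records = [
    {"first_name": "John", "last_name": "Smith"},
    {"first_name": "Jane", "last_name": "Doe"},
]

# Both records silently collapse into a single bogus mapping:
coerced = dict(records)
print(coerced)  # {'first_name': 'last_name'}

# With 3 keys per record, dict() raises instead of silently coercing,
# which matches the report that adding a third attribute avoids the bug.
records_3 = [{"first_name": "John", "last_name": "Smith", "age": "23"}]
try:
    dict(records_3)
except ValueError as e:
    print(e)  # dictionary update sequence element #0 has length 3; 2 is required
```

Downstream, the CSV encoder then receives something other than the list of two 2-key records the schema implies, which is where the pandas shape mismatch in the traceback comes from.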

eyurtsev commented 1 year ago

Fix released in Version 0.8.1

eyurtsev commented 1 year ago

This issue has been fixed in pydantic v2 (which is still in pre-release)

eyurtsev commented 1 year ago

Closing issue -- fix is merged and released and unit tests have been added.

elyngved commented 1 year ago

@eyurtsev thank you for tracking it down! 🙇