eyurtsev / kor

LLM(😽)
https://eyurtsev.github.io/kor/
MIT License

Error in extract_from_documents function #127

Closed (hitsense closed this issue 1 year ago)

hitsense commented 1 year ago

I am unable to run the last step in the document extraction article. The function extract_from_documents returns the error below:

[TypeError("__init__() got an unexpected keyword argument 'line_terminator'"),
 TypeError("__init__() got an unexpected keyword argument 'line_terminator'"),
 TypeError("__init__() got an unexpected keyword argument 'line_terminator'"),
 TypeError("__init__() got an unexpected keyword argument 'line_terminator'")]

It looks like something changed on the langchain side.

eyurtsev commented 1 year ago

Hello @hitsense :wave:. Thanks for taking the library for a spin!

kor doesn't have a variable called line_terminator anywhere in its codebase, so it's either a bug in langchain or a typo in the code that you're using. Try bumping langchain. If that doesn't help, look for any place in the code you're using that sets line_terminator, as that would be the source of the error. Good luck!

hitsense commented 1 year ago

Hi @eyurtsev, I get this error for your article as well. I am not using anything else that uses line_terminator. Can you try running your document extraction article and check if you get the same error? I am using the latest version of langchain, and if you are using a specific version of langchain, then let me know.

eyurtsev commented 1 year ago

I can confirm that this runs fine for me using the newest langchain.

Could you include your stack trace for the exceptions?

You could try running the code with return_exceptions=False so that the exception is raised instead of returned:

    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=False
    )
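To illustrate why this flag matters, here is a minimal sketch of how `asyncio.gather` treats exceptions. The `fail` and `run` helpers are hypothetical stand-ins and are not part of kor; only `asyncio.gather` and its `return_exceptions` parameter come from the traceback and the call above.

```python
import asyncio


async def fail(i: int) -> int:
    # Hypothetical stand-in for one extraction task that always fails.
    raise TypeError(f"task {i} failed")


async def run(return_exceptions: bool):
    tasks = [fail(i) for i in range(3)]
    return await asyncio.gather(*tasks, return_exceptions=return_exceptions)


# With return_exceptions=True the exceptions come back as list items,
# much like the list of TypeError objects in the original report.
collected = asyncio.run(run(return_exceptions=True))
print(collected)

# With return_exceptions=False an exception propagates to the caller,
# so the full traceback becomes visible.
try:
    asyncio.run(run(return_exceptions=False))
    raised = None
except TypeError as exc:
    raised = exc
print(repr(raised))
```

In a notebook, where an event loop is already running, `await run(...)` would be used instead of `asyncio.run(...)`.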

hitsense commented 1 year ago

Something is wrong with the encoder and pandas. Here is the traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-75-05f5682b3d08> in <cell line: 1>()
      1 with get_openai_callback() as cb:
----> 2     document_extraction_results = await extract_from_documents(
      3         chain, txtt, max_concurrency=5, use_uid=False, return_exceptions=False
      4     )
      5 

17 frames
/usr/local/lib/python3.9/dist-packages/kor/extraction/api.py in extract_from_documents(chain, documents, max_concurrency, use_uid, extraction_uid_function, return_exceptions)
    170         )
    171 
--> 172     results = await asyncio.gather(*tasks, return_exceptions=return_exceptions)
    173     return results

/usr/local/lib/python3.9/dist-packages/kor/extraction/api.py in _extract_from_document_with_semaphore(semaphore, chain, document, uid, source_uid)
     26     async with semaphore:
     27         extraction_result: Extraction = cast(
---> 28             Extraction, await chain.apredict_and_parse(text=document.page_content)
     29         )
     30         return {

/usr/local/lib/python3.9/dist-packages/langchain/chains/llm.py in apredict_and_parse(self, **kwargs)
    179     ) -> Union[str, List[str], Dict[str, str]]:
    180         """Call apredict and then parse the results."""
--> 181         result = await self.apredict(**kwargs)
    182         if self.prompt.output_parser is not None:
    183             return self.prompt.output_parser.parse(result)

/usr/local/lib/python3.9/dist-packages/langchain/chains/llm.py in apredict(self, **kwargs)
    165                 completion = llm.predict(adjective="funny")
    166         """
--> 167         return (await self.acall(kwargs))[self.output_key]
    168 
    169     def predict_and_parse(self, **kwargs: Any) -> Union[str, List[str], Dict[str, str]]:

/usr/local/lib/python3.9/dist-packages/langchain/chains/base.py in acall(self, inputs, return_only_outputs)
    152             else:
    153                 self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 154             raise e
    155         if self.callback_manager.is_async:
    156             await self.callback_manager.on_chain_end(outputs, verbose=self.verbose)

/usr/local/lib/python3.9/dist-packages/langchain/chains/base.py in acall(self, inputs, return_only_outputs)
    146             )
    147         try:
--> 148             outputs = await self._acall(inputs)
    149         except (KeyboardInterrupt, Exception) as e:
    150             if self.callback_manager.is_async:

/usr/local/lib/python3.9/dist-packages/langchain/chains/llm.py in _acall(self, inputs)
    133 
    134     async def _acall(self, inputs: Dict[str, Any]) -> Dict[str, str]:
--> 135         return (await self.aapply([inputs]))[0]
    136 
    137     def predict(self, **kwargs: Any) -> str:

/usr/local/lib/python3.9/dist-packages/langchain/chains/llm.py in aapply(self, input_list)
    121     async def aapply(self, input_list: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    122         """Utilize the LLM generate method for speed gains."""
--> 123         response = await self.agenerate(input_list)
    124         return self.create_outputs(response)
    125 

/usr/local/lib/python3.9/dist-packages/langchain/chains/llm.py in agenerate(self, input_list)
     64     async def agenerate(self, input_list: List[Dict[str, Any]]) -> LLMResult:
     65         """Generate LLM result from inputs."""
---> 66         prompts, stop = await self.aprep_prompts(input_list)
     67         return await self.llm.agenerate_prompt(prompts, stop)
     68 

/usr/local/lib/python3.9/dist-packages/langchain/chains/llm.py in aprep_prompts(self, input_list)
     98         for inputs in input_list:
     99             selected_inputs = {k: inputs[k] for k in self.prompt.input_variables}
--> 100             prompt = self.prompt.format_prompt(**selected_inputs)
    101             _colored_text = get_colored_text(prompt.to_string(), "green")
    102             _text = "Prompt after formatting:\n" + _colored_text

/usr/local/lib/python3.9/dist-packages/kor/prompts.py in format_prompt(self, text)
     80         text = format_text(text, input_formatter=self.input_formatter)
     81         return ExtractionPromptValue(
---> 82             string=self.to_string(text), messages=self.to_messages(text)
     83         )
     84 

/usr/local/lib/python3.9/dist-packages/kor/prompts.py in to_string(self, text)
     95         """Format the template to a string."""
     96         instruction_segment = self.format_instruction_segment(self.node)
---> 97         encoded_examples = self.generate_encoded_examples(self.node)
     98         formatted_examples: List[str] = []
     99 

/usr/local/lib/python3.9/dist-packages/kor/prompts.py in generate_encoded_examples(self, node)
    131         """Generate encoded examples."""
    132         examples = generate_examples(node)
--> 133         return encode_examples(
    134             examples, self.encoder, input_formatter=self.input_formatter
    135         )

/usr/local/lib/python3.9/dist-packages/kor/encoders/encode.py in encode_examples(examples, encoder, input_formatter)
     57     """Encode the output using the given encoder."""
     58 
---> 59     return [
     60         (
     61             format_text(input_example, input_formatter=input_formatter),

/usr/local/lib/python3.9/dist-packages/kor/encoders/encode.py in <listcomp>(.0)
     60         (
     61             format_text(input_example, input_formatter=input_formatter),
---> 62             encoder.encode(output_example),
     63         )
     64         for input_example, output_example in examples

/usr/local/lib/python3.9/dist-packages/kor/encoders/csv_data.py in encode(self, data)
     75             # Should always output records for pd.Dataframe
     76             data_to_output = [data_to_output]
---> 77         table_content = pd.DataFrame(data_to_output, columns=field_names).to_csv(
     78             index=False, sep=DELIMITER
     79         )

/usr/local/lib/python3.9/dist-packages/pandas/core/generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options)
   3549         header: bool_t | list[str] = True,
   3550         index: bool_t = True,
-> 3551         index_label: IndexLabel | None = None,
   3552         mode: str = "w",
   3553         encoding: str | None = None,

/usr/local/lib/python3.9/dist-packages/pandas/io/formats/format.py in to_csv(self, path_or_buf, encoding, sep, columns, index_label, mode, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, errors, storage_options)
   1159         """
   1160         Render dataframe as comma-separated file.
-> 1161         """
   1162         from pandas.io.formats.csvs import CSVFormatter
   1163 

TypeError: __init__() got an unexpected keyword argument 'line_terminator'

eyurtsev commented 1 year ago

This is definitely pandas-related. I successfully ran the code with pandas 1.5.3 and with pandas 2.0.0. Which version are you using?
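One way to answer the version question is a minimal sketch like the following, which uses only the standard library. The `installed_version` helper is a hypothetical name introduced here for illustration; the distribution names are the ones discussed in this thread.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the versions of the packages that appear in the traceback.
for name in ("pandas", "langchain", "kor"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

This reads the installed package metadata. In a long-running notebook session, `pandas.__version__` reports the version actually loaded in the process, which may differ if a package was upgraded without restarting the runtime.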

This also looks like something internal to pandas, since kor doesn't pass a line_terminator keyword argument; it only passes index and sep.

One thing I noticed is that pandas 2.0.0 uses lineterminator rather than line_terminator: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

But I'm still not sure why that error would surface at all, since kor isn't specifying a line_terminator.

In the stack trace that you provided, kor's call is:

---> 77         table_content = pd.DataFrame(data_to_output, columns=field_names).to_csv(
     78             index=False, sep=DELIMITER
     79         )
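As a toy illustration of how this class of error can arise even when the outermost caller never mentions the keyword, consider a function that forwards an older keyword spelling to a constructor that only accepts the newer one. The `Formatter` class and `to_csv` function below are hypothetical and are not pandas code; whether a mismatch like this is what happened in the reported environment is an assumption.

```python
class Formatter:
    # Hypothetical stand-in for a class whose constructor only knows the
    # newer keyword spelling.
    def __init__(self, sep: str = ",", lineterminator: str = "\n") -> None:
        self.sep = sep
        self.lineterminator = lineterminator


def to_csv(sep: str = ",", line_terminator: str = "\n") -> Formatter:
    # Hypothetical caller that still forwards the older spelling.
    return Formatter(sep=sep, line_terminator=line_terminator)


try:
    # The outer call only passes sep, as kor does in the traceback.
    to_csv(sep="|")
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)
```

The outer call passes only `sep`, yet the TypeError names `line_terminator`, because the keyword is introduced one level further down the stack.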

hitsense commented 1 year ago

This was resolved after updating pandas from 1.4.4 to 1.5.3. Thanks @eyurtsev