cohere-ai / notebooks

Code examples and jupyter notebooks for the Cohere Platform
MIT License
451 stars 120 forks source link

Notebook for agents extracting PDF text. #186

Closed co-jason closed 3 months ago

CLAassistant commented 3 months ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

shaandes commented 3 months ago

Introduction: fixed typos

Generally, users are limited to text inputs when using large language models (LLMs), but agents enable the model to do more than ingest or output text information. Using tools, LLMs can call other APIs, save data, and much more. In this notebook, we will explore how we can leverage agents to extract information from PDFs. We will mimic an application where the user uploads PDF files and the agent extracts useful information. This can be useful when the text information has varying formats and you need to extract various types of information.

In the directory, we have a simple_invoice.pdf file. Everytime a user uploads the document, the agent will extract key information total_amount and invoice_number and then save it as CSV which then can be used in another application. We only extract two pieces of information in this demo, but users can extend the example and extract a lot more information.

Steps
1. extract_pdf() function extracts text data from the PDF using [unstructured](https://unstructured.io/) package. You can use other packages like PyPDF2 as well.
2. This extracted text is added to the prompt so the model can "see" the document.
3. The agent summarizes the document and passes that information to convert_to_json() function. This function makes another call to command model to convert the summary to json output. This separation of tasks is useful when the text document is complicated and long. Therefore, we first distill the information and ask another model to convert the text into json object. This is useful so each model or agent focuses on its own task without suffering from long context.
4. Then the json object goes through a check to make sure all keys are present and gets saved as a csv file. When the document is too long or the task is too complex, the model may fail to extract all information. These checks are then very useful because they give feedback to the model so it can adjust it's parameters to retry.

Package dependencies: can we add some one-liners to the notebook (or call out what’s needed for unstructured as I know some of these have nested dependencies) e.g.

!pip install cohere
!pip install unstructured

Print statements Convert this: print(f"\nrunning {counter}th step.")

To: print(f"\nrunning step {counter}.")