eyurtsev / kor

LLM(😽)
https://eyurtsev.github.io/kor/
MIT License

Various types of examples for various types of documents #207

Closed MauroLondon closed 1 year ago

MauroLondon commented 1 year ago

Hello, how are you? I want to tell you that you have created a great solution; for something so recent, it works very well.

Now, my case is the following: I want to parse various types of documents that share the same fields, but not in the same order. I tried to create an example for each case; this works fine when the examples for that case come first, but mixing examples from multiple types does not work well. So I want to find a way to select, depending on the type of document, the appropriate examples for that type. Any idea of the best way to do this?

eyurtsev commented 1 year ago

Hi @MauroLondon,

> have the same fields, but not in the same order

What do "fields" mean in this case?

Is this referring to the fields of the extraction output? i.e., is the goal to extract the same information from the different documents?

Or does it mean that the source documents are forms that have similar fields? (e.g., job resumes, where the goal is to detect common fields like first name, last name, education history, etc.)

> I want to find a way to select, depending on the type of document, the appropriate examples for that type

Some ideas:

Are the different types of documents easy to tell apart? If so, you could implement a classification step and based on the predicted class route the document to the appropriate extraction chain. This assumes that it's possible to create a few different extraction chains (each using different schema and examples) to represent the relevant extraction scenarios.
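The classify-then-route idea can be sketched as follows. Everything here is hypothetical: `classify_document` is a toy keyword-based classifier (a real model could replace it), and the per-type chains are placeholders standing in for Kor extraction chains built with different schemas and examples.

```python
# Sketch of classify-then-route: detect the document type, then dispatch
# to the extraction chain built for that type. All names are illustrative.
import re

def classify_document(text: str) -> str:
    """Toy classifier: routes on a keyword. Replace with a real model."""
    if re.search(r"subtotal", text, re.IGNORECASE):
        return "detailed"
    return "summary"

# One extraction chain (schema + examples) per document type.
# In Kor these would be real chains from create_extraction_chain(...);
# lambdas are placeholders so the sketch is self-contained.
chains = {
    "detailed": lambda text: {"type": "detailed", "text": text},
    "summary": lambda text: {"type": "summary", "text": text},
}

def extract(text: str):
    doc_type = classify_document(text)
    return chains[doc_type](text)
```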

Another option is to create a large set of examples, and dynamically create a chain with a subset of the most relevant examples. The examples can be chosen based on semantic similarity. To do this, you could treat each example as a "document" with langchain, create a retriever and retrieve the most relevant examples that could help extract content from a particular target document.
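The example-selection idea above can be sketched without langchain by ranking candidate examples against the target document. TF-IDF cosine similarity is used here as a lightweight stand-in for semantic embeddings; in practice you would index the examples in a vector store and use a retriever, as described next. The example strings are invented for illustration.

```python
# Sketch: pick the k most relevant few-shot examples for a target document,
# using TF-IDF cosine similarity as a cheap stand-in for embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Candidate few-shot examples, one per document layout (made up).
examples = [
    "Invoice No: 123 | Date: 2023-01-05 | Amount: 100.00",
    "Fecha de factura: 05/01/2023, Importe: 100,00",
    "Type, Invoice Number, Invoice Date, Invoice Amount",
]

def select_examples(target: str, k: int = 2) -> list[str]:
    """Return the k examples most similar to the target document."""
    vec = TfidfVectorizer().fit(examples + [target])
    sims = cosine_similarity(vec.transform([target]), vec.transform(examples))[0]
    ranked = sorted(range(len(examples)), key=lambda i: sims[i], reverse=True)
    return [examples[i] for i in ranked[:k]]
```

The selected examples would then be passed to the chain built for the target document.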

MauroLondon commented 1 year ago

Hi @eyurtsev, I can't help but admire the dedication and effort you put into answering questions and improving Kor.

I apologize if I didn't explain myself well. When I say "fields", I mean the names of the id attributes that I want to extract.

I have several accounts payable documents; some come in PDF format, others in Excel. The column names are usually the same: type, invoice number, invoice date, invoice amount. What often differs is the order of the headers, the date format, or the invoice number format. Other documents include additional fields in the header. Some have subtotals; others just have the grand total. So I've realized that I can't put all the examples in the same list.

Indeed, I could implement a classification step using, for example, regular expressions to detect the file type and, based on that, select the corresponding set of examples; however, I feel that this approach strays away from AI.

"Another option is to create a large set of examples, and dynamically create a chain with a subset of the most relevant examples. The examples can be chosen based on semantic similarity."

Regarding that last approach, I would like to do it that way, although I don't have much idea how (I've only been developing AI solutions for a month and a half). I'll dig deeper and try to come up with a solution.

In this sense, I have already built a solution to self-audit the results: with openai.ChatCompletion I obtain the grand total of the invoice amounts, which I then compare against the sum of all the invoice amounts extracted with Kor. This lets me see, at least in part, whether the record parser is working correctly.
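The self-audit described above boils down to a reconciliation check. A minimal sketch, with invented record fields and a tolerance for floating-point rounding:

```python
# Sketch of the self-audit: compare an independently obtained grand total
# against the sum of per-record amounts produced by the extraction step.
def audit_totals(records: list[dict], reported_total: float,
                 tol: float = 0.01) -> bool:
    """Return True when the extracted amounts add up to the reported total."""
    extracted_sum = sum(r["invoice_amount"] for r in records)
    return abs(extracted_sum - reported_total) <= tol

# Hypothetical records as the extraction chain might return them.
records = [
    {"invoice_number": "A-1", "invoice_amount": 120.50},
    {"invoice_number": "A-2", "invoice_amount": 79.50},
]
```

A mismatch flags documents where some rows were likely missed or misparsed.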

So I also hope to build a solution for the semantic approach you propose.

Thanks in advance

eyurtsev commented 1 year ago

If you can use regular expressions or other methods to create features that differentiate the documents, then you can feed those features into a classic logistic regression or random forest to get an AI-based classification approach: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Alternatively, if the documents can be told apart based on the words present in them, a bag-of-words representation would make a good set of features (see sklearn).
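The bag-of-words suggestion can be sketched with sklearn's `CountVectorizer` feeding a `LogisticRegression`. The training texts and labels below are made up for illustration; real training data would come from labeled documents.

```python
# Sketch: bag-of-words features + logistic regression for document-type
# classification. Training data is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "type invoice number invoice date invoice amount subtotal grand total",
    "subtotal subtotal grand total invoice amount",
    "invoice number invoice date invoice amount grand total",
    "grand total only invoice amount date",
]
train_labels = ["detailed", "detailed", "summary", "summary"]

# CountVectorizer turns each text into word counts; LogisticRegression
# learns which words indicate which document type.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)
```

The predicted label can then drive the routing step from the earlier suggestion.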

For the semantic approach, follow the retrievers guide in langchain: https://python.langchain.com/docs/modules/data_connection/retrievers/

You can also look at the langchain implementation of few-shot prompt templates: https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/few_shot_examples#feed-examples-into-exampleselector (SemanticSimilarityExampleSelector)

eyurtsev commented 1 year ago

@MauroLondon closing this issue for now, as it is not a bug. Feel free to use GitHub Discussions for open-ended questions.