Introduce `EXTRACT_COLUMNS` to extract structured tables from unstructured text

xzdandy commented 10 months ago

Search before asking

[X] I have searched the EvaDB issues and found no similar feature requests.

Description

EXTRACT_COLUMNS will be similar to EXTRACT_OBJECT for videos in that it is not a standard user-defined function. In the optimizer, it will be translated to a valid EvaDB query plan tree with multiple functions and operators.

Example Usage

EXTRACT_COLUMNS(
    "gpt-3.5-turbo",
    "faiss",
    [
        ["name", "name of the user profile", "logicx"],
        ["country", "country the user comes from", "United States"],
        ["age", "age of the user", 30],
    ],
    input_source
)

The first argument specifies the LLM to use.
The second argument specifies the vector database to use. If the second argument is "", then RAG will not be used. In the first release of EXTRACT_COLUMNS, we will not support RAG.
The third argument specifies the columns we want to extract. For every column, we specify:
- the name of the column
- a natural language description of how to extract that column
- an example value; the column type is inferred from the example value.
The fourth argument is the input relation (input_source in the example above).
The output is a batched pandas DataFrame that contains the extracted columns. This is a one-to-one mapping with the input relation.
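
To make the "column type is inferred from the example value" rule concrete, here is a minimal Python sketch. It is not EvaDB code: `ColumnSpec` and `parse_column_specs` are hypothetical names used only for illustration, and the set of supported types (bool, int, float, str) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ColumnSpec:
    """Hypothetical holder for one [name, description, example] entry."""
    name: str          # output column name
    description: str   # natural language hint on how to extract the column
    example: Any       # example value, also used to infer the column type

    @property
    def dtype(self) -> type:
        # bool is checked before int because bool is a subclass of int in Python.
        for candidate in (bool, int, float, str):
            if isinstance(self.example, candidate):
                return candidate
        raise TypeError(f"unsupported example value for column {self.name!r}")


def parse_column_specs(raw: List[list]) -> List[ColumnSpec]:
    """Validate the third argument of EXTRACT_COLUMNS and wrap each entry."""
    specs = []
    for entry in raw:
        if len(entry) != 3:
            raise ValueError(f"expected [name, description, example], got {entry!r}")
        specs.append(ColumnSpec(*entry))
    return specs


specs = parse_column_specs([
    ["name", "name of the user profile", "logicx"],
    ["country", "country the user comes from", "United States"],
    ["age", "age of the user", 30],
])
print([(s.name, s.dtype.__name__) for s in specs])
# prints [('name', 'str'), ('country', 'str'), ('age', 'int')]
```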

If we want to provide more fine-grained control, for example tuning hyperparameters, we can also introduce a CREATE FUNCTION, which allows us to have a key-value based configuration.

@gaurav274 @jiashenC Please provide feedback. Thanks.

Use case

No response

Are you willing to submit a PR?

[ ] Yes I'd like to help by submitting a PR!

pchunduri6 commented 10 months ago

[
    ["name", "name of the user profile", "logicx"], 
    ["country", "country the user comes from", "United States"]
],

1. How would this translate to the LLM prompt in the background, e.g., one prompt for each column, or a single prompt combining all columns?
2. LLM extraction is brittle, so careful prompt engineering is required. Is it safe to use this structure without providing the option to engineer the prompt?
3. With RAG queries, there is information loss, so accurate extraction will get trickier. Tracking the output accuracy could be challenging.

xzdandy commented 10 months ago

Hi @pchunduri6, very good feedback.

1) In the backend, this will be translated to something similar to what we have in stargazers, shown below. Optimizations like batching will be applied accordingly. There are also new optimization opportunities like merging, for example when some columns are extracted in the predicate while others are extracted in the projection.

GPT35Azure("You are given a block of disorganized text extracted from the GitHub user profile of a user using an automated web scraper. The goal is to get structured results from this data.
                Extract the following fields from the text: name, country, city, email, occupation, programming_languages, topics_of_interest, social_media.
                If some field is not found, just output fieldname: N/A. Always return all the 8 field names. DO NOT add any additional text to your output.
                The topics_of_interest field must list a broad range of technical topics that are mentioned in any portion of the text.  This field is the most important, so add as much information as you can. Do not add non-technical interests.
                The programming_languages field can contain one or more programming languages out of only the following 4 programming languages - Python, C++, JavaScript, Java. Do not include any other language outside these 4 languages in the output. If the user is not interested in any of these 4 programming languages, output N/A.
                If the country is not available, use the city field to fill the country. For example, if the city is New York, fill the country as United States.
                If there are social media links, including personal websites, add them to the social media section. Do NOT add social media links that are not present.
                Here is an example (use it only for the output format, not for the content):

                name: logicx
                country: United States
                city: Atlanta
                email: abc@gatech.edu
                occupation: PhD student at Georgia Tech
                programming_languages: Python, Java
                topics_of_interest: Google Colab, fake data generation, Postgres
                social_media: https://www.logicx.io, https://www.twitter.com/logicx, https://www.linkedin.com/in/logicx
                ", stargazerscrapeddetails.extracted_text
                )

2) I have exactly the same thoughts. We can let users provide a full prompt, but non-advanced users may not know how to write a proper prompt for this purpose. The proposed interface is simpler and more user-friendly. I agree it can lose some accuracy, but power users can always write the fully customized query above in EvaDB. On this aspect, I am eager to see more feedback on the design.
3) Feedback on RAG is helpful. Is RAG useful for extracting column information, and if so, when? The current stargazers query does not use it. It is also easier to implement EXTRACT_COLUMNS without RAG. We need to evaluate the effort and the gains.
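
For point 1, a rough Python sketch of the "single prompt combining all columns" option, modeled on the stargazers prompt above. This is only an illustration under assumptions: `build_prompt` and `parse_response` are hypothetical helpers, not EvaDB functions, and the exact prompt wording would still need tuning.

```python
from typing import Dict, List, Optional, Tuple

# (name, description, example value), mirroring the proposed third argument.
Column = Tuple[str, str, object]


def build_prompt(columns: List[Column]) -> str:
    """Combine all column specs into one prompt in the stargazers style."""
    names = ", ".join(name for name, _, _ in columns)
    lines = [
        f"Extract the following fields from the text: {names}.",
        "If some field is not found, just output fieldname: N/A. "
        f"Always return all the {len(columns)} field names. "
        "DO NOT add any additional text to your output.",
    ]
    lines += [f"The {name} field is the {desc}." for name, desc, _ in columns]
    lines.append(
        "Here is an example (use it only for the output format, not for the content):"
    )
    lines += [f"{name}: {example}" for name, _, example in columns]
    return "\n".join(lines)


def parse_response(response: str, columns: List[Column]) -> Dict[str, Optional[object]]:
    """Turn 'fieldname: value' lines into one row; N/A or bad casts become None."""
    found = {}
    for line in response.splitlines():
        # Split on the first colon only, so values such as URLs stay intact.
        key, sep, value = line.partition(":")
        if sep:
            found[key.strip()] = value.strip()
    row: Dict[str, Optional[object]] = {}
    for name, _, example in columns:
        value = found.get(name, "N/A")
        if value == "N/A":
            row[name] = None
            continue
        try:
            # Naive cast to the example's type; bool columns would need special handling.
            row[name] = type(example)(value)
        except ValueError:
            row[name] = None
    return row


columns = [
    ("name", "name of the user profile", "logicx"),
    ("country", "country the user comes from", "United States"),
    ("age", "age of the user", 30),
]
print(build_prompt(columns))
print(parse_response("name: alice\ncountry: N/A\nage: 41", columns))
```

One row per input tuple keeps the one-to-one mapping with the input relation; a per-column prompt would instead issue `len(columns)` LLM calls per tuple.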

hershd23 commented 9 months ago

Hey, @gaurav274 introduced me to this issue.

Seems interesting. Can I take it up?

xzdandy commented 9 months ago

Hi @hershd23, thanks for your interest! Yes!

hershd23 commented 9 months ago

https://github.com/hershd23/eva-structure-gpt

I have something up as a quick and dirty POC. It is mostly for testing the prompt, which I built incrementally. I think this is good enough to start work on the function itself.
