georgia-tech-db / evadb

Database system for AI-powered apps
https://evadb.ai/docs
Apache License 2.0

Introduce `EXTRACT_COLUMNS` to extract structured tables from unstructured text #1235

Open xzdandy opened 10 months ago

xzdandy commented 10 months ago

Description

EXTRACT_COLUMNS will be similar to EXTRACT_OBJECT for videos: it is not a standard user-defined function. Instead, the optimizer will translate it into a valid EvaDB query plan tree composed of multiple functions and operators.

Example Usage

EXTRACT_COLUMNS(
    "gpt-3.5-turbo",
    "faiss",
    [
        ["name", "name of the user profile", "logicx"],
        ["country", "country the user comes from", "United States"],
        ["age", "age of the user", 30]
    ],
    input_source
)
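
As a rough illustration of the argument shape above, the `[name, description, example]` triples could be validated before the optimizer sees them. This is a minimal sketch, not EvaDB code: `ColumnSpec` and `parse_column_specs` are hypothetical names, and inferring the column type from the example value is an assumption that the proposal does not state.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass(frozen=True)
class ColumnSpec:
    """One requested output column (hypothetical helper, not an EvaDB class)."""

    name: str
    description: str
    example: Any

    @property
    def type_name(self) -> str:
        # Assumption: the column type is inferred from the example value.
        return type(self.example).__name__


def parse_column_specs(raw: Sequence[Sequence[Any]]) -> List[ColumnSpec]:
    """Validate [name, description, example] triples and reject duplicates."""
    specs: List[ColumnSpec] = []
    seen = set()
    for entry in raw:
        if len(entry) != 3:
            raise ValueError(f"expected [name, description, example], got {entry!r}")
        name, description, example = entry
        if not isinstance(name, str) or not name:
            raise ValueError(f"column name must be a non-empty string: {name!r}")
        if name in seen:
            raise ValueError(f"duplicate column name: {name}")
        seen.add(name)
        specs.append(ColumnSpec(name, str(description), example))
    return specs


specs = parse_column_specs([
    ["name", "name of the user profile", "logicx"],
    ["country", "country the user comes from", "United States"],
    ["age", "age of the user", 30],
])
print([(s.name, s.type_name) for s in specs])
# [('name', 'str'), ('country', 'str'), ('age', 'int')]
```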

If we want to provide more fine-grained control, for example tuning hyperparameters, we can also introduce a CREATE FUNCTION statement, which would allow a key-value based configuration.
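
To make the key-value idea concrete, here is a small sketch of how user overrides might be merged with defaults. The `model` and `vector_store` values are taken from the example call; `temperature` and `batch_size` are purely hypothetical keys chosen for illustration, and `resolve_config` is not an EvaDB function.

```python
from typing import Any, Dict

# Hypothetical defaults; none of these keys are confirmed EvaDB options.
DEFAULT_CONFIG: Dict[str, Any] = {
    "model": "gpt-3.5-turbo",
    "vector_store": "faiss",
    "temperature": 0.0,
    "batch_size": 8,
}


def resolve_config(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user-supplied key-value pairs over the defaults."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        # Fail loudly on misspelled keys instead of silently ignoring them.
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


config = resolve_config({"temperature": 0.2})
print(config["model"], config["temperature"])
# gpt-3.5-turbo 0.2
```

Rejecting unknown keys is a design choice in this sketch: a silent typo in a hyperparameter name would otherwise be hard to notice.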

@gaurav274 @jiashenC Please provide feedback. Thanks.

pchunduri6 commented 10 months ago

[
    ["name", "name of the user profile", "logicx"], 
    ["country", "country the user comes from", "United States"]
], 

  1. How would this translate to the LLM prompt in the background? For example, one prompt per column, or a single prompt combining all columns?
  2. LLM extraction is brittle, so careful prompt engineering is required. Is it safe to use this structure without providing the option to engineer the prompt?
  3. With RAG queries, there is information loss, so accurate extraction will get trickier. Tracking the output accuracy could be challenging.

xzdandy commented 10 months ago

Hi @pchunduri6, very good feedback.

1) In the backend, this will be translated to something similar to what we have in the stargazers application (see the query below). Optimizations like batching will be applied accordingly. There are also new opportunities such as merging, for example when some columns are extracted in the predicate while others are extracted in the projection.

GPT35Azure("You are given a block of disorganized text extracted from the GitHub user profile of a user using an automated web scraper. The goal is to get structured results from this data.
                Extract the following fields from the text: name, country, city, email, occupation, programming_languages, topics_of_interest, social_media.
                If some field is not found, just output fieldname: N/A. Always return all the 8 field names. DO NOT add any additional text to your output.
                The topics_of_interest field must list a broad range of technical topics that are mentioned in any portion of the text.  This field is the most important, so add as much information as you can. Do not add non-technical interests.
                The programming_languages field can contain one or more programming languages out of only the following 4 programming languages - Python, C++, JavaScript, Java. Do not include any other language outside these 4 languages in the output. If the user is not interested in any of these 4 programming languages, output N/A.
                If the country is not available, use the city field to fill the country. For example, if the city is New York, fill the country as United States.
                If there are social media links, including personal websites, add them to the social media section. Do NOT add social media links that are not present.
                Here is an example (use it only for the output format, not for the content):

                name: logicx
                country: United States
                city: Atlanta
                email: abc@gatech.edu
                occupation: PhD student at Georgia Tech
                programming_languages: Python, Java
                topics_of_interest: Google Colab, fake data generation, Postgres
                social_media: https://www.logicx.io, https://www.twitter.com/logicx, https://www.linkedin.com/in/logicx
                ", stargazerscrapeddetails.extracted_text
                )

2) I have very similar thoughts. We can offer an option to supply a fully engineered prompt, but non-advanced users may not know how to write a proper prompt for this purpose. The proposed interface is simpler and more user friendly. I agree it can lose some accuracy, but power users can always write the fully customized query above in EvaDB. On this aspect, I am eager to see more feedback on the design.
3) The feedback on RAG is helpful. Is RAG useful for extracting column information, and if so, when? The current stargazers application does not use it, and EXTRACT_COLUMNS is also easier to implement without RAG. We need to weigh the effort against the gains.
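
For discussion, here is a minimal sketch of how the column list could be rendered into a prompt shaped like the stargazers one, and how the `fieldname: value` reply could be parsed back into a row. All function names (`build_prompt`, `build_prompts`, `parse_response`) are hypothetical, the wording of the generated prompt is only an approximation of the prompt above, and no LLM is actually called. The `per_column` flag illustrates the "one prompt per column" versus "single combined prompt" choice raised earlier in the thread.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str, object]  # (name, description, example value)


def build_prompt(columns: List[Column]) -> str:
    """Render one prompt that asks for every column in `columns`."""
    names = ", ".join(name for name, _, _ in columns)
    lines = [
        f"Extract the following fields from the text: {names}.",
        "If some field is not found, just output fieldname: N/A.",
        f"Always return all the {len(columns)} field names. "
        "DO NOT add any additional text to your output.",
        "Field descriptions:",
    ]
    lines += [f"- {name}: {description}" for name, description, _ in columns]
    lines.append(
        "Here is an example (use it only for the output format, not for the content):"
    )
    lines += [f"{name}: {example}" for name, _, example in columns]
    return "\n".join(lines)


def build_prompts(columns: List[Column], per_column: bool = False) -> List[str]:
    """Return one combined prompt, or one prompt per column."""
    if per_column:
        return [build_prompt([column]) for column in columns]
    return [build_prompt(columns)]


def parse_response(response: str, columns: List[Column]) -> Dict[str, str]:
    """Parse 'fieldname: value' lines; missing or unknown fields become N/A."""
    row = {name: "N/A" for name, _, _ in columns}
    for line in response.splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in row:
            row[key] = value.strip() or "N/A"
    return row


COLUMNS: List[Column] = [
    ("name", "name of the user profile", "logicx"),
    ("country", "country the user comes from", "United States"),
    ("age", "age of the user", 30),
]

print(len(build_prompts(COLUMNS)), len(build_prompts(COLUMNS, per_column=True)))
# 1 3
print(parse_response("name: logicx\ncountry: N/A", COLUMNS))
# {'name': 'logicx', 'country': 'N/A', 'age': 'N/A'}
```

A single combined prompt costs one LLM call per input row, while per-column prompts cost one call per column but allow columns to be extracted independently, for instance when only some appear in a predicate.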

hershd23 commented 9 months ago

Hey, @gaurav274 introduced me to this issue.

Seems interesting. Can I take it up?

xzdandy commented 9 months ago

Hi @hershd23, thanks for your interest! Yes!

hershd23 commented 9 months ago

https://github.com/hershd23/eva-structure-gpt

I have something up as a quick and dirty POC. It is mostly for testing the prompt, which I built incrementally. I think this is good enough to start work on the function itself.