ChakshuGautam commented 1 year ago

Project Details

Text2SQL is an application that allows users to interact with their data using natural language queries. Currently, it only supports SQL-based querying but the implementation is not limited to that. Text2SQL provides APIs to generate the appropriate query (SQL or otherwise) and return the data you need.

Features to be implemented

Token Optimization

Improve token usage with OpenAI

Alternate Models Evaluation

Models to be evaluated

[ ] WikiSQL has some models that can be used
[ ] Spider and SParC have some more of these
[ ] RAT is a brilliant implementation of this
[ ] #36
[ ] #37

Domain Mapping to Schema

[ ] Solve for cases where the DB/tables do not have intuitive names
[ ] Solve for cases where the data in a dataset is needed to figure out viable filters

Test Cases/Benchmarking

Add public test cases to test out the current model.

[ ] https://huggingface.co/datasets/wikisql

Learning Path

Complexity

Complex

Skills Required

Python, Knowledge of HuggingFace Transformers, NLP, SQL, Databases.

Name of Mentors:

@ChakshuGautam

Project size

8 Weeks

Product Set Up

See the setup here

Acceptance Criteria

[ ] Evaluation Matrix of Model vs Use Case
[ ] Solve for a single Education domain and test it on a new schema
[ ] Run test cases and update benchmarks
[ ] Token usage chart to be shared showing improvements on benchmarks with smaller prompts

C4GT

This issue is nominated for Code for GovTech (C4GT) 2023 edition. C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/

HemanthSai7 commented 1 year ago

Hello @ChakshuGautam , I am interested in contributing to this project. Could you please clarify the feature Alternate Model Evaluation? Does it mean trying out the models used in WikiSQL etc and reporting the results?

ChakshuGautam commented 1 year ago

Yes @HemanthSai7. But not with their data - it needs to be a complete cycle of training the model for a domain and seeing the results. Can you start with creating test data first? Also, we don't need to do this for all of them, just some promising ones. I am looking at 3 max based on a literature review - the ones that have evolved the most.

HemanthSai7 commented 1 year ago

Okay, I'll start by reading the WikiSQL paper. Can you please elaborate more on the test data?

fibonacci35813 commented 1 year ago

Hey @ChakshuGautam! I am looking forward to contributing to this project. It would be helpful if you could guide me in the initial phase.

ChakshuGautam commented 1 year ago

Hey guys, let me break this down further and share it by EoD today.

ManasaKaza commented 1 year ago

@ChakshuGautam waiting for the information and all the details

dixitdeeksha commented 1 year ago

Hey @ChakshuGautam !

I'm eager to contribute to the Text2SQL project and would appreciate your guidance in the initial phase. My experience in Python, SQL, Django, and SQLAlchemy, along with my research work, will be valuable assets. Looking forward to your assistance.

rishabhv471 commented 1 year ago

Hey @ChakshuGautam, I am looking forward to contributing to this project. It would be helpful if you could guide me in the initial phase.

suyashgautam commented 1 year ago

Hey @rishabhv471, you can start by setting the project up in a Gitpod environment or on your local machine. For Gitpod you can follow my video. For local setup you can follow the readme. If you face any issues or have any questions, you can ping the Discord channel or ping me. Will be happy to help. Looking forward to your contribution.

prajak002 commented 1 year ago

Hey @ChakshuGautam, I have gone through the requirements to be implemented in the upcoming mentorship program.

I am dividing my solution approach into two parts:

1. Dealing with databases where tables or columns do not have intuitive names, and cases where the data in a dataset is needed to figure out viable filters. These can be handled with techniques like domain mapping and data exploration. Here's an approach to address these scenarios:

Domain Mapping to Schema:

Create a mapping or knowledge base that links the non-intuitive names in the database to their corresponding domains or concepts.
Analyze the data and schema to understand the meaning and context of the non-intuitive names.
Use this mapping during the NLP-based query generation process to convert user queries involving non-intuitive names into appropriate SQL statements.
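The mapping idea above can be sketched roughly as follows. This is only an illustration: the table name `tbl_stu_01`, its column names, and the `describe_schema` helper are hypothetical and not part of the Text2SQL codebase. The sketch renders the mapping as annotated schema text that could be placed in a query-generation prompt.

```python
# Hypothetical mapping from opaque table/column names to domain concepts.
# None of these names come from the Text2SQL project; they are placeholders.
DOMAIN_MAP = {
    "tbl_stu_01": {
        "concept": "students",
        "columns": {
            "c1": "student name",
            "c2": "grade level",
            "sch_cd": "school code",
        },
    },
}


def describe_schema(domain_map):
    """Render the mapping as annotated schema text for a query-generation prompt."""
    lines = []
    for table, info in sorted(domain_map.items()):
        lines.append(f"Table {table} (stores {info['concept']}):")
        for column, meaning in info["columns"].items():
            lines.append(f"  - {column}: {meaning}")
    return "\n".join(lines)


schema_text = describe_schema(DOMAIN_MAP)
print(schema_text)
```

Keeping the mapping as plain data means it can be reviewed by someone who knows the domain without touching the query-generation code.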

-Data Exploration for Viable Filters:

Analyze the dataset and explore the data to understand its structure, relationships, and available filters.
Identify the relevant columns or attributes in the dataset that can be used as filters.

Collect these columns and their possible values into a filter catalog, and integrate that catalog with the NLP-based query generation process, allowing users to interactively select or suggest filters from the available options.
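One possible way to build such a filter catalog is sketched below, assuming a SQLite database for simplicity. The `build_filter_catalog` helper, the `max_distinct` threshold, and the `schools` table are assumptions made for illustration. The idea is to treat columns with only a few distinct values as viable filters and record those values.

```python
import sqlite3


def build_filter_catalog(conn, table, max_distinct=20):
    """Return {column: sorted distinct values} for low-cardinality columns."""
    catalog = {}
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    for column in columns:
        # Fetch one value more than the threshold to detect high cardinality.
        rows = conn.execute(
            f'SELECT DISTINCT "{column}" FROM "{table}" LIMIT ?',
            (max_distinct + 1,),
        ).fetchall()
        if len(rows) <= max_distinct:
            catalog[column] = sorted(r[0] for r in rows if r[0] is not None)
    return catalog


# Illustrative in-memory data; the table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, district TEXT, medium TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [("A", "Pune", "English"), ("B", "Pune", "Marathi"), ("C", "Nagpur", "English")],
)
catalog = build_filter_catalog(conn, "schools", max_distinct=2)
print(catalog)
```

With `max_distinct=2`, the `name` column is skipped because it has three distinct values, while `district` and `medium` are kept as candidate filters.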

-Test Cases/Benchmarking: To test the effectiveness of the model, we can leverage the WikiSQL dataset available on Hugging Face. I have uploaded my more detailed approach to the Unstop portal; the GitHub repo link is there.
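A benchmark run needs a scoring rule. A minimal sketch of execution accuracy (a generated query counts as correct when it returns the same rows as the reference query) could look like this. The helper names, the `students` table, and the two sample cases are hypothetical; loading the actual WikiSQL data is left out.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same rows, ignoring row order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid generated SQL counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


def execution_accuracy(conn, cases):
    """cases: iterable of (predicted_sql, gold_sql) pairs."""
    cases = list(cases)
    hits = sum(execution_match(conn, p, g) for p, g in cases)
    return hits / len(cases) if cases else 0.0


# Illustrative in-memory data; the table and queries are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, grade INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [("Asha", 8), ("Ravi", 9)])

cases = [
    # Different SQL text, same result rows: counted as a hit.
    ("SELECT name FROM students WHERE grade > 8",
     "SELECT name FROM students WHERE grade = 9"),
    # Misspelled column: the query fails to execute, counted as a miss.
    ("SELECT nam FROM students", "SELECT name FROM students"),
]
accuracy = execution_accuracy(conn, cases)
print(accuracy)
```

Comparing result rows rather than SQL strings avoids penalizing queries that are written differently but are equivalent on the test database.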

2. Improve token usage with OpenAI:

Token Count Monitoring: Keep track of the token count in your input text using OpenAI's tiktoken library or similar tools. This helps to estimate the token usage and manage it effectively.
Experiment with Model Parameters: Adjust the max_tokens parameter in the API call to set a specific token limit. This parameter caps the length of the generated completion, so a lower value keeps the response within the desired token budget; the prompt itself has to be shortened separately.

```python
import openai
import tiktoken


def optimize_tokens(text, max_tokens, model="text-davinci-003"):
    # Tokenize with the encoding that matches the target model
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    if len(tokens) <= max_tokens:
        # If the text is already within the token limit, return it as-is
        return text

    # Truncate on token boundaries, reserving room for the trailing "..."
    ellipsis_tokens = encoding.encode("...")
    kept = tokens[: max(max_tokens - len(ellipsis_tokens), 0)]
    return encoding.decode(kept) + "..."


# Example usage
openai.api_key = "YOUR_API_KEY"

input_text = """ This is a very long text that exceeds the token limit of the language model. We need to optimize the tokens to fit within the maximum allowed tokens. """

max_token_limit = 100

optimized_text = optimize_tokens(input_text, max_token_limit)
print("Optimized Text:", optimized_text)

# Make an API call with the optimized text (legacy openai<1.0 Completions interface)
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=optimized_text,
    max_tokens=max_token_limit,
)

print("Response:", response.choices[0].text)
```

I am looking forward to contributing more to this project, and I would be grateful if you could guide me in the initial phase.

AmanGadadare commented 1 year ago

@ChakshuGautam sir, I have submitted the proposal and am looking forward to working with you and contributing to this wonderful project.

ChakshuGautam commented 1 year ago

Hey guys, I am deleting the non-solution related messages from here.

Samagra-Development / Text2SQL

[C4GT] Performance, Cost Optimization, Benchmarking #28
