HKUSTDial / NL2SQL_Handbook

This is a continuously updated handbook for readers to easily track the latest NL2SQL (Text2SQL) techniques in the literature and provide practical guidance for researchers and practitioners.
https://arxiv.org/abs/2408.05109
218 stars 11 forks source link
awesome finetuning llms nl-to-code nl-to-sql nl2sql nlp nlp-resources survey text-to-sql text2sql tutorial

NL2SQL Handbook

From this repository, you can view the latest advancements in NL2SQL. This handbook corresponds to our survey paper: A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?. We also provide tutorial slides to summarize the key points of this survey. Based on the trends in the development of language models, we have created a river diagram of NL2SQL methods to trace the evolution of the NL2SQL field.

If you are a novice, don't worry—we have prepared a practical guide for you, covering a wide range of foundational materials here. We summarized NL2SQL related applications.

@misc{liu2024surveynl2sqllargelanguage,
      title={A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?}, 
      author={Xinyu Liu and Shuyu Shen and Boyan Li and Peixian Ma and Runzhi Jiang and Yuyu Luo and Yuxin Zhang and Ju Fan and Guoliang Li and Nan Tang},
      year={2024},
      eprint={2408.05109},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2408.05109}, 
}

🧭 NL2SQL Introduction

Translating users' natural language queries (NL) into SQL queries can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly improved with the emergence of language models (LMs). In this context, it is crucial to assess our current position, determine the NL2SQL solutions that should be adopted for specific scenarios by practitioners, and identify the research topics that researchers should explore next.

📈 NL2SQL Lifecycle

🤔 Where Are We?

we categorize the challenges of NL2SQL into five levels, each addressing specific hurdles. The first three levels cover challenges that have been or are currently being addressed, reflecting the progressive development of NL2SQL. The fourth level represents the challenges we aim to tackle in the LLMs stage, while the fifth level outlines our vision for NL2SQL system in the next five years.

We describe the evolution of NL2SQL solutions from the perspective of language models, categorizing it into four stages. For each stage of NL2SQL, we analyze the changes in target users and the extent to which challenges are addressed.

🧩 Module-based NL2SQL Methods

We summarize the key modules of NL2SQL solutions utilizing the language model.

📚 NL2SQL Survey & Tutorial

  1. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?
  2. Next-generation databas interfaces: A survey of llm-based text-to-sql.
  3. Large Language Model Enhanced Text-to-SQL Generation: A Survey.
  4. From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems.
  5. A Survey on Employing Large Language Models for Text-to-SQL Tasks.
  6. Natural language interfaces for tabular data querying and visualization: A survey.
  7. Natural Language Interfaces for Databases with Deep Learning.
  8. A survey on deep learning approaches for text-to-SQL.
  9. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect.
  10. A Deep Dive into Deep Learning Approaches for Text-to-SQL Systems.
  11. State of the Art and Open Challenges in Natural Language Interfaces to Data.
  12. Natural language to SQL: Where are we today?

📰 NL2SQL Paper List

  1. The Dawn of Natural Language to SQL: Are We Fully Ready?
  2. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation.
  3. Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation.
  4. Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models.
  5. ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems.
  6. CodeS: Towards Building Open-source Language Models for Text-to-SQL.
  7. FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis.
  8. PURPLE: Making a Large Language Model a Better SQL Writer.
  9. METASQL: A Generate-then-Rank Framework for Natural Language to SQL Translation.
  10. Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning.
  11. Synthesizing Text-to-SQL Data from Weak and Strong LLMs.
  12. Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark.
  13. I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation.
  14. PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL.
  15. Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning.
  16. Data-Centric Text-to-SQL with Large Language Models.
  17. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL.
  18. Towards Optimizing SQL Generation via LLM Routing.
  19. XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL.
  20. E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL.
  21. DB-GPT: Empowering Database Interactions with Private Large Language Models.
  22. The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models.
  23. DBCopilot: Scaling Natural Language Querying to Massive Databases.
  24. CHESS: Contextual Harnessing for Efficient SQL Synthesis.
  25. PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency.
  26. CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions.
  27. AMBROSIA: A Benchmark for Parsing Ambiguous Questions into Database Queries.
  28. Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning.
  29. CatSQL: Towards Real World Natural Language to SQL Applications.
  30. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction.
  31. Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL.
  32. ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought.
  33. Selective Demonstrations for Cross-domain Text-to-SQL.
  34. RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL.
  35. Graphix-T5: Mixing Pre-trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing.
  36. Improving Generalization in Language Model-based Text-to-SQL Semantic Parsing: Two Simple Semantic Boundary-based Techniques.
  37. G3R: A Graph-Guided Generate-and-Rerank Framework for Complex and Cross-domain Text-to-SQL Generation.
  38. Importance of Synthesizing High-quality Data for Text-to-SQL Parsing.
  39. Know What I don’t Know: Handling Ambiguous and Unknown Questions for Text-to-SQL.
  40. C3: Zero-shot Text-to-SQL with ChatGPT
  41. MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL.
  42. SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation.

📊 NL2SQL Benchmark

We create a timeline of the benchmark's development and mark relevant milestones. You can get more details from this chapter: 📊 Benchmark

🎯 Where Are We Going?

📖 Catalog for Our Survey

You can get more information from our subsection. We introduce representative papers on related concepts:

💾 Practical Guide for Novice

📊 How to get data:

🛠️ How to build an LLM-based NL2SQL model:

🔎How to evaluate your model:

We collect NL2SQL evaluation metrics for you. You can get more details from this chapter: Evaluation

🗺️ Roadmap and Decision Flow

You can get some inspiration from the Roadmap and Decision Flow.

📱 NL2SQL Related Applications: