Canner / WrenAI

🚀 An open-source SQL AI (Text-to-SQL) Agent that empowers data, product teams to chat with their data. 🤘
https://getwren.ai/oss
GNU Affero General Public License v3.0
2.04k stars 211 forks

chore(wren-ai-service): retrieval improvement #599

cyyeh closed 2 months ago

cyyeh commented 3 months ago

indexing pipeline:

  1. 3 collections: db_schema, table_descriptions, view_questions
  2. to work around the LLM token window limit during indexing, there is a new env var, COLUMN_INDEXING_BATCH_SIZE, that lets users decide how many columns are indexed in one document at a time
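
A minimal sketch of the column batching idea, not the code from this PR: `batch_columns` and the document shape (`table`, `columns`) are hypothetical names chosen for illustration, and the batch size would come from COLUMN_INDEXING_BATCH_SIZE.

```python
def batch_columns(table_name, columns, batch_size):
    """Split a table's columns into documents of at most batch_size columns each.

    Hypothetical helper: the document layout here is an assumption,
    not the schema used by wren-ai-service.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [
        {"table": table_name, "columns": columns[i:i + batch_size]}
        for i in range(0, len(columns), batch_size)
    ]
```

With a batch size of 50, a table with 120 columns would be indexed as three documents holding 50, 50, and 20 columns, so no single document has to carry the whole table.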

retrieval pipeline:

  1. select the top 10 (TABLE_RETRIEVAL_SIZE) tables based on table names and table descriptions (table_descriptions collection)
  2. select the top 1000 (TABLE_COLUMN_RETRIEVAL_SIZE) tables and columns based on the previous results (db_schema collection)
  3. use the LLM to choose which tables and columns are needed to answer the question
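
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline in this PR: `overlap_score` is a toy word-overlap stand-in for the real similarity search, `select_with_llm` is a hypothetical callable injected by the caller, and the dictionary shapes are made up for the example.

```python
def overlap_score(question, text):
    """Toy relevance score: count of shared lowercase words.

    Stand-in for vector similarity; an assumption for illustration only.
    """
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve(question, table_descriptions, db_schema, select_with_llm,
             table_retrieval_size=10, table_column_retrieval_size=1000):
    # Step 1: rank tables by name + description and keep the top N.
    ranked = sorted(
        table_descriptions,
        key=lambda t: overlap_score(question, f"{t['name']} {t['description']}"),
        reverse=True,
    )
    top_tables = {t["name"] for t in ranked[:table_retrieval_size]}

    # Step 2: keep only schema documents that belong to the tables from step 1,
    # capped at the table/column retrieval size.
    candidates = [d for d in db_schema if d["table"] in top_tables]
    candidates = candidates[:table_column_retrieval_size]

    # Step 3: hand the narrowed candidates to the LLM (an injected callable here)
    # so it can pick the tables and columns needed for the question.
    return select_with_llm(question, candidates)
```

Narrowing by table first keeps the second lookup and the LLM prompt limited to schema documents that belong to plausible tables.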

we also expose two env vars for table and column selection: TABLE_RETRIEVAL_SIZE and TABLE_COLUMN_RETRIEVAL_SIZE
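
A hedged sketch of how a service might read these two env vars: `read_int_env` is a hypothetical helper, and the fallbacks of 10 and 1000 are taken from the numbers quoted in the retrieval steps, assumed here to be the defaults.

```python
import os


def read_int_env(name, default, env=os.environ):
    """Read a positive integer setting, falling back to default when unset or blank.

    Hypothetical helper for illustration; not part of wren-ai-service.
    """
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)
    if value < 1:
        raise ValueError(f"{name} must be >= 1, got {value}")
    return value


# Assumed defaults, based on the values mentioned in the retrieval steps.
TABLE_RETRIEVAL_SIZE = read_int_env("TABLE_RETRIEVAL_SIZE", 10)
TABLE_COLUMN_RETRIEVAL_SIZE = read_int_env("TABLE_COLUMN_RETRIEVAL_SIZE", 1000)
```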