NeumTry / NeumAI

Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
https://neum.ai
Apache License 2.0
821 stars, 47 forks

Structured Search Pipeline #55

Open ddematheu opened 9 months ago

ddematheu commented 9 months ago

Query requirements in RAG cover more than unstructured data that has been embedded and added to a vector database. They also cover structured data sources, where semantic search doesn't really make sense.

Goal: Provide a pipeline interface that connects to a structured data source and generates queries against it in real time based on incoming search queries.

Implementation:

Alternative implementation:

sky-2002 commented 8 months ago

@ddematheu I haven't fully understood this yet, but the alternatives sound similar to the internals of this project - aidb. Could you give an example to elaborate? As far as I understand, we have some structured data sources, and we want to map a natural language query to an appropriate SQL query (or any structured query) using an LLM.

ddematheu commented 8 months ago

The thought process was: given a database, generate a set of common queries for it (based on the schema) using an LLM. From there, take the queries and their descriptions and embed them (embed the description). Then at runtime, when someone searches, we take the search, compare it against the embeddings, and use the stored query to query the database (or pass into a database for fine tuning based on the search).

It is a bit more similar to this https://github.com/vanna-ai/vanna.
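A minimal sketch of that flow, for illustration only. The stored pairs are hypothetical examples of what an LLM might generate from a schema, and `embed` is a toy bag-of-words stand-in for a real embedding model; `STORED_QUERIES`, `embed`, `cosine`, and `match_query` are names made up for this sketch, not part of Neum AI.

```python
import math
import re
from collections import Counter

# Hypothetical query/description pairs. In the flow described above, an LLM
# would generate these ahead of time from the database schema.
STORED_QUERIES = [
    {"description": "Group engineers by age",
     "query": "SELECT age, COUNT(*) FROM engineers GROUP BY age"},
    {"description": "List the names of all engineers",
     "query": "SELECT name FROM engineers"},
    {"description": "Find the oldest engineer",
     "query": "SELECT name, age FROM engineers ORDER BY age DESC LIMIT 1"},
]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Index time: embed only the description of each stored query.
INDEX = [(embed(pair["description"]), pair) for pair in STORED_QUERIES]


def match_query(search):
    """Runtime: return the stored pair whose description best matches the search."""
    search_vec = embed(search)
    return max(INDEX, key=lambda item: cosine(search_vec, item[0]))[1]


best = match_query("who is the oldest engineer")
print(best["query"])
```

The returned pair carries the SQL to run against the database. A real pipeline would likely swap `embed` for an embedding model and `INDEX` for a vector store, and would probably apply a similarity threshold before trusting the match.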

sky-2002 commented 8 months ago

@ddematheu Okay, so this is how I understood it, and I tried it on the t5-small-text-2-sql model:

# generate_sql: my helper around the t5-small-text-2-sql model (definition not shown here)
input_prompt = '''
tables:\n CREATE TABLE engineers (id: VARCHAR, name: TEXT, age: INT); \n 
query for: Group by the age 'column' 
'''
print("Generated SQL:")
generate_sql(input_prompt=input_prompt)

Output:

Generated SQL:
'SELECT name, age FROM engineers GROUP BY age'

So we would create query/description pairs and embed only the description, for example:

{'query': 'SELECT name, age FROM engineers GROUP BY age', 'description': 'Group by the age column'}

Is this what you meant?
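To make the runtime half concrete, here is a rough sketch of executing the stored query from such a pair once it has been matched. It assumes an in-memory SQLite database with made-up rows, and the table is created with standard column syntax rather than the `id: VARCHAR` form used in the prompt; `run_stored_query` is a hypothetical helper, not an existing API.

```python
import sqlite3

# The pair from the comment above, as it would be stored alongside its embedding.
pair = {
    "query": "SELECT name, age FROM engineers GROUP BY age",
    "description": "Group by the age column",
}


def run_stored_query(conn, pair):
    """Execute the SQL stored in a query/description pair and return all rows."""
    return conn.execute(pair["query"]).fetchall()


# Assumed setup: an in-memory database with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE engineers (id VARCHAR, name TEXT, age INT)")
conn.executemany(
    "INSERT INTO engineers VALUES (?, ?, ?)",
    [("1", "Ada", 36), ("2", "Linus", 28), ("3", "Grace", 36)],
)

rows = run_stored_query(conn, pair)
print(rows)  # one row per distinct age
```

SQLite accepts this query and picks an arbitrary `name` for each age group, while stricter engines such as PostgreSQL reject the non-aggregated `name` column. That suggests generated queries may need validation against the target database before they are stored.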

Update: I also tried this with a small CPU-ready LLM