cube-js / cube

📊 Cube — The Semantic Layer for Building Data Applications
https://cube.dev

Cube Operator for VectorSearch #8576

Open mauriciocirelli opened 1 month ago

mauriciocirelli commented 1 month ago

Dear community,

We need to perform similarity searches on our database, mainly to support AI use cases. Users may ask questions containing typos, and a similarity operator would handle them far better than the traditional equality/contains operators.

Since most popular databases, like Postgres and Mongo, have operators for performing vector similarity searches, I think Cube users would benefit greatly from it.

Please, kindly consider this feature request.

igorlukanin commented 1 month ago

Hi @mauriciocirelli 👋

Thanks for filing this!

Users may ask questions with typos...

So, do I understand it correctly, that you'd like to be able to respond to full-fledged "questions" rather than, say, to filter on a single field in a query? For questions, does the AI API look like a good fit?

mauriciocirelli commented 1 month ago

Hi @igorlukanin

We have been experimenting with this for almost a year, so the AI API wasn't available when we started. Now it is in beta and we are evaluating it. It works pretty well, but it does not fix typos in queries like that, so a similarity comparator is still needed.

For instance, I may ask a question about "Iggor". The AI API will generate a query with a filter on Name equals "Iggor", which would not match any records. The AI could be improved to use the similarity operator instead and create a query with a filter on Name similar to "Iggor", which would match "Igor" by similarity score. Or, we could use queryRewrite to change the equality comparison to a similarity comparison on the fly when the query comes from the AI. This is what we are doing right now, using a Similarity HTTP API we have designed. We could avoid using this custom HTTP API if Cube had a built-in similarity comparator.

EDIT:

We just need to figure out how to make those filters still use the pre-aggregation caches. The approach we have been using so far replaces the values in the filter but keeps the equality operator, so all queries still match the pre-aggregation caches.

It is important that this similarity operator still works with pre-aggregations. It may query the database for the similar values, but the final query should still be able to match a pre-aggregation.
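To make the value-replacement idea concrete, here is a minimal sketch of the kind of rewrite described above. It is not our actual implementation and not a Cube API: the `Filter` and `Query` types are a simplified, hypothetical subset of a query, the member name `users.name` is made up, and plain Levenshtein edit distance stands in for whatever similarity measure (vector or otherwise) is really used. The point is only that the filter values change while the operator stays `equals`.

```typescript
// Hypothetical, simplified shapes; a real Cube query has more fields.
interface Filter {
  member: string;
  operator: string;
  values: string[];
}

interface Query {
  measures?: string[];
  dimensions?: string[];
  filters?: Filter[];
}

// Levenshtein edit distance, computed with a single rolling row.
function levenshtein(a: string, b: string): number {
  const row: number[] = [];
  for (let j = 0; j <= b.length; j++) row.push(j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // distance for (i - 1, j - 1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // distance for (i - 1, j)
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Known values that are within maxDistance edits of the input (case-insensitive).
function similarValues(input: string, known: string[], maxDistance: number): string[] {
  const needle = input.toLowerCase();
  return known.filter((v) => levenshtein(needle, v.toLowerCase()) <= maxDistance);
}

// Replace the values of "equals" filters with similar known values.
// The operator is left untouched, so the rewritten query has the same
// shape as before and can still match a pre-aggregation.
function rewriteFilters(
  query: Query,
  knownValues: Record<string, string[]>, // assumed: member name -> existing values
  maxDistance: number = 1
): Query {
  const filters = (query.filters || []).map((f) => {
    const known = knownValues[f.member];
    if (f.operator !== "equals" || !known) return f;
    const matches: string[] = [];
    for (const v of f.values) {
      for (const m of similarValues(v, known, maxDistance)) {
        if (matches.indexOf(m) === -1) matches.push(m);
      }
    }
    // If nothing is similar enough, keep the original filter unchanged.
    return matches.length > 0 ? { ...f, values: matches } : f;
  });
  return { ...query, filters };
}
```

A function like `rewriteFilters` could be called from the `queryRewrite` hook. With known values `["Igor", "Ivan", "Olga"]` for the hypothetical `users.name` member, a filter on `"Iggor"` would be rewritten to an `equals` filter on `"Igor"`.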

Ultimately, Cube could fetch the possible values periodically from the database (using a refresh key or a special kind of pre-aggregation) and run the similarity search on this cache. This seems like a nice approach, as it would eliminate the need to write code to run similarity queries on all supported databases while also avoiding the overhead of hitting the database.
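As a rough illustration of that last idea (again hypothetical, not something Cube provides today), the cached values could sit behind a small time-based cache. The class name `DimensionValueCache`, its methods, and the 60-second interval are all assumptions made for this sketch. The caller is expected to fetch fresh values from the database whenever the cache reports itself as stale, and to run the similarity search on whatever `get()` returns.

```typescript
// Hypothetical in-memory cache of the distinct values of one dimension.
class DimensionValueCache {
  private ttlMs: number;
  private values: string[] = [];
  private loadedAt: number | null = null;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // True when nothing has been stored yet or the stored values are too old.
  isStale(now: number): boolean {
    return this.loadedAt === null || now - this.loadedAt >= this.ttlMs;
  }

  // Store freshly fetched values together with the time they were loaded.
  store(values: string[], now: number): void {
    this.values = values.slice();
    this.loadedAt = now;
  }

  // Return a copy of the cached values for the similarity search to scan.
  get(): string[] {
    return this.values.slice();
  }
}

// Usage: refresh at most once per minute (the interval is an assumption).
const nameCache = new DimensionValueCache(60 * 1000);
if (nameCache.isStale(Date.now())) {
  // In practice these values would come from a database query.
  nameCache.store(["Igor", "Ivan", "Olga"], Date.now());
}
```

Passing the current time in explicitly keeps the refresh logic easy to test and leaves the actual fetch, which would be asynchronous against a real database, to the caller.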