
Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Vector embeddings support in Pinot #10919

Open Aravind-Suresh opened 1 year ago

Aravind-Suresh commented 1 year ago

Creating this issue to initiate discussions about supporting vector embeddings in Pinot.

This write-up collates some initial thoughts about this. It isn't a design doc; we'll work on the design doc once we have high-level alignment.

siddharthteotia commented 1 year ago

Glad to see there are others thinking about this as well.

I had recently created a short internal proposal on why a case can be made for vector storage and indexing in Pinot.

I think the first thing we need to do is to get alignment/consensus within the community that it makes sense to do vector search in Pinot.

This is the internal Description and Business Justification we created. @jasperjiaguo can add more info.

Description

Vector embeddings are numerical representations (coordinates in a multi-dimensional space) typically produced by training a machine learning model. For example, training an LLM on text can produce billions of vector embeddings, which are distilled representations of the training data (text/words). The goal is to build optimal storage, indexing, and query execution capabilities for this kind of data.

Benefit / Use Case

This can be a crucial foundation for AI systems that leverage high-performance similarity indexing and analytics on vector embeddings for recommendation, image matching, pattern recognition, anomaly detection, etc.

Specifically, in the case of LLMs and prompt engineering pipelines, vector storage, indexing, and querying can be used to store and query domain-specific facts (captured during training, e.g., neural network learning), which can then be fed into NLP models, chatbots, conversational prompts, etc.

siddharthteotia commented 1 year ago

Would love to collaborate on this.

abhioncbr commented 1 year ago

This is interesting. +1

jasperjiaguo commented 1 year ago

Recommendation systems and Language Model (LLM) applications often utilize high-dimensional vector spaces to represent complex data like user profiles or linguistic patterns. Similarity-based vector indexing/search, a crucial element of these systems, identifies 'close' vectors in this space, signifying high similarity. This is commonly achieved through calculating the cosine similarity or Euclidean distance between vectors.
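For concreteness, a minimal numpy sketch of those two measures (the sample vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line (L2) distance; 0.0 means identical vectors."""
    return float(np.linalg.norm(a - b))

a = np.array([0.1, 0.9, 0.3])
b = np.array([0.2, 0.8, 0.4])
print(cosine_similarity(a, b))   # close to 1.0 -> very similar
print(euclidean_distance(a, b))  # close to 0.0 -> very similar
```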

For instance, (1) in recommendation systems, items similar to a user's past interests are identified and suggested. (2) Meanwhile, in LLM applications, instead of submitting a customer's prompt directly to the model, the question is first routed to the vector database (which can be considered the memory of the LLM), which retrieves the top 10 or 15 most relevant documents for that query. The vector database then bundles those supporting documents with the user's original question and submits the full package as the knowledge-context prompt to the LLM, which returns a more relevant answer. (https://mlops.community/combine-and-query-multiple-documents-with-llm/, https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/MilvusIndexDemo.html)
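That retrieval flow in (2) can be sketched end to end as below; the `embed()` function is a hypothetical stand-in for a real embedding model, and the in-memory array stands in for the vector database:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a real embedding model; just hashes
    # characters into a fixed-size unit vector for illustration.
    v = np.zeros(64)
    for i, ch in enumerate(text.encode()):
        v[i % 64] += ch
    return v / (np.linalg.norm(v) + 1e-9)

documents = ["Pinot supports realtime ingestion.",
             "HNSW is a graph-based ANN index.",
             "Segments are Pinot's unit of storage."]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(question: str, k: int = 2) -> list[str]:
    # Rank documents by cosine similarity (vectors are already unit-norm).
    scores = doc_vectors @ embed(question)
    return [documents[i] for i in np.argsort(-scores)[:k]]

question = "How does Pinot store data?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"  # sent to the LLM
print(prompt)
```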

However, given the potentially vast number of vectors, searching for the most similar ones can be computationally challenging. Therefore, Approximate Nearest Neighbor (ANN) algorithms, as implemented in libraries like FAISS, Annoy, or ScaNN, are employed to expedite this process by quickly finding near-nearest vectors in high-dimensional spaces.

https://milvus.io/docs/index.md

https://github.com/facebookresearch/faiss

https://www.datanami.com/2023/03/27/vector-databases-emerge-to-fill-critical-role-in-ai/

https://github.com/linkedin/venice#read-compute

Aravind-Suresh commented 1 year ago

Thanks for the inputs @siddharthteotia @jasperjiaguo - yes, given the high dimensionality of the embeddings (OpenAI davinci embeddings have >12k dimensions), it's practical to use approximate algorithms.

In addition to recommendation systems and vector-search-based prompts, there are also applications in semantic search and clustering (grouping related issues, text, etc.).

We recently tried powering automated Q&A via vector search (using vector-search-based prompts) and it achieves good precision on unstructured data input as well (we used langchain here: https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html).
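For reference, the pipeline looks roughly like the sketch below, following the linked Chroma example; exact APIs vary across langchain versions, so treat this as an approximation:

```python
# Sketch of a vector-search-based Q&A pipeline with langchain + Chroma.
from langchain.chains import RetrievalQA
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma

texts = ["...your unstructured documents..."]      # corpus to index
db = Chroma.from_texts(texts, OpenAIEmbeddings())  # embed + store in Chroma

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",            # stuff retrieved docs into the prompt
    retriever=db.as_retriever(),   # top-k vector search under the hood
)
print(qa.run("What does the corpus say about X?"))
```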

Given that new features are being powered via embeddings (Glean's AI-powered enterprise search is one recent example: https://www.glean.com/blog/unlocking-the-power-of-vector-search-in-enterprise), it would be good to evaluate how Pinot can support this in a real-time setup.

Looking forward to the collaboration here!

kishoreg commented 1 year ago

cc @KKcorps who is also thinking about it.

jasperjiaguo commented 1 year ago

@Aravind-Suresh Exactly. I've also been using llama_index and langchain with the chatgpt APIs. I think one usability addition to this feature may be to integrate a Pinot vector store with these Python packages, or to provide similarly powerful Python libs. Here is the list of vector stores llama_index supports: https://gpt-index.readthedocs.io/en/latest/how_to/integrations/vector_stores.html
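To make that integration concrete, a hypothetical Pinot-backed vector store adapter might look like the sketch below. The PinotVectorStore class, the table layout, and the query shape are illustrative assumptions, not an existing Pinot API; only the pinotdb client calls are real:

```python
# Hypothetical sketch of a Pinot-backed vector store for llama_index/langchain.
# The VECTOR_SIMILARITY predicate mirrors what later landed via PR #11977.
from dataclasses import dataclass

from pinotdb import connect  # Pinot's Python DB-API client

@dataclass
class PinotVectorStore:
    host: str
    port: int
    table: str

    def similarity_search(self, query_vector: list[float], k: int = 10) -> list[str]:
        conn = connect(host=self.host, port=self.port, path="/query/sql", scheme="http")
        cur = conn.cursor()
        vec = ", ".join(str(x) for x in query_vector)
        # Assumed table with a raw `document` column and an `embedding` vector column.
        cur.execute(
            f"SELECT document FROM {self.table} "
            f"WHERE VECTOR_SIMILARITY(embedding, ARRAY[{vec}], {k})"
        )
        return [row[0] for row in cur.fetchall()]
```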

xiangfu0 commented 1 year ago

cc: @kkrugler

xiangfu0 commented 1 year ago

Here are some takes from my side. High-level principles:

Considering the doc count in one segment is usually < 10MM, I think any of the current billion-scale approaches is more than sufficient for us.

In terms of implementation, take SPTAG (https://github.com/microsoft/SPTAG; paper: https://arxiv.org/pdf/2111.08566.pdf) as an example. We should definitely leverage existing libraries rather than re-invent the wheel.

During the index build phase, we need to build a per-segment SPTAG index, using hierarchical balanced clustering to generate a set of regions (centroids). We can configure the following two parameters:

During the query phase, the kNN search functionality should be configurable:
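Whatever the exact knobs end up being, the per-segment build/query split might look like the sketch below. It uses hnswlib as a stand-in ANN library (rather than SPTAG); the dimensions and counts are illustrative:

```python
import numpy as np
import hnswlib

dim, num_docs = 128, 100_000   # illustrative; well under ~10MM docs per segment

# Index build phase: build one ANN index per segment at segment-creation time.
vectors = np.random.random((num_docs, dim)).astype(np.float32)
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_docs, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_docs))  # ids map back to Pinot docIds

# Query phase: each segment answers a kNN query; the broker merges results.
index.set_ef(64)  # query-time recall/latency knob
query = np.random.random(dim).astype(np.float32)
doc_ids, distances = index.knn_query(query, k=10)
```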

KKcorps commented 1 year ago

IMO, a CPU-based solution would be too slow for vector search. Currently popular vector embeddings use 700- to 1536-dimensional floating-point arrays for a single object.

Computing similarity across millions of such objects at runtime for indexing is quite compute-heavy.
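A back-of-envelope calculation makes the cost of an exact (brute-force) scan concrete; the corpus size here is an assumption for illustration:

```python
# Rough cost of one exact (brute-force) similarity scan over float32 vectors.
dim = 1536             # upper end of the embedding sizes mentioned above
n = 1_000_000          # assumed corpus size, for illustration
flops = 2 * dim * n    # one multiply + one add per dimension per vector
scanned = 4 * dim * n  # bytes read per query (4 bytes per float32)
print(f"~{flops / 1e9:.1f} GFLOPs, ~{scanned / 1e9:.1f} GB scanned per query")
# -> ~3.1 GFLOPs and ~6.1 GB per query; memory bandwidth alone makes exact
#    CPU scans expensive at this scale, hence ANN indexes (or GPUs/SIMD).
```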

walterddr commented 1 year ago

CPU solutions only make sense in certain scenarios IMO, and I am not sure those scenarios fit here.

Specifically for Pinot: I know that most vector databases leverage an "inverted index" mechanism to speed up the ANN search. I don't think that's identical to the inverted index we have in Pinot, but we should see if the indexing framework introduced with the index SPI can be used.
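For context, that "inverted index" is usually an inverted file (IVF): vectors are clustered, each centroid keeps a posting list of its member vectors, and a query scans only the lists of the few nearest centroids. A minimal numpy sketch (sizes and the single clustering step are simplifications):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((5_000, 64), dtype=np.float32)

# Build: one clustering pass for brevity (real IVF runs k-means to convergence),
# then an inverted posting list: centroid id -> ids of its member vectors.
n_lists = 50
centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)]
assign = np.argmin(((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
posting_lists = {c: np.where(assign == c)[0] for c in range(n_lists)}

# Query: probe only the nprobe nearest lists instead of scanning all vectors.
def ivf_search(q: np.ndarray, k: int = 5, nprobe: int = 4) -> np.ndarray:
    nearest = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    candidates = np.concatenate([posting_lists[c] for c in nearest])
    dists = ((vectors[candidates] - q) ** 2).sum(-1)
    return candidates[np.argsort(dists)[:k]]  # ids of the k approximate NNs

print(ivf_search(rng.random(64, dtype=np.float32)))
```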

PeterCorless commented 3 months ago

See PR #11977.

Also see the release notes for Apache Pinot 1.1.

hpvd commented 3 months ago

The release video, Apache Pinot 1.1 | Overview of Latest Features and Updates, also talks about the vector index support brought by Support Vector index and HNSW as the first implementation #11977:

https://www.youtube.com/watch?v=wSwPtOajsGY&t=1m20s

hpvd commented 1 month ago

Related to the open pull request Vector data type in Pinot: https://github.com/apache/pinot/pull/11262