Aravind-Suresh opened 1 year ago
Glad to see there are others thinking about this as well.
I had recently created a short internal proposal making the case for vector storage and indexing in Pinot.
I think the first thing we need to do is to get alignment/consensus within the community that it makes sense to do vector search in Pinot.
Below is the internal description and business justification we created. @jasperjiaguo can add more info.
Description
Vector embeddings are numerical representations (coordinates in a multi-dimensional space) typically produced by training a machine learning model. For example, training an LLM on text can produce billions of vector embeddings, which are the distilled representation of the text/words in the training data. The goal is to build optimal storage, indexing, and query execution capabilities for this kind of data.
Benefit / Use Case
This can be a crucial foundation for AI systems that leverage high-performance similarity indexing and analytics on vector embeddings for recommendation, image matching, pattern recognition, anomaly detection, etc.
Specifically, in LLM and prompt engineering pipelines, vector storage, indexing, and querying can be used to store and retrieve domain-specific facts (e.g. distilled during training/neural network learning), which can then be fed into NLP models, chatbots, conversational prompts, etc.
Would love to collaborate on this.
This is interesting. +1
Recommendation systems and large language model (LLM) applications often utilize high-dimensional vector spaces to represent complex data like user profiles or linguistic patterns. Similarity-based vector indexing/search, a crucial element of these systems, identifies 'close' vectors in this space, signifying high similarity. This is commonly achieved by calculating the cosine similarity or Euclidean distance between vectors.
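To make the similarity measures concrete, here is a minimal numpy sketch (toy 4-dimensional vectors; real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (most similar), 0.0 = orthogonal, -1.0 = opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # smaller distance = more similar
    return float(np.linalg.norm(a - b))

user_profile = np.array([0.1, 0.9, 0.3, 0.0])
candidate_item = np.array([0.2, 0.8, 0.4, 0.1])
print(cosine_similarity(user_profile, candidate_item))  # close to 1.0
print(euclidean_distance(user_profile, candidate_item))
```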
For instance, (1) in recommendation systems, items similar to a user's past interests are identified and suggested. (2) Meanwhile, in LLM applications, instead of submitting a customer's prompt directly to the model, the question is first routed to the vector database (which can be thought of as the LLM's memory), which retrieves the top 10 or 15 most relevant documents for that query. The vector database then bundles those supporting documents with the user's original question and submits the full package as the knowledge-context prompt to the LLM, which returns a more relevant answer. (https://mlops.community/combine-and-query-multiple-documents-with-llm/, https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/MilvusIndexDemo.html)
However, given the potentially vast number of vectors, searching for the most similar ones can be computationally challenging. Therefore, Approximate Nearest Neighbor (ANN) libraries like FAISS, Annoy, or ScaNN are employed to expedite this process by quickly finding approximately nearest vectors in high-dimensional spaces (see the sketch after the links below).
https://milvus.io/docs/index.md
https://github.com/facebookresearch/faiss
https://www.datanami.com/2023/03/27/vector-databases-emerge-to-fill-critical-role-in-ai/
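To illustrate the trade-off with one of the libraries mentioned above, here is a hedged FAISS sketch comparing exact (brute-force) search with an approximate HNSW index; all sizes and parameters are illustrative:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 100_000
xb = np.random.random((n, d)).astype('float32')  # corpus embeddings
xq = np.random.random((1, d)).astype('float32')  # query embedding

# Exact search: scans all n vectors for every query.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
exact_dist, exact_ids = flat.search(xq, 5)

# Approximate search over an HNSW graph: far fewer distance computations
# per query, at the cost of occasionally missing a true nearest neighbor.
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = graph connectivity (M)
hnsw.hnsw.efSearch = 64            # query-time recall/latency knob
hnsw.add(xb)
ann_dist, ann_ids = hnsw.search(xq, 5)
```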
Thanks for the inputs @siddharthteotia @jasperjiaguo - yes, given the high dimensionality of the embeddings (OpenAI davinci embeddings have >12k dimensions), it's practical to use approximate algorithms.
In addition to recommendation systems and vector-search-based prompts, there are also applications in semantic search and clustering (grouping related issues/text).
We recently tried powering automated Q&A via vector search (using vector-search-based prompts), and it achieves good precision on unstructured input as well; we used langchain here (https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html) - a rough sketch of the flow follows.
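For context, here is a sketch of that flow, loosely following the linked Chroma example; the langchain API has evolved considerably since then, and the corpus file and question below are hypothetical:

```python
# Vector-search-based Q&A: embed documents, index them in Chroma,
# retrieve the top-k relevant chunks, and feed them to the LLM as context.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

docs = TextLoader("internal_faq.txt").load()  # hypothetical corpus
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(docs)

store = Chroma.from_documents(chunks, OpenAIEmbeddings())  # embed + index
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=store.as_retriever(search_kwargs={"k": 10}),  # top-10 context docs
)
print(qa.run("What is our refund policy?"))  # hypothetical question
```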
Given that new features are being powered via embeddings (Glean's AI-powered enterprise search is one recent example - https://www.glean.com/blog/unlocking-the-power-of-vector-search-in-enterprise), it would be good to evaluate how Pinot can support this in a real-time setup.
Looking forward to the collaboration here!
cc @KKcorps who is also thinking about it.
@Aravind-Suresh Exactly. I've also been using llama_index and langchain with the ChatGPT APIs. I think one usability addition to this feature may be to integrate a Pinot vector store with these Python packages, or to provide similarly powerful Python libs. Here is a list of the vector stores llama_index supports: https://gpt-index.readthedocs.io/en/latest/how_to/integrations/vector_stores.html
cc: @kkrugler
Here are some takes from my side. High-level principles:
Considering that the doc count in one segment is usually < 10MM, I think any of the current billion-scale approaches is sufficient for us.
In terms of implementation, take SPTAG (https://github.com/microsoft/SPTAG, paper: https://arxiv.org/pdf/2111.08566.pdf) as an example. We should definitely leverage existing libraries rather than re-invent the wheel.
During the index build phase, we need to build the SPTAG index on a per-segment basis, using hierarchical balanced clustering to generate a set of regions (centroids). The build parameters should be configurable (see the sketch below for analogous knobs).
During the query phase, the kNN search functionality should likewise expose tunable search parameters (again, see the sketch below).
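To make the per-segment build/query split concrete without committing to SPTAG's exact API, here is a sketch using hnswlib instead (the knobs below are hnswlib's, not SPTAG's, and the sizes are illustrative):

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, num_docs = 768, 100_000  # one segment; real segments stay < 10MM docs
vectors = np.random.random((num_docs, dim)).astype('float32')

# --- Index build phase (per segment) ---
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(
    max_elements=num_docs,
    M=16,                 # build-time knob: graph connectivity
    ef_construction=200,  # build-time knob: construction accuracy vs cost
)
index.add_items(vectors, np.arange(num_docs))

# --- Query phase ---
index.set_ef(100)         # query-time knob: recall vs latency
query = np.random.random(dim).astype('float32')
labels, distances = index.knn_query(query, k=10)
```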
IMO, a CPU-based solution would be too slow for vector search. The vector embeddings popular currently use floating-point arrays of length 700 to 1536 for a single object.
Computing similarity across millions of such objects at runtime, or during indexing, is quite compute-heavy; a rough estimate follows.
CPU solutions only make sense in certain scenarios, IMO, and I am not sure those fit here.
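A quick back-of-envelope calculation supports this point (assumed, illustrative sizes; Python used as a calculator):

```python
# Cost of brute-force similarity over a million 1536-dim float32 embeddings.
dim = 1536               # upper end of the 700-1536 range mentioned above
num_vectors = 1_000_000

storage_gb = num_vectors * dim * 4 / 1e9   # float32 = 4 bytes
flops_per_query = 2 * dim * num_vectors    # ~2 ops per dim per dot product

print(f"raw vector storage: {storage_gb:.1f} GB")            # ~6.1 GB
print(f"FLOPs per exhaustive query: {flops_per_query:.1e}")  # ~3.1e9
```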
Specifically for Pinot: I know that most vector databases leverage an "inverted index" mechanism to speed up the ANN search. I don't think that's identical to the inverted index we have in Pinot, but we should see whether the indexing framework introduced by index-spi can be used.
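For reference, the "inverted index" in vector databases usually means the inverted-file (IVF) structure: vectors are clustered, each centroid keeps an inverted list of the vectors assigned to it, and a query scans only the lists of its closest centroids. A hedged FAISS sketch:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 100_000
xb = np.random.random((n, d)).astype('float32')

nlist = 1024                       # number of clusters / inverted lists
quantizer = faiss.IndexFlatL2(d)   # assigns vectors to their nearest centroid
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                      # k-means to learn the centroids
ivf.add(xb)                        # each vector goes into one inverted list

ivf.nprobe = 8                     # scan only the 8 closest lists per query
xq = np.random.random((1, d)).astype('float32')
dist, ids = ivf.search(xq, 10)
```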
The release video "Apache Pinot 1.1 | Overview of Latest Features and Updates" also talks about the vector index support brought by "Support Vector index and HNSW as the first implementation" (#11977).
Related open pull request: "Vector data type in Pinot" (https://github.com/apache/pinot/pull/11262).
Creating this issue to initiate discussions about supporting vector embeddings in Pinot.
This write-up collates some initial thoughts about this. It isn't a design doc; we'll work on the design doc once we have high-level alignment.