[August 23, 2023] retriv 0.2.2 is out!
This release adds experimental support for multi-field documents and filters.
Please refer to the Advanced Retriever documentation.
[February 18, 2023] retriv 0.2.0 is out!
This release adds support for Dense and Hybrid Retrieval.
Dense Retrieval leverages the semantic similarity between the vector representations of queries and documents, which can be computed directly by retriv or imported from other sources.
Hybrid Retrieval mixes the results of traditional retrieval, informally called Sparse Retrieval, with those of Dense Retrieval to further improve retrieval effectiveness.
As the library was almost completely redone, indices built with previous versions are no longer supported.
retriv is a user-friendly and efficient search engine implemented in Python supporting Sparse (traditional search with BM25, TF-IDF), Dense (semantic search) and Hybrid retrieval (a mix of Sparse and Dense Retrieval). It allows you to build a search engine in a single line of code.
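To make the idea of mixing Sparse and Dense results concrete, here is a minimal sketch of one common fusion approach: min-max normalize each set of scores, then take a weighted sum. It is an illustration under stated assumptions and not necessarily how retriv's Hybrid Retriever works internally. The function names `min_max_normalize` and `fuse`, the `sparse_weight` parameter, and all scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores are equal, so every document gets the same value.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def fuse(sparse_scores, dense_scores, sparse_weight=0.5):
    """Weighted sum of normalized sparse and dense scores, best first."""
    sparse = min_max_normalize(sparse_scores)
    dense = min_max_normalize(dense_scores)
    fused = {
        # A document missing from one result list contributes 0 for that list.
        doc_id: sparse_weight * sparse.get(doc_id, 0.0)
        + (1.0 - sparse_weight) * dense.get(doc_id, 0.0)
        for doc_id in set(sparse) | set(dense)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores, for illustration only.
sparse_scores = {"doc_1": 0.7, "doc_2": 1.8}
dense_scores = {"doc_2": 0.55, "doc_3": 0.60, "doc_1": 0.20}
ranking = fuse(sparse_scores, dense_scores, sparse_weight=0.5)
```

Normalizing first matters because lexical and semantic scores usually live on different scales. Without it, whichever retriever produces larger raw numbers would dominate the sum regardless of `sparse_weight`.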
retriv is built upon Numba for high-speed vector operations and automatic parallelization, PyTorch and Transformers for easy access and usage of Transformer-based Language Models, and Faiss for approximate nearest neighbor search. In addition, it provides automatic tuning functionalities to allow you to tune its internal components with minimal intervention.
All the supported retrievers share the same search interface:
retriv automatically tunes the Faiss configuration for approximate nearest neighbor search by leveraging AutoFaiss to guarantee 10ms response time based on your available hardware. Moreover, it offers an automatic tuning functionality for BM25's parameters, which requires minimal user intervention. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one. Finally, it can automatically balance the importance of the lexical and semantic relevance scores computed by the Hybrid Retriever to maximize retrieval effectiveness.
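For readers unfamiliar with what tuning BM25 involves, the sketch below scores documents with a textbook BM25 formula and exposes the two parameters such tuning typically searches over: `k1` (term-frequency saturation) and `b` (document-length normalization). This is a simplified illustration, not retriv's implementation. It assumes plain whitespace tokenization with no stemming or stop-word removal, and the function name `bm25_scores` and the default values are assumptions made for the example. The four texts are borrowed from the usage example in this README.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with a textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents are penalized in proportion to b.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 controls how quickly repeated occurrences stop adding score.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "Generals gathered in their masses",
    "Just like witches at black masses",
    "Evil minds that plot destruction",
    "Sorcerer of death's construction",
]
scores = bm25_scores("witches masses", docs)
```

On this toy collection the second document, which contains both query terms, scores highest; the first, which contains only "masses", comes next; the other two score zero. An automatic tuner would repeat this kind of scoring for many `(k1, b)` pairs and keep the pair that maximizes an evaluation metric on labeled queries.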
python>=3.8
pip install retriv
# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine
collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]
se = SearchEngine("new-index").index(collection)
se.search("witches masses")
Output:
[
    {
        "id": "doc_2",
        "text": "Just like witches at black masses",
        "score": 1.7536403
    },
    {
        "id": "doc_1",
        "text": "Generals gathered in their masses",
        "score": 0.6931472
    }
]
Would you like to see other features implemented? Please open a feature request.
Would you like to contribute? Please drop me an e-mail.
retriv is open-source software licensed under the MIT license.