AmenRa / retriv

A Python Search Engine for Humans 🥸
MIT License

Minimal example for Hybrid Search fails #20

Closed cnndabbler closed 1 year ago

cnndabbler commented 1 year ago

First, I really like this project!

The respective sparse and dense examples work with minimal setup.

The issue is with the hybrid mode.

Here is the code:

from retriv import HybridRetriever

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

hr = HybridRetriever(
    # Shared params ------------------------------------------------------------
    index_name="hybrid-index",
    # Sparse retriever params --------------------------------------------------
    sr_model="bm25",
    min_df=1,
    tokenizer="whitespace",
    stemmer="english",
    stopwords="english",
    do_lowercasing=True,
    do_ampersand_normalization=True,
    do_special_chars_normalization=True,
    do_acronyms_normalization=True,
    do_punctuation_removal=True,
    # Dense retriever params ---------------------------------------------------
    dr_model="sentence-transformers/multi-qa-MiniLM-L6-dot-v1",
    normalize=True,
    max_length=128,
    use_ann=True,
)

he = hr.index(collection)
he.search(
    query="witches",   # What to search for
    return_docs=True,  # Default value, return the text of the documents
    cutoff=5,          # Number of results to return (default is 100)
)

Error:

Building TDF matrix: 100%|██████████| 4/4 [00:01<00:00,  3.41it/s]
Building inverted index: 100%|██████████| 13/13 [00:00<00:00, 6786.90it/s]
Embedding documents: 100%|██████████| 4/4 [00:00<00:00, 206.63it/s]
Building ANN Searcher
100%|██████████| 1/1 [00:00<00:00, 20661.60it/s]
100%|██████████| 1/1 [00:00<00:00, 99.58it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /tmp/ipykernel_45461/1793453458.py:32 in <module>                                                │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_45461/1793453458.py'                        │
│                                                                                                  │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/hybrid_retrieve │
│ r.py:255 in search                                                                               │
│                                                                                                  │
│   252 │   │   """                                                                                │
│   253 │   │                                                                                      │
│   254 │   │   sparse_results = self.sparse_retriever.search(query, False, 1_000)                 │
│ ❱ 255 │   │   dense_results = self.dense_retriever.search(query, False, 1_000)                   │
│   256 │   │   hybrid_results = self.merger.fuse([sparse_results, dense_results])                 │
│   257 │   │   return (                                                                           │
│   258 │   │   │   self.prepare_results(                                                          │
│                                                                                                  │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/dense_retriever │
│ /dense_retriever.py:251 in search                                                                │
│                                                                                                  │
│   248 │   │   │   │   self.load_embeddings()                                                     │
│   249 │   │   │   doc_ids, scores = compute_scores(encoded_query, self.embeddings, cutoff)       │
│   250 │   │                                                                                      │
│ ❱ 251 │   │   doc_ids = self.map_internal_ids_to_original_ids(doc_ids)                           │
│   252 │   │                                                                                      │
│   253 │   │   return (                                                                           │
│   254 │   │   │   self.prepare_results(doc_ids, scores)                                          │
│                                                                                                  │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever. │
│ py:87 in map_internal_ids_to_original_ids                                                        │
│                                                                                                  │
│    84 │   │   return results                                                                     │
│    85 │                                                                                          │
│    86 │   def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:            │
│ ❱  87 │   │   return [self.id_mapping[doc_id] for doc_id in doc_ids]                             │
│    88 │                                                                                          │
│    89 │   def save(self):                                                                        │
│    90 │   │   raise NotImplementedError()                                                        │
│                                                                                                  │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever. │
│ py:87 in <listcomp>                                                                              │
│                                                                                                  │
│    84 │   │   return results                                                                     │
│    85 │                                                                                          │
│    86 │   def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]:            │
│ ❱  87 │   │   return [self.id_mapping[doc_id] for doc_id in doc_ids]                             │
│    88 │                                                                                          │
│    89 │   def save(self):                                                                        │
│    90 │   │   raise NotImplementedError()                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: -1

cnndabbler commented 1 year ago

OK, making the following change lets the code run to completion:

    use_ann=False,

AmenRa commented 1 year ago

Hi, thanks for the kind words.

I suspect the issue is that four docs are not enough to build clusters with faiss. Strangely, it works for the dense but not the hybrid retriever.
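
For readers hitting the same `KeyError: -1`: Faiss pads its result arrays with `-1` when a search finds fewer neighbors than requested, and the traceback above shows that value reaching the `id_mapping` lookup. Whether that is the exact cause here is not confirmed in this thread. A minimal sketch of a guard, assuming internal IDs and scores arrive as parallel sequences; `map_ids_skipping_missing` is a hypothetical helper, not part of retriv:

```python
def map_ids_skipping_missing(doc_ids, scores, id_mapping):
    """Drop padded hits (internal id -1), then map the rest to original ids.

    Hypothetical helper for illustration only; retriv does not ship it.
    """
    # Keep only the (id, score) pairs whose internal id is a real position.
    kept = [(i, s) for i, s in zip(doc_ids, scores) if i >= 0]
    original_ids = [id_mapping[i] for i, _ in kept]
    kept_scores = [s for _, s in kept]
    return original_ids, kept_scores


# Toy mapping mirroring the four-document collection from the report.
id_mapping = {0: "doc_1", 1: "doc_2", 2: "doc_3", 3: "doc_4"}

# Two real hits followed by two padded slots (made-up scores).
ids, kept_scores = map_ids_skipping_missing(
    [1, 0, -1, -1], [0.9, 0.4, 0.0, 0.0], id_mapping
)
print(ids, kept_scores)
```

The padded slots are dropped together with their scores, so the two lists stay aligned for whatever consumes them next.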

Also, did I publish this example somewhere? I cannot find it in the documentation. :D

I know it is in the README, but it was only intended for the sparse retriever.

In general, if you have fewer than 20k documents, it does not make sense to use approximate nearest neighbors.
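
The rule of thumb above can be sketched as a small switch that decides the `use_ann` flag from the collection size. This is only an illustration: `should_use_ann` and `ANN_MIN_DOCS` are hypothetical names, and the 20k threshold is the rough figure from this comment, not a constant defined by retriv.

```python
# Rough threshold taken from the comment above; not a retriv constant.
ANN_MIN_DOCS = 20_000


def should_use_ann(n_docs, min_docs=ANN_MIN_DOCS):
    """Return True only when the collection is large enough for ANN to pay off."""
    return n_docs >= min_docs


collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

# With four documents the flag comes out False, matching the workaround above.
use_ann = should_use_ann(len(collection))
print(use_ann)
```

The resulting value can then be passed as `use_ann=use_ann` when constructing the `HybridRetriever`, so small collections fall back to exact search.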

AmenRa commented 1 year ago

Closing for inactivity. Feel free to re-open.