beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

BaseException with rerank - 'dict' object has no attribute 'strip' #53

Closed by pablogranolabar 2 years ago

pablogranolabar commented 2 years ago

Hi there,

I've been working on a dense IR pipeline with BEIR, including a custom dataloader. It works fine for dense IR runs but throws an exception whenever I add a cross-encoder for reranking.

Rerank:

cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-electra-base')
reranker = Rerank(cross_encoder_model, batch_size=128)

Dataloader:

corpus = {}

# .items() replaces .iteritems(), which was removed in pandas 2.0
for index, item in corpusdf.items():
    corpus["doc" + str(index)] = {
        "title": "",
        "text": item,
    }

queries = {}

for index, row in queriesdf.iterrows():
    queries.update({
        "q"+str(index): {
            "doc"+(str(index)): row[0],
            },
    })

qrels = {}

for i in range(len(df)):
    qrels.update({
        "q"+str(i): {
            "doc"+(str(i)): 1,
        },
    })

Exception:

Traceback (most recent call last):
  File "C:\Users\costco\venv\lib\site-packages\sentence_transformers\cross_encoder\CrossEncoder.py", line 273, in predict
    for features in iterator:
  File "C:\Users\costco\venv\lib\site-packages\tqdm\std.py", line 1180, in __iter__
    for obj in iterable:
  File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "C:\Users\costco\venv\lib\site-packages\sentence_transformers\cross_encoder\CrossEncoder.py", line 93, in smart_batching_collate_text_only
    texts[idx].append(text.strip())
AttributeError: 'dict' object has no attribute 'strip'

This seems like a simple fix, but I am trying to avoid modifying the BEIR sources. Any ideas would be greatly appreciated!
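
A possible cause, judging from the traceback: `text.strip()` is being called on a dict, and the only dict that reaches the cross-encoder as text here appears to be the query value. BEIR's own loaders represent queries as a flat mapping from query ID to query string, whereas the dataloader above stores a nested dict per query. The sketch below flattens such a mapping without touching the BEIR sources. The helper name `flatten_queries` and the sample data are hypothetical, and it assumes each nested dict holds exactly one query text.

```python
from typing import Dict, Union


def flatten_queries(
    queries: Dict[str, Union[str, Dict[str, str]]]
) -> Dict[str, str]:
    """Return a flat {query_id: query_text} mapping.

    Hypothetical helper: assumes every nested value holds exactly one text.
    """
    flat: Dict[str, str] = {}
    for qid, value in queries.items():
        if isinstance(value, dict):
            if len(value) != 1:
                raise ValueError(f"expected one text for {qid}, got {len(value)}")
            # Unwrap the single text stored under the document key.
            value = next(iter(value.values()))
        flat[qid] = str(value).strip()
    return flat


# Sample data shaped like the nested dict built in the dataloader above.
nested = {
    "q0": {"doc0": "what is dense retrieval?"},
    "q1": {"doc1": " how does reranking work? "},
}
queries = flatten_queries(nested)
```

Alternatively, the dataloader loop could assign the string directly, for example `queries["q" + str(index)] = row[0]`, so that every query value is already a plain string before it reaches the reranker.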