beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0
1.54k stars, 182 forks

Title needed when using retriever.retrieve after using GenericDataLoader(...).load_custom(...) #78

Open obelhumeur opened 2 years ago

obelhumeur commented 2 years ago

I have noticed an error when calling EvaluateRetrieval.retrieve() with the exact_search retriever on a corpus loaded through the load_custom() method of GenericDataLoader.

https://github.com/UKPLab/beir/blob/main/beir/datasets/data_loader.py -> GenericDataLoader: The documentation says that the title is optional, but a custom corpus without titles goes through the _load_corpus() method, where the following code is executed:

            self.corpus[line.get("_id")] = {
                "text": line.get("text"),
                "title": line.get("title"),
            }

More precisely: "title": line.get("title"),

If the title is not present, line.get("title") returns None by default.

Then in the following file: https://github.com/UKPLab/beir/blob/main/beir/retrieval/evaluation.py

It uses the method retrieve, which uses the search method:

return self.retriever.search(corpus, queries, self.top_k, self.score_function, **kwargs)

When using search.dense.exact_search.py as the retriever, the following code is executed:

corpus_ids = sorted(corpus, key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")), reverse=True)

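The problem is the following: when using GenericDataLoader(...).load_custom(...) on a corpus with no title field, the loader stores None as the title. Because the "title" key then exists in the document dict, corpus[k].get("title", "") returns that None instead of the "" default, so this code fails: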

corpus[k].get("title", "") + corpus[k].get("text", "")

It raises TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

To resolve this problem, I suggest changing the _load_corpus() method as follows (it worked in my case):
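The failure can be reproduced without BEIR installed. The snippet below is a minimal sketch, assuming a corpus dict shaped like the one _load_corpus() builds; the document ids and texts are made up for illustration.

```python
# Stand-in for a corpus loaded without a "title" field:
# the "title" key exists, but its value is None.
corpus = {
    "doc1": {"text": "first document", "title": None},
    "doc2": {"text": "a second, longer document", "title": None},
}

try:
    # Same sort key as quoted above. dict.get() only falls back to ""
    # when the key is missing, not when it maps to None.
    corpus_ids = sorted(
        corpus,
        key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")),
        reverse=True,
    )
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```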

            self.corpus[line.get("_id")] = {
                "text": line.get("text"),
                "title": line.get("title", ""),
            }

The title is then an empty string, so the following code runs without error:

corpus[k].get("title", "") + corpus[k].get("text", "")
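As a rough check of the suggested change, here is a self-contained sketch. The function load_corpus_lines is a hypothetical stand-in for the loop inside _load_corpus(), not BEIR's actual code, and the two JSON lines are invented sample data.

```python
import json

def load_corpus_lines(lines):
    """Hypothetical stand-in for the corpus-loading loop, with the proposed fix."""
    corpus = {}
    for raw in lines:
        line = json.loads(raw)
        corpus[line.get("_id")] = {
            "text": line.get("text"),
            # Proposed fix: fall back to an empty string when "title" is absent.
            "title": line.get("title", ""),
        }
    return corpus

lines = [
    '{"_id": "d1", "text": "short"}',                       # no title field
    '{"_id": "d2", "title": "T", "text": "a longer text"}',  # with title
]
corpus = load_corpus_lines(lines)

# The sort key from exact_search now works: longest documents first.
corpus_ids = sorted(
    corpus,
    key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")),
    reverse=True,
)
print(corpus_ids)
```

One caveat: a line that contains an explicit "title": null would still produce None with this fix, since the key is present. Writing line.get("title") or "" instead would cover that case as well.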

Olivier

buoi commented 1 year ago

I stumbled upon the same problem and totally support this. To make the code work I modified my dataset to have "title": "" for each corpus entry.

However, I find it ambiguous that "title" is used by default in evaluation when present, while with no title the default behavior (after the proposed fix) is to use the "text" only. I would also add a parameter to enable or disable use of the title. It should probably go in SentenceBERT.encode_corpus(use_title_if_present).
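To make the idea concrete, here is a minimal sketch of what such a switch could look like. The helper build_corpus_inputs and its sep argument are hypothetical and only illustrate the proposed use_title_if_present flag; this is not the existing SentenceBERT.encode_corpus signature.

```python
def build_corpus_inputs(corpus, use_title_if_present=True, sep=" "):
    """Hypothetical helper: build the strings that would be passed to the encoder."""
    inputs = []
    for doc in corpus:
        # Treat a missing or None title/text as an empty string.
        title = doc.get("title") or ""
        text = doc.get("text") or ""
        if use_title_if_present and title:
            inputs.append((title + sep + text).strip())
        else:
            inputs.append(text.strip())
    return inputs

docs = [
    {"title": "Python", "text": "A language."},
    {"title": None, "text": "No title."},
]
with_title = build_corpus_inputs(docs, use_title_if_present=True)
without_title = build_corpus_inputs(docs, use_title_if_present=False)
print(with_title)
print(without_title)
```

With the flag off, every document is encoded from its text alone, so corpora with and without titles are treated the same way.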