Title needed when using retriever.retrieve after using GenericDataLoader(...).load_custom(...)

I have noticed an error when calling EvaluateRetrieval.retrieve() with the exact_search retriever on a corpus loaded through the load_custom() method of GenericDataLoader.

https://github.com/UKPLab/beir/blob/main/beir/datasets/data_loader.py -> GenericDataLoader: The documentation says that the title is optional, but a custom corpus without titles still goes through the _load_corpus() method, where the following code is executed:

            self.corpus[line.get("_id")] = {
                "text": line.get("text"),
                "title": line.get("title"),
            }

More precisely: "title": line.get("title"),

If the title is not present, line.get("title") returns None, because no default value is given.
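
A minimal sketch of this behaviour, assuming a made-up JSONL line with no title field (the `line` and `doc` names are only for illustration and are not taken from the library):

```python
import json

# Hypothetical corpus line without a "title" field (illustration only).
line = json.loads('{"_id": "doc1", "text": "some passage"}')

doc = {
    "text": line.get("text"),
    "title": line.get("title"),  # key missing, no default given -> None
}

print(doc["title"] is None)  # the stored title is None
print("title" in doc)        # the key itself is present in the entry
```

The important detail is that the corpus entry ends up with a "title" key whose value is None, rather than with no "title" key at all.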

Then, in the following file: https://github.com/UKPLab/beir/blob/main/beir/retrieval/evaluation.py

The retrieve method calls the retriever's search method:

return self.retriever.search(corpus, queries, self.top_k, self.score_function, **kwargs)

When using search.dense.exact_search.py as the retriever, the following code is executed:

corpus_ids = sorted(corpus, key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")), reverse=True)

The problem is the following: when GenericDataLoader(...).load_custom(...) is used and there is no title field, the title is stored as None. Then, when this code runs:

corpus[k].get("title", "") + corpus[k].get("text", "")

it raises a TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'. The "title" key exists with the value None, so the "" default passed to get() is never used.
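
The failure can be reproduced without the library. This is a small sketch, assuming a hypothetical one-document corpus shaped the way the loader builds it when the title is missing:

```python
# Hypothetical corpus entry as stored when the input has no title
# (illustration only, not produced by the library here).
corpus = {"doc1": {"text": "some passage", "title": None}}

doc = corpus["doc1"]
# The "title" key exists, so the "" default passed to get() is ignored.
title = doc.get("title", "")

try:
    length = len(title + doc.get("text", ""))
    error = None
except TypeError as exc:
    error = exc

print(repr(error))
```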

What I suggest to resolve this problem (it worked in my case) is to change the code in the _load_corpus() method as follows:

            self.corpus[line.get("_id")] = {
                "text": line.get("text"),
                "title": line.get("title", ""),
            }

The title is then an empty string, so the following code runs without error:

corpus[k].get("title", "") + corpus[k].get("text", "")
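
As a rough check of the suggestion, the sketch below builds a small corpus with the proposed default and applies the same sort expression. The two JSONL lines are invented for illustration, and the loop only imitates the assignment quoted above; it is not the library's actual loader.

```python
import json

# Hypothetical JSONL lines (illustration only); the second has no title.
lines = [
    '{"_id": "d1", "title": "A title", "text": "short"}',
    '{"_id": "d2", "text": "a longer passage without a title"}',
]

corpus = {}
for raw in lines:
    line = json.loads(raw)
    corpus[line.get("_id")] = {
        "text": line.get("text"),
        "title": line.get("title", ""),  # proposed default
    }

# Same sort expression as the one quoted from exact_search.py.
corpus_ids = sorted(
    corpus,
    key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")),
    reverse=True,
)
print(corpus_ids)
```

Note that the default only covers a missing key. A line with an explicit "title": null would still be stored as None; writing line.get("title") or "" would cover that case as well.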

Olivier
