beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Sparse query vector #63

Open maximedb opened 2 years ago

maximedb commented 2 years ago

This PR builds upon #62.

It refactors the sparse search to represent queries and documents as CSR matrices. The SPARTA model is updated to fit this setup.
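
For context, here is a minimal, hypothetical sketch (not the PR's actual code) of how scoring works once queries and documents are stored as CSR matrices: relevance is just a sparse dot product over the shared vocabulary.

```python
# Minimal sketch, not the PR's code: sparse retrieval scoring with CSR matrices.
import numpy as np
from scipy.sparse import csr_matrix

# Toy data: 2 queries and 3 documents over a 5-term vocabulary (term weights).
queries = csr_matrix(np.array([
    [0.0, 1.2, 0.0, 0.5, 0.0],
    [0.7, 0.0, 0.0, 0.0, 0.9],
]))
documents = csr_matrix(np.array([
    [0.0, 0.8, 0.3, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.1],
]))

# One sparse matrix product yields all query-document scores at once.
scores = (queries @ documents.T).toarray()   # shape (n_queries, n_docs)
top_k = np.argsort(-scores, axis=1)[:, :2]   # top-2 document indices per query
print(scores)
print(top_k)
```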

It also adds a clean SPLADE model implementation along with evaluation code. The SPLADE authors used DenseRetrievalExactSearch in their demo script, but since SPLADE is labeled as a sparse model it should, in my opinion, use SparseSearch. The results are not directly comparable, as this checkpoint uses CoCondenser instead of DistilBERT as the base model. I could not find a URL to download the original model.
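
For reference, here is a hedged sketch of how a SPLADE-style encoder produces the sparse vectors such a SparseSearch would consume (max pooling over log(1 + ReLU(MLM logits)), as described in the SPLADE papers). The checkpoint name is only an example, not the one evaluated in this PR.

```python
# Sketch only: SPLADE-style sparse encoding with Hugging Face transformers.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # example checkpoint, not this PR's
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def splade_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                      # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits))                # log-saturated term weights
    weights = weights * inputs["attention_mask"].unsqueeze(-1)
    return weights.max(dim=1).values.squeeze(0)              # (vocab_size,), mostly zeros

vec = splade_vector("what is sparse retrieval?")
print(int((vec > 0).sum()), "non-zero vocabulary terms")
```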

Maxime.

cadurosar commented 2 years ago

Hi Maxime,

Thanks for integrating SPLADE using CSR matrices! I will be running it on my side and will let you know if it matches the numbers we have, dataset by dataset (for the CoCondenser version).

For the "original" SPLADE model, it is available here: https://github.com/naver/splade/tree/main/weights/distilsplade_max, but as individual files. We are working on making it available as a tar.gz as well.

maximedb commented 2 years ago

Hi Carlos,

> Thanks for integrating SPLADE using CSR matrices! I will be running it on my side and will let you know if it matches the numbers we have, dataset by dataset (for the CoCondenser version).

Nice, thanks!

> For the "original" SPLADE model, it is available here: https://github.com/naver/splade/tree/main/weights/distilsplade_max, but as individual files. We are working on making it available as a tar.gz as well.

Uploading the model to the Hugging Face Hub would also be possible (and easier to download).

thakur-nandan commented 2 years ago

Hi @maximedb,

Thanks again for making use of CSR matrices for SPLADE. I will have a look at the PR and merge it into beir soon.

A quick mention of a side project of mine: Sparse Retrieval (https://github.com/NThakur20/sparse-retrieval). We are currently developing a ready-to-use toolkit for efficient training and inference of neural sparse retrieval models such as SPLADE, SPARTA, uniCOIL, TILDE, and DeepImpact. The implementation of SPLADE with CSR matrices works; however, we find it better and more efficient to use an inverted index, such as the one provided by Pyserini. The project is planned to come out by the end of February! We will keep you updated.
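
As an illustration of the inverted-index route (not the toolkit's actual code), sparse vectors can be exported to Pyserini's JsonVectorCollection format by quantizing the term weights; file names and the quantization factor below are arbitrary choices.

```python
# Sketch: export sparse doc vectors as a JSONL collection for a Pyserini impact index.
import json
import os

def write_impact_collection(doc_vectors, out_dir="collection", quantization=100):
    """doc_vectors: dict mapping doc_id -> {term: weight}."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "docs.jsonl"), "w") as f:
        for doc_id, term_weights in doc_vectors.items():
            vector = {t: int(round(w * quantization)) for t, w in term_weights.items() if w > 0}
            f.write(json.dumps({"id": doc_id, "vector": vector}) + "\n")

write_impact_collection({
    "d1": {"sparse": 1.3, "retrieval": 0.8},
    "d2": {"inverted": 1.1, "index": 0.9},
})
# The JSONL can then be indexed with Pyserini's JsonVectorCollection
# (roughly: python -m pyserini.index.lucene --collection JsonVectorCollection
#  --input collection --index indexes/sparse --impact --pretokenized;
#  exact flags depend on the Pyserini version).
```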

We will reproduce the various sparse baselines and additionally upload the models to the HF Hub.

Kind Regards, Nandan Thakur

maximedb commented 2 years ago

Really cool stuff! The multi-GPU encoding is a super cool feature :-)

cadurosar commented 2 years ago

> Uploading the model to the Hugging Face Hub would also be possible (and easier to download).

We would love to, but we are still working out internally how we can do it. Here's a link to the "original" model, packaged in the same way as the new ones: https://download-de.europe.naverlabs.com/Splade_Release_Jan22/distilsplade_max.tar.gz
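
For anyone wanting to try it, a minimal sketch for fetching and unpacking the archive (paths are illustrative, and the extracted directory is assumed to be a standard transformers checkpoint):

```python
# Sketch: download and extract the released distilsplade_max checkpoint.
import tarfile
import urllib.request

url = "https://download-de.europe.naverlabs.com/Splade_Release_Jan22/distilsplade_max.tar.gz"
archive = "distilsplade_max.tar.gz"
urllib.request.urlretrieve(url, archive)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall("models/")  # extracted folder name depends on the archive layout;
                               # it should then be loadable with from_pretrained(...)
```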

> A quick mention of a side project of mine: Sparse Retrieval (https://github.com/NThakur20/sparse-retrieval). We are currently developing a ready-to-use toolkit for efficient training and inference of neural sparse retrieval models such as SPLADE, SPARTA, uniCOIL, TILDE, and DeepImpact. The implementation of SPLADE with CSR matrices works; however, we find it better and more efficient to use an inverted index, such as the one provided by Pyserini. The project is planned to come out by the end of February! We will keep you updated.

It looks really cool, Nandan. I've starred it and will keep an eye on it :)