bclavie / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.

Discrepancy in CPU Inference Latency: Cross-Encoder MiniLM Models vs. ColBERT #190

Open gaceladri opened 2 months ago

gaceladri commented 2 months ago

Greetings :wave:

I've been benchmarking CPU inference latency for various models and observed some significant differences. Specifically, I'm comparing the sentence_transformers 'cross-encoder/ms-marco-MiniLM-L-12-v2' against other models. The latency for top-10 re-ranking varies quite a bit, and I'm trying to understand whether this is expected behavior or whether there might be an issue with my setup. For clarity, here's a quick summary of the latencies I've recorded:

[attached image: table of recorded re-ranking latencies per model]

Could someone please shed some light on this? Is there a particular reason why the ColBERT model has over double the latency of MiniLM-L-12-v2? Any insights or suggestions for improving the inference speed of ColBERT on CPU would be greatly appreciated.
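
For context, here's roughly the shape of my timing harness (a minimal sketch: the query and documents below are placeholders rather than my real benchmark data, and I'm assuming RAGatouille's `rerank()` entry point):

```python
import time

from sentence_transformers import CrossEncoder
from ragatouille import RAGPretrainedModel

# Placeholder data -- the real benchmark uses my own corpus and queries.
query = "what is late interaction retrieval?"
docs = [f"candidate passage number {i}" for i in range(10)]  # top-10 rerank

# Cross-encoder reranking: every (query, doc) pair is scored jointly.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", device="cpu")
start = time.perf_counter()
cross_encoder.predict([(query, d) for d in docs])
print(f"MiniLM-L-12 rerank: {time.perf_counter() - start:.3f}s")

# ColBERT reranking via RAGatouille.
colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
start = time.perf_counter()
colbert.rerank(query=query, documents=docs, k=10)
print(f"ColBERTv2 rerank:   {time.perf_counter() - start:.3f}s")
```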

bclavie commented 2 months ago

I think this isn't shocking given how reranking with ColBERT works, though I'd expect it to be a bit quicker. The main factor at play here is model size: MiniLM-L-12 is just ~35M parameters (L-6 is around 22M for comparison), whereas ColBERTv2 is ~110M, a bit more than 3 times as big, which explains why it runs considerably slower.

The strength of ColBERT as a reranker, however, is that you can pre-compute document representations in advance, which you cannot do with cross-encoders. In such a set-up it'd run noticeably quicker than cross-encoder alternatives!
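
Roughly what that looks like (a minimal sketch, assuming the in-memory `encode()` / `search_encoded_docs()` helpers; the documents below are placeholders):

```python
from ragatouille import RAGPretrainedModel

colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# One-off cost, paid ahead of time: encode the candidate documents so their
# token-level representations are cached in memory.
documents = ["first candidate passage ...", "second candidate passage ..."]
colbert.encode(documents)

# At query time only the (short) query goes through the transformer; scoring
# is just MaxSim against the cached document embeddings, which is cheap.
results = colbert.search_encoded_docs(query="what is late interaction?", k=2)
print(results)
```

The point is that the heavy encoder work for the documents has already been paid before any query arrives, whereas a cross-encoder has to run the full model over every (query, document) pair at query time.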